Documentation
- 1: Kubernetes Documentation
- 2: Getting started
- 2.1: Learning environment
- 2.2: Production environment
- 2.2.1: Container runtimes
- 2.2.2: Installing Kubernetes with deployment tools
- 2.2.2.1: Bootstrapping clusters with kubeadm
- 2.2.2.1.1: Installing kubeadm
- 2.2.2.1.2: Troubleshooting kubeadm
- 2.2.2.1.3: Creating a cluster with kubeadm
- 2.2.2.1.4: Customizing components with the kubeadm API
- 2.2.2.1.5: Options for Highly Available Topology
- 2.2.2.1.6: Creating Highly Available Clusters with kubeadm
- 2.2.2.1.7: Set up a High Availability etcd Cluster with kubeadm
- 2.2.2.1.8: Configuring each kubelet in your cluster using kubeadm
- 2.2.2.1.9: Dual-stack support with kubeadm
- 2.2.2.2: Installing Kubernetes with kops
- 2.2.2.3: Installing Kubernetes with Kubespray
- 2.2.3: Turnkey Cloud Solutions
- 2.2.4: Windows in Kubernetes
- 2.3: Best practices
- 2.3.1: Considerations for large clusters
- 2.3.2: Running in multiple zones
- 2.3.3: Validate node setup
- 2.3.4: Enforcing Pod Security Standards
- 2.3.5: PKI certificates and requirements
- 3: Concepts
- 3.1: Overview
- 3.1.1: What is Kubernetes?
- 3.1.2: Kubernetes Components
- 3.1.3: The Kubernetes API
- 3.1.4: Working with Kubernetes Objects
- 3.1.4.1: Understanding Kubernetes Objects
- 3.1.4.2: Kubernetes Object Management
- 3.1.4.3: Object Names and IDs
- 3.1.4.4: Namespaces
- 3.1.4.5: Labels and Selectors
- 3.1.4.6: Annotations
- 3.1.4.7: Field Selectors
- 3.1.4.8: Finalizers
- 3.1.4.9: Owners and Dependents
- 3.1.4.10: Recommended Labels
- 3.2: Cluster Architecture
- 3.2.1: Nodes
- 3.2.2: Control Plane-Node Communication
- 3.2.3: Controllers
- 3.2.4: Cloud Controller Manager
- 3.2.5: Container Runtime Interface (CRI)
- 3.2.6: Garbage Collection
- 3.3: Containers
- 3.3.1: Images
- 3.3.2: Container Environment
- 3.3.3: Runtime Class
- 3.3.4: Container Lifecycle Hooks
- 3.4: Workloads
- 3.4.1: Pods
- 3.4.1.1: Pod Lifecycle
- 3.4.1.2: Init Containers
- 3.4.1.3: Pod Topology Spread Constraints
- 3.4.1.4: Disruptions
- 3.4.1.5: Ephemeral Containers
- 3.4.2: Workload Resources
- 3.4.2.1: Deployments
- 3.4.2.2: ReplicaSet
- 3.4.2.3: StatefulSets
- 3.4.2.4: DaemonSet
- 3.4.2.5: Jobs
- 3.4.2.6: Automatic Clean-up for Finished Jobs
- 3.4.2.7: CronJob
- 3.4.2.8: ReplicationController
- 3.5: Services, Load Balancing, and Networking
- 3.5.1: Service
- 3.5.2: Topology-aware traffic routing with topology keys
- 3.5.3: DNS for Services and Pods
- 3.5.4: Connecting Applications with Services
- 3.5.5: Ingress
- 3.5.6: Ingress Controllers
- 3.5.7: EndpointSlices
- 3.5.8: Service Internal Traffic Policy
- 3.5.9: Topology Aware Hints
- 3.5.10: Network Policies
- 3.5.11: IPv4/IPv6 dual-stack
- 3.6: Storage
- 3.6.1: Volumes
- 3.6.2: Persistent Volumes
- 3.6.3: Projected Volumes
- 3.6.4: Ephemeral Volumes
- 3.6.5: Storage Classes
- 3.6.6: Dynamic Volume Provisioning
- 3.6.7: Volume Snapshots
- 3.6.8: Volume Snapshot Classes
- 3.6.9: CSI Volume Cloning
- 3.6.10: Storage Capacity
- 3.6.11: Node-specific Volume Limits
- 3.6.12: Volume Health Monitoring
- 3.7: Configuration
- 3.7.1: Configuration Best Practices
- 3.7.2: ConfigMaps
- 3.7.3: Secrets
- 3.7.4: Resource Management for Pods and Containers
- 3.7.5: Organizing Cluster Access Using kubeconfig Files
- 3.8: Security
- 3.8.1: Overview of Cloud Native Security
- 3.8.2: Pod Security Standards
- 3.8.3: Pod Security Admission
- 3.8.4: Controlling Access to the Kubernetes API
- 3.9: Policies
- 3.9.1: Limit Ranges
- 3.9.2: Resource Quotas
- 3.9.3: Pod Security Policies
- 3.9.4: Process ID Limits And Reservations
- 3.9.5: Node Resource Managers
- 3.10: Scheduling, Preemption and Eviction
- 3.10.1: Kubernetes Scheduler
- 3.10.2: Assigning Pods to Nodes
- 3.10.3: Pod Overhead
- 3.10.4: Taints and Tolerations
- 3.10.5: Pod Priority and Preemption
- 3.10.6: Node-pressure Eviction
- 3.10.7: API-initiated Eviction
- 3.10.8: Resource Bin Packing for Extended Resources
- 3.10.9: Scheduling Framework
- 3.10.10: Scheduler Performance Tuning
- 3.11: Cluster Administration
- 3.11.1: Certificates
- 3.11.2: Managing Resources
- 3.11.3: Cluster Networking
- 3.11.4: Logging Architecture
- 3.11.5: Metrics For Kubernetes System Components
- 3.11.6: System Logs
- 3.11.7: Traces For Kubernetes System Components
- 3.11.8: Proxies in Kubernetes
- 3.11.9: API Priority and Fairness
- 3.11.10: Installing Addons
- 3.12: Extending Kubernetes
- 3.12.1: Extending the Kubernetes API
- 3.12.1.1: Custom Resources
- 3.12.1.2: Kubernetes API Aggregation Layer
- 3.12.2: Compute, Storage, and Networking Extensions
- 3.12.2.1: Network Plugins
- 3.12.2.2: Device Plugins
- 3.12.3: Operator pattern
- 3.12.4: Service Catalog
- 4: Tasks
- 4.1: Install Tools
- 4.1.1: Install and Set Up kubectl on Linux
- 4.1.2: Install and Set Up kubectl on macOS
- 4.1.3: Install and Set Up kubectl on Windows
- 4.1.4: Tools Included
- 4.1.4.1: bash auto-completion on Linux
- 4.1.4.2: bash auto-completion on macOS
- 4.1.4.3: fish auto-completion
- 4.1.4.4: kubectl-convert overview
- 4.1.4.5: PowerShell auto-completion
- 4.1.4.6: verify kubectl install
- 4.1.4.7: What's next?
- 4.1.4.8: zsh auto-completion
- 4.2: Administer a Cluster
- 4.2.1: Administration with kubeadm
- 4.2.1.1: Certificate Management with kubeadm
- 4.2.1.2: Configuring a cgroup driver
- 4.2.1.3: Upgrading kubeadm clusters
- 4.2.1.4: Adding Windows nodes
- 4.2.1.5: Upgrading Windows nodes
- 4.2.2: Migrating from dockershim
- 4.2.2.1: Changing the Container Runtime on a Node from Docker Engine to containerd
- 4.2.2.2: Find Out What Container Runtime is Used on a Node
- 4.2.2.3: Check whether Dockershim deprecation affects you
- 4.2.2.4: Migrating telemetry and security agents from dockershim
- 4.2.3: Certificates
- 4.2.4: Manage Memory, CPU, and API Resources
- 4.2.4.1: Configure Default Memory Requests and Limits for a Namespace
- 4.2.4.2: Configure Default CPU Requests and Limits for a Namespace
- 4.2.4.3: Configure Minimum and Maximum Memory Constraints for a Namespace
- 4.2.4.4: Configure Minimum and Maximum CPU Constraints for a Namespace
- 4.2.4.5: Configure Memory and CPU Quotas for a Namespace
- 4.2.4.6: Configure a Pod Quota for a Namespace
- 4.2.5: Install a Network Policy Provider
- 4.2.5.1: Use Antrea for NetworkPolicy
- 4.2.5.2: Use Calico for NetworkPolicy
- 4.2.5.3: Use Cilium for NetworkPolicy
- 4.2.5.4: Use Kube-router for NetworkPolicy
- 4.2.5.5: Romana for NetworkPolicy
- 4.2.5.6: Weave Net for NetworkPolicy
- 4.2.6: Access Clusters Using the Kubernetes API
- 4.2.7: Advertise Extended Resources for a Node
- 4.2.8: Autoscale the DNS Service in a Cluster
- 4.2.9: Change the default StorageClass
- 4.2.10: Change the Reclaim Policy of a PersistentVolume
- 4.2.11: Cloud Controller Manager Administration
- 4.2.12: Configure Quotas for API Objects
- 4.2.13: Control CPU Management Policies on the Node
- 4.2.14: Control Topology Management Policies on a node
- 4.2.15: Customizing DNS Service
- 4.2.16: Debugging DNS Resolution
- 4.2.17: Declare Network Policy
- 4.2.18: Developing Cloud Controller Manager
- 4.2.19: Enable Or Disable A Kubernetes API
- 4.2.20: Enabling Service Topology
- 4.2.21: Encrypting Secret Data at Rest
- 4.2.22: Guaranteed Scheduling For Critical Add-On Pods
- 4.2.23: IP Masquerade Agent User Guide
- 4.2.24: Limit Storage Consumption
- 4.2.25: Migrate Replicated Control Plane To Use Cloud Controller Manager
- 4.2.26: Namespaces Walkthrough
- 4.2.27: Operating etcd clusters for Kubernetes
- 4.2.28: Reconfigure a Node's Kubelet in a Live Cluster
- 4.2.29: Reserve Compute Resources for System Daemons
- 4.2.30: Running Kubernetes Node Components as a Non-root User
- 4.2.31: Safely Drain a Node
- 4.2.32: Securing a Cluster
- 4.2.33: Set Kubelet parameters via a config file
- 4.2.34: Share a Cluster with Namespaces
- 4.2.35: Upgrade A Cluster
- 4.2.36: Use Cascading Deletion in a Cluster
- 4.2.37: Using a KMS provider for data encryption
- 4.2.38: Using CoreDNS for Service Discovery
- 4.2.39: Using NodeLocal DNSCache in Kubernetes clusters
- 4.2.40: Using sysctls in a Kubernetes Cluster
- 4.2.41: Utilizing the NUMA-aware Memory Manager
- 4.3: Configure Pods and Containers
- 4.3.1: Assign Memory Resources to Containers and Pods
- 4.3.2: Assign CPU Resources to Containers and Pods
- 4.3.3: Configure GMSA for Windows Pods and containers
- 4.3.4: Configure RunAsUserName for Windows pods and containers
- 4.3.5: Create a Windows HostProcess Pod
- 4.3.6: Configure Quality of Service for Pods
- 4.3.7: Assign Extended Resources to a Container
- 4.3.8: Configure a Pod to Use a Volume for Storage
- 4.3.9: Configure a Pod to Use a PersistentVolume for Storage
- 4.3.10: Configure a Pod to Use a Projected Volume for Storage
- 4.3.11: Configure a Security Context for a Pod or Container
- 4.3.12: Configure Service Accounts for Pods
- 4.3.13: Pull an Image from a Private Registry
- 4.3.14: Configure Liveness, Readiness and Startup Probes
- 4.3.15: Assign Pods to Nodes
- 4.3.16: Assign Pods to Nodes using Node Affinity
- 4.3.17: Configure Pod Initialization
- 4.3.18: Attach Handlers to Container Lifecycle Events
- 4.3.19: Configure a Pod to Use a ConfigMap
- 4.3.20: Share Process Namespace between Containers in a Pod
- 4.3.21: Create static Pods
- 4.3.22: Translate a Docker Compose File to Kubernetes Resources
- 4.3.23: Enforce Pod Security Standards by Configuring the Built-in Admission Controller
- 4.3.24: Enforce Pod Security Standards with Namespace Labels
- 4.3.25: Migrate from PodSecurityPolicy to the Built-In PodSecurity Admission Controller
- 4.4: Manage Kubernetes Objects
- 4.4.1: Declarative Management of Kubernetes Objects Using Configuration Files
- 4.4.2: Declarative Management of Kubernetes Objects Using Kustomize
- 4.4.3: Managing Kubernetes Objects Using Imperative Commands
- 4.4.4: Imperative Management of Kubernetes Objects Using Configuration Files
- 4.4.5: Update API Objects in Place Using kubectl patch
- 4.5: Managing Secrets
- 4.5.1: Managing Secrets using kubectl
- 4.5.2: Managing Secrets using Configuration File
- 4.5.3: Managing Secrets using Kustomize
- 4.6: Inject Data Into Applications
- 4.6.1: Define a Command and Arguments for a Container
- 4.6.2: Define Dependent Environment Variables
- 4.6.3: Define Environment Variables for a Container
- 4.6.4: Expose Pod Information to Containers Through Environment Variables
- 4.6.5: Expose Pod Information to Containers Through Files
- 4.6.6: Distribute Credentials Securely Using Secrets
- 4.7: Run Applications
- 4.7.1: Run a Stateless Application Using a Deployment
- 4.7.2: Run a Single-Instance Stateful Application
- 4.7.3: Run a Replicated Stateful Application
- 4.7.4: Scale a StatefulSet
- 4.7.5: Delete a StatefulSet
- 4.7.6: Force Delete StatefulSet Pods
- 4.7.7: Horizontal Pod Autoscaling
- 4.7.8: HorizontalPodAutoscaler Walkthrough
- 4.7.9: Specifying a Disruption Budget for your Application
- 4.7.10: Accessing the Kubernetes API from a Pod
- 4.8: Run Jobs
- 4.8.1: Running Automated Tasks with a CronJob
- 4.8.2: Coarse Parallel Processing Using a Work Queue
- 4.8.3: Fine Parallel Processing Using a Work Queue
- 4.8.4: Indexed Job for Parallel Processing with Static Work Assignment
- 4.8.5: Parallel Processing using Expansions
- 4.9: Access Applications in a Cluster
- 4.9.1: Deploy and Access the Kubernetes Dashboard
- 4.9.2: Accessing Clusters
- 4.9.3: Configure Access to Multiple Clusters
- 4.9.4: Use Port Forwarding to Access Applications in a Cluster
- 4.9.5: Use a Service to Access an Application in a Cluster
- 4.9.6: Connect a Frontend to a Backend Using Services
- 4.9.7: Create an External Load Balancer
- 4.9.8: List All Container Images Running in a Cluster
- 4.9.9: Set up Ingress on Minikube with the NGINX Ingress Controller
- 4.9.10: Communicate Between Containers in the Same Pod Using a Shared Volume
- 4.9.11: Configure DNS for a Cluster
- 4.9.12: Access Services Running on Clusters
- 4.10: Monitoring, Logging, and Debugging
- 4.10.1: Application Introspection and Debugging
- 4.10.2: Auditing
- 4.10.3: Debug a StatefulSet
- 4.10.4: Debug Init Containers
- 4.10.5: Debug Pods and ReplicationControllers
- 4.10.6: Debug Running Pods
- 4.10.7: Debug Services
- 4.10.8: Debugging Kubernetes nodes with crictl
- 4.10.9: Determine the Reason for Pod Failure
- 4.10.10: Developing and debugging services locally
- 4.10.11: Get a Shell to a Running Container
- 4.10.12: Monitor Node Health
- 4.10.13: Resource metrics pipeline
- 4.10.14: Tools for Monitoring Resources
- 4.10.15: Troubleshoot Applications
- 4.10.16: Troubleshoot Clusters
- 4.10.17: Troubleshooting
- 4.11: Extend Kubernetes
- 4.11.1: Configure the Aggregation Layer
- 4.11.2: Use Custom Resources
- 4.11.2.1: Extend the Kubernetes API with CustomResourceDefinitions
- 4.11.2.2: Versions in CustomResourceDefinitions
- 4.11.3: Set up an Extension API Server
- 4.11.4: Configure Multiple Schedulers
- 4.11.5: Use an HTTP Proxy to Access the Kubernetes API
- 4.11.6: Set up Konnectivity service
- 4.12: TLS
- 4.12.1: Configure Certificate Rotation for the Kubelet
- 4.12.2: Manage TLS Certificates in a Cluster
- 4.12.3: Manual Rotation of CA Certificates
- 4.13: Manage Cluster Daemons
- 4.14: Service Catalog
- 4.14.1: Install Service Catalog using Helm
- 4.14.2: Install Service Catalog using SC
- 4.15: Networking
- 4.16: Configure a kubelet image credential provider
- 4.17: Extend kubectl with plugins
- 4.18: Manage HugePages
- 4.19: Schedule GPUs
- 5: Tutorials
- 5.1: Hello Minikube
- 5.2: Learn Kubernetes Basics
- 5.2.1: Create a Cluster
- 5.2.1.1: Using Minikube to Create a Cluster
- 5.2.1.2: Interactive Tutorial - Creating a Cluster
- 5.2.2: Deploy an App
- 5.2.2.1: Using kubectl to Create a Deployment
- 5.2.2.2: Interactive Tutorial - Deploying an App
- 5.2.3: Explore Your App
- 5.2.3.1: Viewing Pods and Nodes
- 5.2.3.2: Interactive Tutorial - Exploring Your App
- 5.2.4: Expose Your App Publicly
- 5.2.4.1: Using a Service to Expose Your App
- 5.2.4.2: Interactive Tutorial - Exposing Your App
- 5.2.5: Scale Your App
- 5.2.6: Update Your App
- 5.2.6.1: Performing a Rolling Update
- 5.2.6.2: Interactive Tutorial - Updating Your App
- 5.3: Configuration
- 5.3.1: Example: Configuring a Java Microservice
- 5.3.1.1: Externalizing config using MicroProfile, ConfigMaps and Secrets
- 5.3.1.2: Interactive Tutorial - Configuring a Java Microservice
- 5.3.2: Configuring Redis using a ConfigMap
- 5.4: Security
- 5.4.1: Apply Pod Security Standards at the Cluster Level
- 5.4.2: Apply Pod Security Standards at the Namespace Level
- 5.4.3: Restrict a Container's Access to Resources with AppArmor
- 5.4.4: Restrict a Container's Syscalls with seccomp
- 5.5: Stateless Applications
- 5.5.1: Exposing an External IP Address to Access an Application in a Cluster
- 5.5.2: Example: Deploying PHP Guestbook application with Redis
- 5.6: Stateful Applications
- 5.6.1: StatefulSet Basics
- 5.6.2: Example: Deploying WordPress and MySQL with Persistent Volumes
- 5.6.3: Example: Deploying Cassandra with a StatefulSet
- 5.6.4: Running ZooKeeper, A Distributed System Coordinator
- 5.7: Services
- 5.7.1: Using Source IP
- 6: Reference
- 6.1: Glossary
- 6.2: API Overview
- 6.2.1: Kubernetes API Concepts
- 6.2.2: Server-Side Apply
- 6.2.3: Client Libraries
- 6.2.4: Kubernetes Deprecation Policy
- 6.2.5: Deprecated API Migration Guide
- 6.2.6: Kubernetes API health endpoints
- 6.3: API Access Control
- 6.3.1: Authenticating
- 6.3.2: Authenticating with Bootstrap Tokens
- 6.3.3: Certificate Signing Requests
- 6.3.4: Using Admission Controllers
- 6.3.5: Dynamic Admission Control
- 6.3.6: Managing Service Accounts
- 6.3.7: Authorization Overview
- 6.3.8: Using RBAC Authorization
- 6.3.9: Using ABAC Authorization
- 6.3.10: Using Node Authorization
- 6.3.11: Mapping PodSecurityPolicies to Pod Security Standards
- 6.3.12: Webhook Mode
- 6.4: Well-Known Labels, Annotations and Taints
- 6.4.1: Audit Annotations
- 6.5: Kubernetes API
- 6.5.1: Workload Resources
- 6.5.1.1: Pod
- 6.5.1.2: PodTemplate
- 6.5.1.3: ReplicationController
- 6.5.1.4: ReplicaSet
- 6.5.1.5: Deployment
- 6.5.1.6: StatefulSet
- 6.5.1.7: ControllerRevision
- 6.5.1.8: DaemonSet
- 6.5.1.9: Job
- 6.5.1.10: CronJob
- 6.5.1.11: HorizontalPodAutoscaler
- 6.5.1.12: HorizontalPodAutoscaler
- 6.5.1.13: HorizontalPodAutoscaler v2beta2
- 6.5.1.14: PriorityClass
- 6.5.2: Service Resources
- 6.5.2.1: Service
- 6.5.2.2: Endpoints
- 6.5.2.3: EndpointSlice
- 6.5.2.4: Ingress
- 6.5.2.5: IngressClass
- 6.5.3: Config and Storage Resources
- 6.5.3.1: ConfigMap
- 6.5.3.2: Secret
- 6.5.3.3: Volume
- 6.5.3.4: PersistentVolumeClaim
- 6.5.3.5: PersistentVolume
- 6.5.3.6: StorageClass
- 6.5.3.7: VolumeAttachment
- 6.5.3.8: CSIDriver
- 6.5.3.9: CSINode
- 6.5.3.10: CSIStorageCapacity v1beta1
- 6.5.4: Authentication Resources
- 6.5.4.1: ServiceAccount
- 6.5.4.2: TokenRequest
- 6.5.4.3: TokenReview
- 6.5.4.4: CertificateSigningRequest
- 6.5.5: Authorization Resources
- 6.5.5.1: LocalSubjectAccessReview
- 6.5.5.2: SelfSubjectAccessReview
- 6.5.5.3: SelfSubjectRulesReview
- 6.5.5.4: SubjectAccessReview
- 6.5.5.5: ClusterRole
- 6.5.5.6: ClusterRoleBinding
- 6.5.5.7: Role
- 6.5.5.8: RoleBinding
- 6.5.6: Policy Resources
- 6.5.6.1: LimitRange
- 6.5.6.2: ResourceQuota
- 6.5.6.3: NetworkPolicy
- 6.5.6.4: PodDisruptionBudget
- 6.5.6.5: PodSecurityPolicy v1beta1
- 6.5.7: Extend Resources
- 6.5.7.1: CustomResourceDefinition
- 6.5.7.2: MutatingWebhookConfiguration
- 6.5.7.3: ValidatingWebhookConfiguration
- 6.5.8: Cluster Resources
- 6.5.8.1: Node
- 6.5.8.2: Namespace
- 6.5.8.3: Event
- 6.5.8.4: APIService
- 6.5.8.5: Lease
- 6.5.8.6: RuntimeClass
- 6.5.8.7: FlowSchema v1beta2
- 6.5.8.8: PriorityLevelConfiguration v1beta2
- 6.5.8.9: Binding
- 6.5.8.10: ComponentStatus
- 6.5.9: Common Definitions
- 6.5.9.1: DeleteOptions
- 6.5.9.2: LabelSelector
- 6.5.9.3: ListMeta
- 6.5.9.4: LocalObjectReference
- 6.5.9.5: NodeSelectorRequirement
- 6.5.9.6: ObjectFieldSelector
- 6.5.9.7: ObjectMeta
- 6.5.9.8: ObjectReference
- 6.5.9.9: Patch
- 6.5.9.10: Quantity
- 6.5.9.11: ResourceFieldSelector
- 6.5.9.12: Status
- 6.5.9.13: TypedLocalObjectReference
- 6.5.10: Common Parameters
- 6.6: Kubernetes Issues and Security
- 6.7: Node Reference Information
- 6.8: Ports and Protocols
- 6.9: Setup tools
- 6.9.1: Kubeadm
- 6.9.1.1: Kubeadm Generated
- 6.9.1.2: kubeadm init
- 6.9.1.3: kubeadm join
- 6.9.1.4: kubeadm upgrade
- 6.9.1.5: kubeadm config
- 6.9.1.6: kubeadm reset
- 6.9.1.7: kubeadm token
- 6.9.1.8: kubeadm version
- 6.9.1.9: kubeadm alpha
- 6.9.1.10: kubeadm certs
- 6.9.1.11: kubeadm init phase
- 6.9.1.12: kubeadm join phase
- 6.9.1.13: kubeadm kubeconfig
- 6.9.1.14: kubeadm reset phase
- 6.9.1.15: kubeadm upgrade phase
- 6.9.1.16: Implementation details
- 6.10: Command line tool (kubectl)
- 6.10.1: kubectl Cheat Sheet
- 6.10.2: kubectl Commands
- 6.10.3: kubectl
- 6.10.4: JSONPath Support
- 6.10.5: kubectl for Docker Users
- 6.10.6: kubectl Usage Conventions
- 6.11: Component tools
- 6.11.1: Feature Gates
- 6.11.2: kubelet
- 6.11.3: kube-apiserver
- 6.11.4: kube-controller-manager
- 6.11.5: kube-proxy
- 6.11.6: kube-scheduler
- 6.11.7: Kubelet authentication/authorization
- 6.11.8: TLS bootstrapping
- 6.12: Configuration APIs
- 6.12.1: Client Authentication (v1)
- 6.12.2: Client Authentication (v1beta1)
- 6.12.3: kube-apiserver Audit Configuration (v1)
- 6.12.4: kube-apiserver Configuration (v1)
- 6.12.5: kube-apiserver Configuration (v1alpha1)
- 6.12.6: kube-apiserver Encryption Configuration (v1)
- 6.12.7: kube-proxy Configuration (v1alpha1)
- 6.12.8: kube-scheduler Configuration (v1beta2)
- 6.12.9: kube-scheduler Configuration (v1beta3)
- 6.12.10: kubeadm Configuration (v1beta2)
- 6.12.11: kubeadm Configuration (v1beta3)
- 6.12.12: Kubelet Configuration (v1alpha1)
- 6.12.13: Kubelet Configuration (v1beta1)
- 6.12.14: Kubelet CredentialProvider (v1alpha1)
- 6.12.15: WebhookAdmission Configuration (v1)
- 6.13: Scheduling
- 6.13.1: Scheduler Configuration
- 6.13.2: Scheduling Policies
- 6.14: Other Tools
- 6.14.1: Mapping from dockercli to crictl
- 7: Contribute to K8s docs
- 7.1: Suggesting content improvements
- 7.2: Contributing new content
- 7.2.1: Opening a pull request
- 7.2.2: Documenting a feature for a release
- 7.2.3: Submitting blog posts and case studies
- 7.3: Reviewing changes
- 7.3.1: Reviewing pull requests
- 7.3.2: Reviewing for approvers and reviewers
- 7.4: Localizing Kubernetes documentation
- 7.5: Participating in SIG Docs
- 7.5.1: Roles and responsibilities
- 7.5.2: PR wranglers
- 7.6: Documentation style overview
- 7.6.1: Documentation Content Guide
- 7.6.2: Documentation Style Guide
- 7.6.3: Diagram Guide
- 7.6.4: Writing a new topic
- 7.6.5: Page content types
- 7.6.6: Content organization
- 7.6.7: Custom Hugo Shortcodes
- 7.7: Reference Docs Overview
- 7.7.1: Contributing to the Upstream Kubernetes Code
- 7.7.2: Quickstart
- 7.7.3: Generating Reference Documentation for the Kubernetes API
- 7.7.4: Generating Reference Documentation for kubectl Commands
- 7.7.5: Generating Reference Pages for Kubernetes Components and Tools
- 7.8: Advanced contributing
- 7.9: Viewing Site Analytics
- 8: Docs smoke test page
1 - Kubernetes Documentation
1.1 - Available Documentation Versions
This website contains documentation for the current version of Kubernetes and the four previous versions.
2 - Getting started
This section lists the different ways to set up and run Kubernetes. When you install Kubernetes, choose an installation type based on: ease of maintenance, security, control, available resources, and expertise required to operate and manage a cluster.
You can download Kubernetes to deploy a Kubernetes cluster on a local machine, into the cloud, or for your own datacenter.
If you don't want to manage a Kubernetes cluster yourself, you could pick a managed service, including certified platforms. There are also other standardized and custom solutions across a wide range of cloud and bare metal environments.
Learning environment
If you're learning Kubernetes, use the tools supported by the Kubernetes community, or tools in the ecosystem to set up a Kubernetes cluster on a local machine. See Install tools.
Production environment
When evaluating a solution for a production environment, consider which aspects of operating a Kubernetes cluster (or abstractions) you want to manage yourself and which you prefer to hand off to a provider.
For a cluster you're managing yourself, the officially supported tool for deploying Kubernetes is kubeadm.
What's next
- Download Kubernetes
- Download and install tools including
kubectl
- Select a container runtime for your new cluster
- Learn about best practices for cluster setup
Kubernetes is designed for its control plane to run on Linux. Within your cluster you can run applications on Linux or other operating systems, including Windows.
- Learn to set up clusters with Windows nodes
2.1 - Learning environment
2.2 - Production environment
A production-quality Kubernetes cluster requires planning and preparation. If your Kubernetes cluster is to run critical workloads, it must be configured to be resilient. This page explains steps you can take to set up a production-ready cluster, or to promote an existing cluster for production use. If you're already familiar with production setup and want the links, skip to What's next.
Production considerations
Typically, a production Kubernetes cluster environment has more requirements than a personal learning, development, or test environment. A production environment may require secure access by many users, consistent availability, and the resources to adapt to changing demands.
As you decide where you want your production Kubernetes environment to live (on premises or in a cloud) and the amount of management you want to take on or hand to others, consider how your requirements for a Kubernetes cluster are influenced by the following issues:
- Availability: A single-machine Kubernetes learning environment has a single point of failure. Creating a highly available cluster means considering:
- Separating the control plane from the worker nodes.
- Replicating the control plane components on multiple nodes.
- Load balancing traffic to the cluster’s API server.
- Having enough worker nodes available, or able to quickly become available, as changing workloads warrant it.
- Scale: If you expect your production Kubernetes environment to receive a stable amount of demand, you might be able to set up for the capacity you need and be done. However, if you expect demand to grow over time or change dramatically based on things like season or special events, you need to plan how to scale to relieve increased pressure from more requests to the control plane and worker nodes or scale down to reduce unused resources.
- Security and access management: You have full admin privileges on your own Kubernetes learning cluster. But shared clusters with important workloads, and more than one or two users, require a more refined approach to who and what can access cluster resources. You can use role-based access control (RBAC) and other security mechanisms to make sure that users and workloads can get access to the resources they need, while keeping workloads, and the cluster itself, secure. You can set limits on the resources that users and workloads can access by managing policies and container resources.
Before building a Kubernetes production environment on your own, consider handing off some or all of this job to Turnkey Cloud Solutions providers or other Kubernetes Partners. Options include:
- Serverless: Just run workloads on third-party equipment without managing a cluster at all. You will be charged for things like CPU usage, memory, and disk requests.
- Managed control plane: Let the provider manage the scale and availability of the cluster's control plane, as well as handle patches and upgrades.
- Managed worker nodes: Configure pools of nodes to meet your needs, then the provider makes sure those nodes are available and ready to implement upgrades when needed.
- Integration: There are providers that integrate Kubernetes with other services you may need, such as storage, container registries, authentication methods, and development tools.
Whether you build a production Kubernetes cluster yourself or work with partners, review the following sections to evaluate your needs as they relate to your cluster’s control plane, worker nodes, user access, and workload resources.
Production cluster setup
In a production-quality Kubernetes cluster, the control plane manages the cluster from services that can be spread across multiple computers in different ways. Each worker node, however, represents a single entity that is configured to run Kubernetes pods.
Production control plane
The simplest Kubernetes cluster has the entire control plane and worker node services running on the same machine. You can grow that environment by adding worker nodes, as illustrated in the diagram in Kubernetes Components. If the cluster is meant to be available for a short period of time, or can be discarded if something goes seriously wrong, this might meet your needs.
If you need a more permanent, highly available cluster, however, you should consider ways of extending the control plane. By design, control plane services running on a single machine are not highly available. If keeping the cluster up and running and ensuring that it can be repaired if something goes wrong is important, consider these steps:
- Choose deployment tools: You can deploy a control plane using tools such as kubeadm, kops, and kubespray. See Installing Kubernetes with deployment tools to learn tips for production-quality deployments using each of those deployment methods. Different Container Runtimes are available to use with your deployments.
- Manage certificates: Secure communications between control plane services are implemented using certificates. Certificates are automatically generated during deployment or you can generate them using your own certificate authority. See PKI certificates and requirements for details.
- Configure load balancer for apiserver: Configure a load balancer to distribute external API requests to the apiserver service instances running on different nodes. See Create an External Load Balancer for details.
- Separate and backup etcd service: The etcd services can either run on the same machines as other control plane services or run on separate machines, for extra security and availability. Because etcd stores cluster configuration data, backing up the etcd database should be done regularly to ensure that you can repair that database if needed. See the etcd FAQ for details on configuring and using etcd. See Operating etcd clusters for Kubernetes and Set up a High Availability etcd cluster with kubeadm for details.
- Create multiple control plane systems: For high availability, the control plane should not be limited to a single machine. If the control plane services are run by an init service (such as systemd), each service should run on at least three machines. However, running control plane services as pods in Kubernetes ensures that the replicated number of services that you request will always be available. The scheduler should be fault tolerant, but not highly available. Some deployment tools set up the Raft consensus algorithm to do leader election of Kubernetes services. If the primary goes away, another service elects itself and takes over.
- Span multiple zones: If keeping your cluster available at all times is critical, consider creating a cluster that runs across multiple data centers, referred to as zones in cloud environments. Groups of zones are referred to as regions. By spreading a cluster across multiple zones in the same region, you can improve the chances that your cluster will continue to function even if one zone becomes unavailable. See Running in multiple zones for details.
- Manage on-going features: If you plan to keep your cluster over time, there are tasks you need to do to maintain its health and security. For example, if you installed with kubeadm, there are instructions to help you with Certificate Management and Upgrading kubeadm clusters. See Administer a Cluster for a longer list of Kubernetes administrative tasks.
To learn about available options when you run control plane services, see kube-apiserver, kube-controller-manager, and kube-scheduler component pages. For highly available control plane examples, see Options for Highly Available topology, Creating Highly Available clusters with kubeadm, and Operating etcd clusters for Kubernetes. See Backing up an etcd cluster for information on making an etcd backup plan.
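As a starting point for that backup plan, here is a minimal etcd snapshot sketch, assuming etcdctl v3, a local etcd member listening on port 2379, and kubeadm's default certificate paths (adjust endpoints and paths for your topology):
# Take a snapshot of the etcd database on a control plane node
ETCDCTL_API=3 etcdctl snapshot save /var/backups/etcd-snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
Run snapshots on a schedule and copy them off the node; see Operating etcd clusters for Kubernetes for restore steps.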
Production worker nodes
Production-quality workloads need to be resilient and anything they rely on needs to be resilient (such as CoreDNS). Whether you manage your own control plane or have a cloud provider do it for you, you still need to consider how you want to manage your worker nodes (also referred to simply as nodes).
- Configure nodes: Nodes can be physical or virtual machines. If you want to create and manage your own nodes, you can install a supported operating system, then add and run the appropriate Node services. Consider:
- The demands of your workloads when you set up nodes by having appropriate memory, CPU, and disk speed and storage capacity available.
- Whether generic computer systems will do or you have workloads that need GPU processors, Windows nodes, or VM isolation.
- Validate nodes: See Validate node setup for information on how to ensure that a node meets the requirements to join a Kubernetes cluster.
- Add nodes to the cluster: If you are managing your own cluster you can add nodes by setting up your own machines and either adding them manually or having them register themselves to the cluster’s apiserver. See the Nodes section for information on how to set up Kubernetes to add nodes in these ways.
- Add Windows nodes to the cluster: Kubernetes offers support for Windows worker nodes, allowing you to run workloads implemented in Windows containers. See Windows in Kubernetes for details.
- Scale nodes: Have a plan for expanding the capacity your cluster will eventually need. See Considerations for large clusters to help determine how many nodes you need, based on the number of pods and containers you need to run. If you are managing nodes yourself, this can mean purchasing and installing your own physical equipment.
- Autoscale nodes: Most cloud providers support Cluster Autoscaler to replace unhealthy nodes or grow and shrink the number of nodes as demand requires. See the Frequently Asked Questions for how the autoscaler works and Deployment for how it is implemented by different cloud providers. For on-premises, there are some virtualization platforms that can be scripted to spin up new nodes based on demand.
- Set up node health checks: For important workloads, you want to make sure that the nodes and pods running on those nodes are healthy. Using the Node Problem Detector daemon, you can ensure your nodes are healthy. A quick manual check is shown after this list.
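For a quick manual inspection of node health, you can read the node conditions with kubectl (the node name and grep window here are illustrative):
# List nodes and their Ready status
kubectl get nodes
# Show the Conditions block (MemoryPressure, DiskPressure, Ready, ...) for one node
kubectl describe node node-1 | grep -A 10 'Conditions:'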
Production user management
In production, you may be moving from a model where you or a small group of people are accessing the cluster to where there may potentially be dozens or hundreds of people. In a learning environment or platform prototype, you might have a single administrative account for everything you do. In production, you will want more accounts with different levels of access to different namespaces.
Taking on a production-quality cluster means deciding how you want to selectively allow access by other users. In particular, you need to select strategies for validating the identities of those who try to access your cluster (authentication) and deciding if they have permissions to do what they are asking (authorization):
- Authentication: The apiserver can authenticate users using client certificates, bearer tokens, an authenticating proxy, or HTTP basic auth. You can choose which authentication methods you want to use. Using plugins, the apiserver can leverage your organization’s existing authentication methods, such as LDAP or Kerberos. See Authentication for a description of these different methods of authenticating Kubernetes users.
- Authorization: When you set out to authorize your regular users, you will probably choose between RBAC and ABAC authorization. See Authorization Overview to review different modes for authorizing user accounts (as well as service account access to your cluster):
- Role-based access control (RBAC): Lets you assign access to your cluster by allowing specific sets of permissions to authenticated users. Permissions can be assigned for a specific namespace (Role) or across the entire cluster (ClusterRole). Then using RoleBindings and ClusterRoleBindings, those permissions can be attached to particular users.
- Attribute-based access control (ABAC): Lets you create policies based on resource attributes in the cluster and will allow or deny access based on those attributes. Each line of a policy file identifies versioning properties (apiVersion and kind) and a map of spec properties to match the subject (user or group), resource property, non-resource property (/version or /apis), and readonly. See Examples for details.
As someone setting up authentication and authorization on your production Kubernetes cluster, here are some things to consider:
- Set the authorization mode: When the Kubernetes API server (kube-apiserver) starts, the supported authorization modes must be set using the --authorization-mode flag. For example, that flag in the kube-apiserver.yaml file (in /etc/kubernetes/manifests) could be set to Node,RBAC. This would allow Node and RBAC authorization for authenticated requests.
- Create user certificates and role bindings (RBAC): If you are using RBAC authorization, users can create a CertificateSigningRequest (CSR) that can be signed by the cluster CA. Then you can bind Roles and ClusterRoles to each user, as sketched after this list. See Certificate Signing Requests for details.
- Create policies that combine attributes (ABAC): If you are using ABAC authorization, you can assign combinations of attributes to form policies to authorize selected users or groups to access particular resources (such as a pod), namespace, or apiGroup. For more information, see Examples.
- Consider Admission Controllers: Additional forms of authorization for requests that can come in through the API server include Webhook Token Authentication. Webhooks and other special authorization types need to be enabled by adding Admission Controllers to the API server.
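As a sketch of the RBAC flow described above (the user, CSR, Role, and namespace names are hypothetical):
# Approve a pending CertificateSigningRequest submitted by a new user
kubectl certificate approve alice-csr
# Bind an existing Role named "developer" to that user in the "dev" namespace
kubectl create rolebinding alice-developer --role=developer --user=alice --namespace=dev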
Set limits on workload resources
Demands from production workloads can cause pressure both inside and outside of the Kubernetes control plane. Consider these items when setting up for the needs of your cluster's workloads:
- Set namespace limits: Set per-namespace quotas on things like memory and CPU. See Manage Memory, CPU, and API Resources for details, and the quota sketch after this list. You can also set Hierarchical Namespaces for inheriting limits.
- Prepare for DNS demand: If you expect workloads to massively scale up, your DNS service must be ready to scale up as well. See Autoscale the DNS service in a Cluster.
- Create additional service accounts: User accounts determine what users can do on a cluster, while a service account defines pod access within a particular namespace. By default, a pod takes on the default service account from its namespace. See Managing Service Accounts for information on creating a new service account. For example, you might want to:
- Add secrets that a pod could use to pull images from a particular container registry. See Configure Service Accounts for Pods for an example.
- Assign RBAC permissions to a service account. See ServiceAccount permissions for details.
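As a sketch of the per-namespace limits mentioned above (the namespace name and quota values are hypothetical):
# Cap aggregate CPU, memory, and pod count for the "dev" namespace
kubectl create quota dev-quota --hard=cpu=4,memory=8Gi,pods=10 --namespace=dev
# Confirm the quota and current usage
kubectl describe quota dev-quota --namespace=dev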
What's next
- Decide if you want to build your own production Kubernetes or obtain one from available Turnkey Cloud Solutions or Kubernetes Partners.
- If you choose to build your own cluster, plan how you want to handle certificates and set up high availability for features such as etcd and the API server.
- Choose from kubeadm, kops or Kubespray deployment methods.
- Configure user management by determining your Authentication and Authorization methods.
- Prepare for application workloads by setting up resource limits, DNS autoscaling and service accounts.
2.2.1 - Container runtimes
You need to install a container runtime into each node in the cluster so that Pods can run there. This page outlines what is involved and describes related tasks for setting up nodes.
Kubernetes 1.23 requires that you use a runtime that conforms with the Container Runtime Interface (CRI).
See CRI version support for more information.
This page lists details for using several common container runtimes with Kubernetes, on Linux:
Cgroup drivers
Control groups are used to constrain resources that are allocated to processes.
When systemd is chosen as the init system for a Linux distribution, the init process generates and consumes a root control group (cgroup) and acts as a cgroup manager. Systemd has a tight integration with cgroups and allocates a cgroup per systemd unit. It's possible to configure your container runtime and the kubelet to use cgroupfs. Using cgroupfs alongside systemd means that there will be two different cgroup managers.
A single cgroup manager simplifies the view of what resources are being allocated and will by default have a more consistent view of the available and in-use resources. When there are two cgroup managers on a system, you end up with two views of those resources. In the field, people have reported cases where nodes that are configured to use cgroupfs for the kubelet and Docker, but systemd for the rest of the processes, become unstable under resource pressure.
Changing the settings such that your container runtime and kubelet use systemd as the cgroup driver stabilized the system. To configure this for Docker, set native.cgroupdriver=systemd.
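For Docker, that setting typically lives in /etc/docker/daemon.json; a minimal sketch (merge with any options already present in that file):
# Configure Docker to use the systemd cgroup driver, then restart it
cat <<EOF | sudo tee /etc/docker/daemon.json
{
  "exec-opts": ["native.cgroupdriver=systemd"]
}
EOF
sudo systemctl restart docker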
Changing the cgroup driver of a Node that has joined a cluster is a sensitive operation. If the kubelet has created Pods using the semantics of one cgroup driver, changing the container runtime to another cgroup driver can cause errors when trying to re-create the Pod sandbox for such existing Pods. Restarting the kubelet may not solve such errors.
If you have automation that makes it feasible, replace the node with another using the updated configuration, or reinstall it using automation.
Cgroup v2
Cgroup v2 is the next version of the cgroup Linux API. Unlike cgroup v1, it provides a single hierarchy instead of a separate one for each controller.
The new version offers several improvements over cgroup v1, including:
- cleaner and easier to use API
- safe sub-tree delegation to containers
- newer features like Pressure Stall Information
Even if the kernel supports a hybrid configuration where some controllers are managed by cgroup v1 and others by cgroup v2, Kubernetes supports only a single cgroup version managing all controllers.
If systemd doesn't use cgroup v2 by default, you can configure the system to use it by adding systemd.unified_cgroup_hierarchy=1 to the kernel command line.
# This example is for a Linux OS that uses the DNF package manager
# Your system might use a different method for setting the command line
# that the Linux kernel uses.
sudo dnf install -y grubby && \
sudo grubby \
--update-kernel=ALL \
--args="systemd.unified_cgroup_hierarchy=1"
If you change the command line for the kernel, you must reboot the node before your change takes effect.
There should not be any noticeable difference in the user experience when switching to cgroup v2, unless users are accessing the cgroup file system directly, either on the node or from within the containers.
In order to use it, cgroup v2 must be supported by the CRI runtime as well.
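You can check which cgroup version a node is using by inspecting the filesystem type mounted at /sys/fs/cgroup:
# Prints "cgroup2fs" on cgroup v2 systems, "tmpfs" on cgroup v1 systems
stat -fc %T /sys/fs/cgroup/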
Migrating to the systemd driver in kubeadm managed clusters
Follow this Migration guide if you wish to migrate to the systemd cgroup driver in existing kubeadm managed clusters.
CRI version support
Your container runtime must support at least v1alpha2 of the container runtime interface.
Kubernetes 1.23 defaults to using v1 of the CRI API. If a container runtime does not support the v1 API, the kubelet falls back to using the (deprecated) v1alpha2 API instead.
Container runtimes
containerd
This section contains the necessary steps to use containerd as CRI runtime.
Use the following commands to install containerd on your system:
Install and configure prerequisites:
cat <<EOF | sudo tee /etc/modules-load.d/containerd.conf
overlay
br_netfilter
EOF
sudo modprobe overlay
sudo modprobe br_netfilter
# Setup required sysctl params, these persist across reboots.
cat <<EOF | sudo tee /etc/sysctl.d/99-kubernetes-cri.conf
net.bridge.bridge-nf-call-iptables = 1
net.ipv4.ip_forward = 1
net.bridge.bridge-nf-call-ip6tables = 1
EOF
# Apply sysctl params without reboot
sudo sysctl --system
Install containerd:
- Install the containerd.io package from the official Docker repositories. Instructions for setting up the Docker repository for your respective Linux distribution and installing the containerd.io package can be found at Install Docker Engine.
- Configure containerd:
sudo mkdir -p /etc/containerd
containerd config default | sudo tee /etc/containerd/config.toml
- Restart containerd:
sudo systemctl restart containerd
Start a Powershell session, set $Version to the desired version (ex: $Version="1.4.3"), and then run the following commands:
- Download containerd:
curl.exe -L https://github.com/containerd/containerd/releases/download/v$Version/containerd-$Version-windows-amd64.tar.gz -o containerd-windows-amd64.tar.gz
tar.exe xvf .\containerd-windows-amd64.tar.gz
- Extract and configure:
Copy-Item -Path ".\bin\" -Destination "$Env:ProgramFiles\containerd" -Recurse -Force
cd $Env:ProgramFiles\containerd\
.\containerd.exe config default | Out-File config.toml -Encoding ascii
# Review the configuration. Depending on setup you may want to adjust:
# - the sandbox_image (Kubernetes pause image)
# - cni bin_dir and conf_dir locations
Get-Content config.toml
# (Optional - but highly recommended) Exclude containerd from Windows Defender Scans
Add-MpPreference -ExclusionProcess "$Env:ProgramFiles\containerd\containerd.exe"
- Start containerd:
.\containerd.exe --register-service
Start-Service containerd
Using the systemd cgroup driver
To use the systemd cgroup driver in /etc/containerd/config.toml with runc, set
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
...
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
SystemdCgroup = true
If you apply this change, make sure to restart containerd:
sudo systemctl restart containerd
When using kubeadm, manually configure the cgroup driver for kubelet.
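A minimal sketch of doing that through a kubeadm configuration file, passed to kubeadm init --config (the API versions match kubeadm in the v1.23 era; the Kubernetes version shown is illustrative):
cat <<EOF > kubeadm-config.yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.23.0
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Match the kubelet's cgroup driver to the one configured for containerd above
cgroupDriver: systemd
EOF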
CRI-O
This section contains the necessary steps to install CRI-O as a container runtime.
Use the following commands to install CRI-O on your system:
Install and configure prerequisites:
# Create the .conf file to load the modules at bootup
cat <<EOF | sudo tee /etc/modules-load.d/crio.conf
overlay
br_netfilter
EOF
sudo modprobe overlay
sudo modprobe br_netfilter
# Set up required sysctl params, these persist across reboots.
cat <<EOF | sudo tee /etc/sysctl.d/99-kubernetes-cri.conf
net.bridge.bridge-nf-call-iptables = 1
net.ipv4.ip_forward = 1
net.bridge.bridge-nf-call-ip6tables = 1
EOF
sudo sysctl --system
To install CRI-O on the following operating systems, set the environment variable OS to the appropriate value from the following table:
Operating system | $OS |
---|---|
Debian Unstable | Debian_Unstable |
Debian Testing | Debian_Testing |
Then, set $VERSION to the CRI-O version that matches your Kubernetes version. For instance, if you want to install CRI-O 1.20, set VERSION=1.20. You can pin your installation to a specific release. To install version 1.20.0, set VERSION=1.20:1.20.0.
Then run
cat <<EOF | sudo tee /etc/apt/sources.list.d/devel:kubic:libcontainers:stable.list
deb https://download.opensuse.org/repositories/devel:/kubic:/libcontainers:/stable/$OS/ /
EOF
cat <<EOF | sudo tee /etc/apt/sources.list.d/devel:kubic:libcontainers:stable:cri-o:$VERSION.list
deb http://download.opensuse.org/repositories/devel:/kubic:/libcontainers:/stable:/cri-o:/$VERSION/$OS/ /
EOF
curl -L https://download.opensuse.org/repositories/devel:kubic:libcontainers:stable:cri-o:$VERSION/$OS/Release.key | sudo apt-key --keyring /etc/apt/trusted.gpg.d/libcontainers.gpg add -
curl -L https://download.opensuse.org/repositories/devel:/kubic:/libcontainers:/stable/$OS/Release.key | sudo apt-key --keyring /etc/apt/trusted.gpg.d/libcontainers.gpg add -
sudo apt-get update
sudo apt-get install cri-o cri-o-runc
To install on the following operating systems, set the environment variable OS to the appropriate field in the following table:
Operating system | $OS |
---|---|
Ubuntu 20.04 | xUbuntu_20.04 |
Ubuntu 19.10 | xUbuntu_19.10 |
Ubuntu 19.04 | xUbuntu_19.04 |
Ubuntu 18.04 | xUbuntu_18.04 |
Then, set $VERSION to the CRI-O version that matches your Kubernetes version. For instance, if you want to install CRI-O 1.20, set VERSION=1.20. You can pin your installation to a specific release. To install version 1.20.0, set VERSION=1.20:1.20.0.
Then run
cat <<EOF | sudo tee /etc/apt/sources.list.d/devel:kubic:libcontainers:stable.list
deb https://download.opensuse.org/repositories/devel:/kubic:/libcontainers:/stable/$OS/ /
EOF
cat <<EOF | sudo tee /etc/apt/sources.list.d/devel:kubic:libcontainers:stable:cri-o:$VERSION.list
deb http://download.opensuse.org/repositories/devel:/kubic:/libcontainers:/stable:/cri-o:/$VERSION/$OS/ /
EOF
curl -L https://download.opensuse.org/repositories/devel:/kubic:/libcontainers:/stable/$OS/Release.key | sudo apt-key --keyring /etc/apt/trusted.gpg.d/libcontainers.gpg add -
curl -L https://download.opensuse.org/repositories/devel:kubic:libcontainers:stable:cri-o:$VERSION/$OS/Release.key | sudo apt-key --keyring /etc/apt/trusted.gpg.d/libcontainers-cri-o.gpg add -
sudo apt-get update
sudo apt-get install cri-o cri-o-runc
To install on the following operating systems, set the environment variable OS to the appropriate field in the following table:
Operating system | $OS |
---|---|
CentOS 8 | CentOS_8 |
CentOS 8 Stream | CentOS_8_Stream |
CentOS 7 | CentOS_7 |
Then, set $VERSION to the CRI-O version that matches your Kubernetes version. For instance, if you want to install CRI-O 1.20, set VERSION=1.20. You can pin your installation to a specific release. To install version 1.20.0, set VERSION=1.20:1.20.0.
Then run
sudo curl -L -o /etc/yum.repos.d/devel:kubic:libcontainers:stable.repo https://download.opensuse.org/repositories/devel:/kubic:/libcontainers:/stable/$OS/devel:kubic:libcontainers:stable.repo
sudo curl -L -o /etc/yum.repos.d/devel:kubic:libcontainers:stable:cri-o:$VERSION.repo https://download.opensuse.org/repositories/devel:kubic:libcontainers:stable:cri-o:$VERSION/$OS/devel:kubic:libcontainers:stable:cri-o:$VERSION.repo
sudo yum install cri-o
On openSUSE Tumbleweed, run:
sudo zypper install cri-o
On Fedora, set $VERSION to the CRI-O version that matches your Kubernetes version. For instance, if you want to install CRI-O 1.20, set VERSION=1.20.
You can find available versions with:
sudo dnf module list cri-o
CRI-O does not support pinning to specific releases on Fedora.
Then run
sudo dnf module enable cri-o:$VERSION
sudo dnf install cri-o
Start CRI-O:
sudo systemctl daemon-reload
sudo systemctl enable crio --now
Refer to the CRI-O installation guide for more information.
cgroup driver
CRI-O uses the systemd cgroup driver by default. To switch to the cgroupfs cgroup driver, either edit /etc/crio/crio.conf or place a drop-in configuration in /etc/crio/crio.conf.d/02-cgroup-manager.conf, for example:
[crio.runtime]
conmon_cgroup = "pod"
cgroup_manager = "cgroupfs"
Please also note the changed conmon_cgroup, which has to be set to the value pod when using CRI-O with cgroupfs. It is generally necessary to keep the cgroup driver configuration of the kubelet (usually done via kubeadm) and CRI-O in sync.
Docker Engine
Docker Engine is the container runtime that started it all. Formerly known just as Docker, this container runtime is available in various forms. Install Docker Engine explains your options for installing this runtime.
Docker Engine is directly compatible with Kubernetes 1.23, using the deprecated dockershim component. For more information and context, see the Dockershim deprecation FAQ.
You can also find third-party adapters that let you use Docker Engine with Kubernetes through the supported Container Runtime Interface (CRI).
The following CRI adapters are designed to work with Docker Engine:
- cri-dockerd from Mirantis
Mirantis Container Runtime
Mirantis Container Runtime (MCR) is a commercially available container runtime that was formerly known as Docker Enterprise Edition.
You can use Mirantis Container Runtime with Kubernetes using the open source cri-dockerd component, included with MCR.
2.2.2 - Installing Kubernetes with deployment tools
2.2.2.1 - Bootstrapping clusters with kubeadm
2.2.2.1.1 - Installing kubeadm
This page shows how to install the kubeadm toolbox.
For information on how to create a cluster with kubeadm once you have performed this installation process, see the Using kubeadm to Create a Cluster page.
Before you begin
- A compatible Linux host. The Kubernetes project provides generic instructions for Linux distributions based on Debian and Red Hat, and those distributions without a package manager.
- 2 GB or more of RAM per machine (any less will leave little room for your apps).
- 2 CPUs or more.
- Full network connectivity between all machines in the cluster (public or private network is fine).
- Unique hostname, MAC address, and product_uuid for every node. See here for more details.
- Certain ports are open on your machines. See here for more details.
- Swap disabled. You MUST disable swap in order for the kubelet to work properly. See the example after this list.
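For example (the sed pattern is a common convention; review your /etc/fstab before editing it):
# Turn swap off now, and keep it off across reboots
sudo swapoff -a
sudo sed -i '/ swap / s/^/#/' /etc/fstab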
Verify the MAC address and product_uuid are unique for every node
- You can get the MAC address of the network interfaces using the command ip link or ifconfig -a.
- The product_uuid can be checked by using the command sudo cat /sys/class/dmi/id/product_uuid.
It is very likely that hardware devices will have unique addresses, although some virtual machines may have identical values. Kubernetes uses these values to uniquely identify the nodes in the cluster. If these values are not unique to each node, the installation process may fail.
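One way to compare these values across machines, assuming SSH access to each prospective node (hostnames are illustrative):
# Print the product_uuid of each node side by side
for host in node-1 node-2 node-3; do
  echo -n "$host: "; ssh "$host" sudo cat /sys/class/dmi/id/product_uuid
done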
Check network adapters
If you have more than one network adapter, and your Kubernetes components are not reachable on the default route, we recommend you add IP route(s) so Kubernetes cluster addresses go via the appropriate adapter.
Letting iptables see bridged traffic
Make sure that the br_netfilter module is loaded. This can be done by running lsmod | grep br_netfilter. To load it explicitly, call sudo modprobe br_netfilter.
As a requirement for your Linux Node's iptables to correctly see bridged traffic, you should ensure net.bridge.bridge-nf-call-iptables is set to 1 in your sysctl config, e.g.
cat <<EOF | sudo tee /etc/modules-load.d/k8s.conf
br_netfilter
EOF
cat <<EOF | sudo tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
EOF
sudo sysctl --system
For more details please see the Network Plugin Requirements page.
Check required ports
These required ports need to be open in order for Kubernetes components to communicate with each other. You can use telnet to check if a port is open. For example:
telnet 127.0.0.1 6443
The pod network plugin you use (see below) may also require certain ports to be open. Since this differs with each pod network plugin, please see the documentation for the plugins about what port(s) those need.
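As a convenience, you can probe the default control-plane ports listed in the Ports and Protocols reference in one loop; expect failures for components that are not running yet:
# Check the default control-plane ports on the local machine
for port in 6443 2379 2380 10250 10257 10259; do
  nc -zv 127.0.0.1 "$port"
done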
Installing runtime
To run containers in Pods, Kubernetes uses a container runtime.
By default, Kubernetes uses the Container Runtime Interface (CRI) to interface with your chosen container runtime.
If you don't specify a runtime, kubeadm automatically tries to detect an installed container runtime by scanning through a list of well known Unix domain sockets. The following table lists container runtimes and their associated socket paths:
Runtime | Path to Unix domain socket |
---|---|
Docker | /var/run/dockershim.sock |
containerd | /run/containerd/containerd.sock |
CRI-O | /var/run/crio/crio.sock |
If both Docker and containerd are detected, Docker takes precedence. This is
needed because Docker 18.09 ships with containerd and both are detectable even if you only
installed Docker.
If any two or more other runtimes are detected, kubeadm exits with an error.
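If you want to avoid relying on this auto-detection, you can point kubeadm at a specific runtime socket explicitly, for example:
# Use containerd even if Docker is also installed
sudo kubeadm init --cri-socket /run/containerd/containerd.sock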
The kubelet integrates with Docker through the built-in dockershim CRI implementation. See container runtimes for more information.
Installing kubeadm, kubelet and kubectl
You will install these packages on all of your machines:
- kubeadm: the command to bootstrap the cluster.
- kubelet: the component that runs on all of the machines in your cluster and does things like starting pods and containers.
- kubectl: the command line util to talk to your cluster.
kubeadm will not install or manage kubelet
or kubectl
for you, so you will
need to ensure they match the version of the Kubernetes control plane you want
kubeadm to install for you. If you do not, there is a risk of a version skew occurring that
can lead to unexpected, buggy behaviour. However, one minor version skew between the
kubelet and the control plane is supported, but the kubelet version may never exceed the API
server version. For example, the kubelet running 1.7.0 should be fully compatible with a 1.8.0 API server,
but not vice versa.
For information about installing kubectl, see Install and set up kubectl.
For more information on version skews, see:
- Kubernetes version and version-skew policy
- Kubeadm-specific version skew policy
- Update the apt package index and install packages needed to use the Kubernetes apt repository:
  sudo apt-get update
  sudo apt-get install -y apt-transport-https ca-certificates curl
- Download the Google Cloud public signing key:
  sudo curl -fsSLo /usr/share/keyrings/kubernetes-archive-keyring.gpg https://packages.cloud.google.com/apt/doc/apt-key.gpg
- Add the Kubernetes apt repository:
  echo "deb [signed-by=/usr/share/keyrings/kubernetes-archive-keyring.gpg] https://apt.kubernetes.io/ kubernetes-xenial main" | sudo tee /etc/apt/sources.list.d/kubernetes.list
- Update the apt package index, install kubelet, kubeadm and kubectl, and pin their version:
  sudo apt-get update
  sudo apt-get install -y kubelet kubeadm kubectl
  sudo apt-mark hold kubelet kubeadm kubectl
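If you need a specific patch version rather than the latest, you can pin the packages explicitly before holding them; a hedged sketch (the version string is illustrative):
sudo apt-get install -y kubelet=1.23.1-00 kubeadm=1.23.1-00 kubectl=1.23.1-00
sudo apt-mark hold kubelet kubeadm kubectl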
On Red Hat-based distributions, add the Kubernetes yum repository instead:
cat <<EOF | sudo tee /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://packages.cloud.google.com/yum/repos/kubernetes-el7-\$basearch
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://packages.cloud.google.com/yum/doc/yum-key.gpg https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg
exclude=kubelet kubeadm kubectl
EOF
# Set SELinux in permissive mode (effectively disabling it)
sudo setenforce 0
sudo sed -i 's/^SELINUX=enforcing$/SELINUX=permissive/' /etc/selinux/config
sudo yum install -y kubelet kubeadm kubectl --disableexcludes=kubernetes
sudo systemctl enable --now kubelet
Notes:
- Setting SELinux in permissive mode by running setenforce 0 and sed ... effectively disables it. This is required to allow containers to access the host filesystem, which is needed by pod networks for example. You have to do this until SELinux support is improved in the kubelet.
- You can leave SELinux enabled if you know how to configure it but it may require settings that are not supported by kubeadm.
- If the baseurl fails because your Red Hat-based distribution cannot interpret basearch, replace \$basearch with your computer's architecture. Type uname -m to see that value. For example, the baseurl URL for x86_64 could be: https://packages.cloud.google.com/yum/repos/kubernetes-el7-x86_64.
Alternatively, you can install the binaries manually without a package manager.
Install CNI plugins (required for most pod networks):
CNI_VERSION="v0.8.2"
ARCH="amd64"
sudo mkdir -p /opt/cni/bin
curl -L "https://github.com/containernetworking/plugins/releases/download/${CNI_VERSION}/cni-plugins-linux-${ARCH}-${CNI_VERSION}.tgz" | sudo tar -C /opt/cni/bin -xz
Define the directory to download command files.
The DOWNLOAD_DIR variable must be set to a writable directory. If you are running Flatcar Container Linux, set DOWNLOAD_DIR=/opt/bin.
DOWNLOAD_DIR=/usr/local/bin
sudo mkdir -p $DOWNLOAD_DIR
Install crictl (required for kubeadm / Kubelet Container Runtime Interface (CRI))
CRICTL_VERSION="v1.22.0"
ARCH="amd64"
curl -L "https://github.com/kubernetes-sigs/cri-tools/releases/download/${CRICTL_VERSION}/crictl-${CRICTL_VERSION}-linux-${ARCH}.tar.gz" | sudo tar -C $DOWNLOAD_DIR -xz
Install kubeadm, kubelet, kubectl and add a kubelet systemd service:
RELEASE="$(curl -sSL https://dl.k8s.io/release/stable.txt)"
ARCH="amd64"
cd $DOWNLOAD_DIR
sudo curl -L --remote-name-all https://storage.googleapis.com/kubernetes-release/release/${RELEASE}/bin/linux/${ARCH}/{kubeadm,kubelet,kubectl}
sudo chmod +x {kubeadm,kubelet,kubectl}
RELEASE_VERSION="v0.4.0"
curl -sSL "https://raw.githubusercontent.com/kubernetes/release/${RELEASE_VERSION}/cmd/kubepkg/templates/latest/deb/kubelet/lib/systemd/system/kubelet.service" | sed "s:/usr/bin:${DOWNLOAD_DIR}:g" | sudo tee /etc/systemd/system/kubelet.service
sudo mkdir -p /etc/systemd/system/kubelet.service.d
curl -sSL "https://raw.githubusercontent.com/kubernetes/release/${RELEASE_VERSION}/cmd/kubepkg/templates/latest/deb/kubeadm/10-kubeadm.conf" | sed "s:/usr/bin:${DOWNLOAD_DIR}:g" | sudo tee /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
Enable and start kubelet:
systemctl enable --now kubelet
On distributions such as Flatcar Container Linux, the /usr directory is mounted as a read-only filesystem.
Before bootstrapping your cluster, you need to take additional steps to configure a writable directory.
See the Kubeadm Troubleshooting guide to learn how to set up a writable directory.
The kubelet is now restarting every few seconds, as it waits in a crashloop for kubeadm to tell it what to do.
Configuring a cgroup driver
Both the container runtime and the kubelet have a property called "cgroup driver", which is important for the management of cgroups on Linux machines.
Matching the container runtime and kubelet cgroup drivers is required; otherwise the kubelet process will fail.
See Configuring a cgroup driver for more details.
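For example, a minimal hedged sketch of a KubeletConfiguration that selects the systemd cgroup driver (assuming your container runtime is also configured for systemd); it can be appended to a kubeadm configuration file and passed to kubeadm init --config:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: systemd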
Troubleshooting
If you are running into difficulties with kubeadm, please consult our troubleshooting docs.
What's next
2.2.2.1.2 - Troubleshooting kubeadm
As with any program, you might run into an error installing or running kubeadm. This page lists some common failure scenarios and provides steps that can help you understand and fix the problem.
If your problem is not listed below, please take the following steps:
- If you think your problem is a bug with kubeadm:
  - Go to github.com/kubernetes/kubeadm and search for existing issues.
  - If no issue exists, please open one and follow the issue template.
- If you are unsure about how kubeadm works, you can ask on Slack in #kubeadm, or open a question on StackOverflow. Please include relevant tags like #kubernetes and #kubeadm so folks can help you.
Not possible to join a v1.18 Node to a v1.17 cluster due to missing RBAC
In v1.18 kubeadm added prevention for joining a Node in the cluster if a Node with the same name already exists. This required adding RBAC for the bootstrap-token user to be able to GET a Node object.
However this causes an issue where kubeadm join from v1.18 cannot join a cluster created by kubeadm v1.17.
To work around the issue, you have two options:
Execute kubeadm init phase bootstrap-token
on a control-plane node using kubeadm v1.18.
Note that this enables the rest of the bootstrap-token permissions as well.
or
Apply the following RBAC manually using kubectl apply -f ...:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: kubeadm:get-nodes
rules:
- apiGroups:
- ""
resources:
- nodes
verbs:
- get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: kubeadm:get-nodes
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: kubeadm:get-nodes
subjects:
- apiGroup: rbac.authorization.k8s.io
kind: Group
name: system:bootstrappers:kubeadm:default-node-token
ebtables or some similar executable not found during installation
If you see the following warnings while running kubeadm init:
[preflight] WARNING: ebtables not found in system path
[preflight] WARNING: ethtool not found in system path
Then you may be missing ebtables, ethtool or a similar executable on your node. You can install them with the following commands:
- For Ubuntu/Debian users, run apt install ebtables ethtool.
- For CentOS/Fedora users, run yum install ebtables ethtool.
kubeadm blocks waiting for control plane during installation
If you notice that kubeadm init
hangs after printing out the following line:
[apiclient] Created API client, waiting for the control plane to become ready
This may be caused by a number of problems. The most common are:
- network connection problems. Check that your machine has full network connectivity before continuing.
- the cgroup driver of the container runtime differs from that of the kubelet. To understand how to configure it properly see Configuring a cgroup driver.
- control plane containers are crashlooping or hanging. You can check this by running docker ps and investigating each container by running docker logs. For other container runtimes, see Debugging Kubernetes nodes with crictl; a brief sketch follows this list.
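For instance, assuming crictl is installed and pointed at your runtime's socket, the rough crictl equivalents are:
# List all containers, including ones that have exited
sudo crictl ps -a
# Inspect a container's logs by its ID (taken from the listing above)
sudo crictl logs <container-id>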
kubeadm blocks when removing managed containers
The following could happen if Docker halts and does not remove any Kubernetes-managed containers:
sudo kubeadm reset
[preflight] Running pre-flight checks
[reset] Stopping the kubelet service
[reset] Unmounting mounted directories in "/var/lib/kubelet"
[reset] Removing kubernetes-managed containers
(block)
A possible solution is to restart the Docker service and then re-run kubeadm reset:
sudo systemctl restart docker.service
sudo kubeadm reset
Inspecting the logs for docker may also be useful:
journalctl -u docker
Pods in RunContainerError, CrashLoopBackOff or Error state
Right after kubeadm init
there should not be any pods in these states.
- If there are pods in one of these states right after kubeadm init, please open an issue in the kubeadm repo. coredns (or kube-dns) should be in the Pending state until you have deployed the network add-on.
- If you see Pods in the RunContainerError, CrashLoopBackOff or Error state after deploying the network add-on and nothing happens to coredns (or kube-dns), it's very likely that the Pod Network add-on that you installed is somehow broken. You might have to grant it more RBAC privileges or use a newer version. Please file an issue in the Pod Network providers' issue tracker and get the issue triaged there.
- If you install a version of Docker older than 1.12.1, remove the MountFlags=slave option when booting dockerd with systemd and restart docker. You can see the MountFlags in /usr/lib/systemd/system/docker.service. MountFlags can interfere with volumes mounted by Kubernetes, and put the Pods in CrashLoopBackOff state. The error happens when Kubernetes does not find var/run/secrets/kubernetes.io/serviceaccount files.
coredns is stuck in the Pending state
This is expected and part of the design. kubeadm is network provider-agnostic, so the admin
should install the pod network add-on
of choice. You have to install a Pod Network
before CoreDNS may be deployed fully. Hence the Pending
state before the network is set up.
HostPort services do not work
The HostPort
and HostIP
functionality is available depending on your Pod Network
provider. Please contact the author of the Pod Network add-on to find out whether
HostPort
and HostIP
functionality are available.
Calico, Canal, and Flannel CNI providers are verified to support HostPort.
For more information, see the CNI portmap documentation.
If your network provider does not support the portmap CNI plugin, you may need to use the NodePort feature of services or use HostNetwork=true.
Pods are not accessible via their Service IP
- Many network add-ons do not yet enable hairpin mode which allows pods to access themselves via their Service IP. This is an issue related to CNI. Please contact the network add-on provider to get the latest status of their support for hairpin mode.
- If you are using VirtualBox (directly or via Vagrant), you will need to ensure that hostname -i returns a routable IP address. By default the first interface is connected to a non-routable host-only network. A workaround is to modify /etc/hosts; see this Vagrantfile for an example.
TLS certificate errors
The following error indicates a possible certificate mismatch.
# kubectl get pods
Unable to connect to the server: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kubernetes")
- Verify that the $HOME/.kube/config file contains a valid certificate, and regenerate a certificate if necessary. The certificates in a kubeconfig file are base64 encoded. The base64 --decode command can be used to decode the certificate and openssl x509 -text -noout can be used for viewing the certificate information.
- Unset the KUBECONFIG environment variable using:
  unset KUBECONFIG
  Or set it to the default KUBECONFIG location:
  export KUBECONFIG=/etc/kubernetes/admin.conf
- Another workaround is to overwrite the existing kubeconfig for the "admin" user:
  mv $HOME/.kube $HOME/.kube.bak
  mkdir $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config
Kubelet client certificate rotation fails
By default, kubeadm configures a kubelet with automatic rotation of client certificates by using the /var/lib/kubelet/pki/kubelet-client-current.pem symlink specified in /etc/kubernetes/kubelet.conf.
If this rotation process fails you might see errors such as x509: certificate has expired or is not yet valid in kube-apiserver logs. To fix the issue you must follow these steps:
- Backup and delete /etc/kubernetes/kubelet.conf and /var/lib/kubelet/pki/kubelet-client* from the failed node.
- From a working control plane node in the cluster that has /etc/kubernetes/pki/ca.key execute kubeadm kubeconfig user --org system:nodes --client-name system:node:$NODE > kubelet.conf. $NODE must be set to the name of the existing failed node in the cluster. Modify the resulting kubelet.conf manually to adjust the cluster name and server endpoint, or pass kubeconfig user --config (it accepts InitConfiguration). If your cluster does not have the ca.key you must sign the embedded certificates in the kubelet.conf externally.
- Copy this resulting kubelet.conf to /etc/kubernetes/kubelet.conf on the failed node.
- Restart the kubelet (systemctl restart kubelet) on the failed node and wait for /var/lib/kubelet/pki/kubelet-client-current.pem to be recreated.
- Manually edit the kubelet.conf to point to the rotated kubelet client certificates, by replacing client-certificate-data and client-key-data with:
  client-certificate: /var/lib/kubelet/pki/kubelet-client-current.pem
  client-key: /var/lib/kubelet/pki/kubelet-client-current.pem
- Restart the kubelet.
- Make sure the node becomes Ready.
Default NIC when using flannel as the pod network in Vagrant
The following error might indicate that something was wrong in the pod network:
Error from server (NotFound): the server could not find the requested resource
If you're using flannel as the pod network inside Vagrant, then you will have to specify the default interface name for flannel.
Vagrant typically assigns two interfaces to all VMs. The first, for which all hosts are assigned the IP address 10.0.2.15, is for external traffic that gets NATed. This may lead to problems with flannel, which defaults to the first interface on a host. This leads to all hosts thinking they have the same public IP address. To prevent this, pass the --iface eth1 flag to flannel so that the second interface is chosen.
Non-public IP used for containers
In some situations kubectl logs
and kubectl run
commands may return with the following errors in an otherwise functional cluster:
Error from server: Get https://10.19.0.41:10250/containerLogs/default/mysql-ddc65b868-glc5m/mysql: dial tcp 10.19.0.41:10250: getsockopt: no route to host
- This may be due to Kubernetes using an IP that can not communicate with other IPs on the seemingly same subnet, possibly by policy of the machine provider.
- DigitalOcean assigns a public IP to eth0 as well as a private one to be used internally as anchor for their floating IP feature, yet kubelet will pick the latter as the node's InternalIP instead of the public one.
  Use ip addr show to check for this scenario instead of ifconfig because ifconfig will not display the offending alias IP address. Alternatively an API endpoint specific to DigitalOcean allows you to query for the anchor IP from the droplet:
  curl http://169.254.169.254/metadata/v1/interfaces/public/0/anchor_ipv4/address
  The workaround is to tell kubelet which IP to use using --node-ip. When using DigitalOcean, it can be the public one (assigned to eth0) or the private one (assigned to eth1) should you want to use the optional private network. The kubeletExtraArgs section of the kubeadm NodeRegistrationOptions structure can be used for this; see the sketch after this list.
  Then restart kubelet:
  systemctl daemon-reload
  systemctl restart kubelet
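As a hedged sketch (the IP address is a placeholder and the discovery settings are omitted), the node IP could be set through a JoinConfiguration passed to kubeadm join --config:
apiVersion: kubeadm.k8s.io/v1beta3
kind: JoinConfiguration
# discovery settings omitted for brevity
nodeRegistration:
  kubeletExtraArgs:
    node-ip: "203.0.113.10"  # placeholder: the IP this node should advertise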
coredns pods have CrashLoopBackOff or Error state
If you have nodes that are running SELinux with an older version of Docker you might experience a scenario
where the coredns
pods are not starting. To solve that you can try one of the following options:
- Upgrade to a newer version of Docker.
- Modify the coredns deployment to set allowPrivilegeEscalation to true:
kubectl -n kube-system get deployment coredns -o yaml | \
sed 's/allowPrivilegeEscalation: false/allowPrivilegeEscalation: true/g' | \
kubectl apply -f -
Another cause for CoreDNS to have CrashLoopBackOff
is when a CoreDNS Pod deployed in Kubernetes detects a loop. A number of workarounds
are available to avoid Kubernetes trying to restart the CoreDNS Pod every time CoreDNS detects the loop and exits.
Setting allowPrivilegeEscalation to true can compromise the security of your cluster.
etcd pods restart continually
If you encounter the following error:
rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:247: starting container process caused "process_linux.go:110: decoding init error from pipe caused \"read parent: connection reset by peer\""
This issue appears if you run CentOS 7 with Docker 1.13.1.84. This version of Docker can prevent the kubelet from executing into the etcd container.
To work around the issue, choose one of these options:
- Roll back to an earlier version of Docker, such as 1.13.1-75
yum downgrade docker-1.13.1-75.git8633870.el7.centos.x86_64 docker-client-1.13.1-75.git8633870.el7.centos.x86_64 docker-common-1.13.1-75.git8633870.el7.centos.x86_64
- Install one of the more recent recommended versions, such as 18.06:
sudo yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
yum install docker-ce-18.06.1.ce-3.el7.x86_64
Not possible to pass a comma separated list of values to arguments inside a --component-extra-args flag
kubeadm init flags such as --component-extra-args allow you to pass custom arguments to a control-plane component like the kube-apiserver. However, this mechanism is limited due to the underlying type used for parsing the values (mapStringString).
If you decide to pass an argument that supports multiple, comma-separated values such as --apiserver-extra-args "enable-admission-plugins=LimitRanger,NamespaceExists", this flag will fail with flag: malformed pair, expect string=string. This happens because the list of arguments for --apiserver-extra-args expects key=value pairs and in this case NamespaceExists is considered as a key that is missing a value.
Alternatively, you can try separating the key=value pairs like so: --apiserver-extra-args "enable-admission-plugins=LimitRanger,enable-admission-plugins=NamespaceExists", but this will result in the key enable-admission-plugins only having the value of NamespaceExists.
A known workaround is to use the kubeadm configuration file, as shown in the sketch below.
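For example (a hedged sketch using the plugin list from this section), the same comma-separated value can be set through a ClusterConfiguration passed to kubeadm init --config:
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
apiServer:
  extraArgs:
    # The configuration file passes the value through verbatim, commas included
    enable-admission-plugins: "LimitRanger,NamespaceExists"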
kube-proxy scheduled before node is initialized by cloud-controller-manager
In cloud provider scenarios, kube-proxy can end up being scheduled on new worker nodes before the cloud-controller-manager has initialized the node addresses. This causes kube-proxy to fail to pick up the node's IP address properly and has knock-on effects to the proxy function managing load balancers.
The following error can be seen in kube-proxy Pods:
server.go:610] Failed to retrieve node IP: host IP unknown; known addresses: []
proxier.go:340] invalid nodeIP, initializing kube-proxy with 127.0.0.1 as nodeIP
A known solution is to patch the kube-proxy DaemonSet to allow scheduling it on control-plane nodes regardless of their conditions, keeping it off of other nodes until their initial guarding conditions abate:
kubectl -n kube-system patch ds kube-proxy -p='{ "spec": { "template": { "spec": { "tolerations": [ { "key": "CriticalAddonsOnly", "operator": "Exists" }, { "effect": "NoSchedule", "key": "node-role.kubernetes.io/master" } ] } } } }'
The tracking issue for this problem is here.
/usr is mounted read-only on nodes
On Linux distributions such as Fedora CoreOS or Flatcar Container Linux, the directory /usr is mounted as a read-only filesystem.
For flex-volume support, Kubernetes components like the kubelet and kube-controller-manager use the default path of /usr/libexec/kubernetes/kubelet-plugins/volume/exec/, yet the flex-volume directory must be writeable for the feature to work. (Note: FlexVolume was deprecated in the Kubernetes v1.23 release.)
To work around this issue you can configure the flex-volume directory using the kubeadm configuration file.
On the primary control-plane Node (created using kubeadm init) pass the following file using --config:
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
nodeRegistration:
kubeletExtraArgs:
volume-plugin-dir: "/opt/libexec/kubernetes/kubelet-plugins/volume/exec/"
---
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
controllerManager:
extraArgs:
flex-volume-plugin-dir: "/opt/libexec/kubernetes/kubelet-plugins/volume/exec/"
On joining Nodes:
apiVersion: kubeadm.k8s.io/v1beta3
kind: JoinConfiguration
nodeRegistration:
kubeletExtraArgs:
volume-plugin-dir: "/opt/libexec/kubernetes/kubelet-plugins/volume/exec/"
Alternatively, you can modify /etc/fstab
to make the /usr
mount writeable, but please
be advised that this is modifying a design principle of the Linux distribution.
kubeadm upgrade plan prints out a context deadline exceeded error message
This error message is shown when upgrading a Kubernetes cluster with kubeadm in the case of running an external etcd. This is not a critical bug and happens because older versions of kubeadm perform a version check on the external etcd cluster. You can proceed with kubeadm upgrade apply ....
This issue is fixed as of version 1.19.
kubeadm reset unmounts /var/lib/kubelet
If /var/lib/kubelet is being mounted, performing a kubeadm reset will effectively unmount it.
To work around the issue, re-mount the /var/lib/kubelet directory after performing the kubeadm reset operation.
This is a regression introduced in kubeadm 1.15. The issue is fixed in 1.20.
Cannot use the metrics-server securely in a kubeadm cluster
In a kubeadm cluster, the metrics-server can be used insecurely by passing the --kubelet-insecure-tls flag to it. This is not recommended for production clusters.
If you want to use TLS between the metrics-server and the kubelet there is a problem, since kubeadm deploys a self-signed serving certificate for the kubelet. This can cause the following errors on the side of the metrics-server:
x509: certificate signed by unknown authority
x509: certificate is valid for IP-foo not IP-bar
See Enabling signed kubelet serving certificates to understand how to configure the kubelets in a kubeadm cluster to have properly signed serving certificates.
Also see How to run the metrics-server securely.
2.2.2.1.3 - Creating a cluster with kubeadm
Using kubeadm
, you can create a minimum viable Kubernetes cluster that conforms to best practices.
In fact, you can use kubeadm
to set up a cluster that will pass the
Kubernetes Conformance tests.
kubeadm
also supports other cluster lifecycle functions, such as
bootstrap tokens and cluster upgrades.
The kubeadm
tool is good if you need:
- A simple way for you to try out Kubernetes, possibly for the first time.
- A way for existing users to automate setting up a cluster and test their application.
- A building block in other ecosystem and/or installer tools with a larger scope.
You can install and use kubeadm
on various machines: your laptop, a set
of cloud servers, a Raspberry Pi, and more. Whether you're deploying into the
cloud or on-premises, you can integrate kubeadm
into provisioning systems such
as Ansible or Terraform.
Before you begin
To follow this guide, you need:
- One or more machines running a deb/rpm-compatible Linux OS; for example: Ubuntu or CentOS.
- 2 GiB or more of RAM per machine--any less leaves little room for your apps.
- At least 2 CPUs on the machine that you use as a control-plane node.
- Full network connectivity among all machines in the cluster. You can use either a public or a private network.
You also need to use a version of kubeadm
that can deploy the version
of Kubernetes that you want to use in your new cluster.
Kubernetes' version and version skew support policy
applies to kubeadm
as well as to Kubernetes overall.
Check that policy to learn about what versions of Kubernetes and kubeadm
are supported. This page is written for Kubernetes v1.23.
The kubeadm
tool's overall feature state is General Availability (GA). Some sub-features are
still under active development. The implementation of creating the cluster may change
slightly as the tool evolves, but the overall implementation should be pretty stable.
Commands under kubeadm alpha are, by definition, supported on an alpha level.
Objectives
- Install a single control-plane Kubernetes cluster
- Install a Pod network on the cluster so that your Pods can talk to each other
Instructions
Installing kubeadm on your hosts
See "Installing kubeadm".
If you have already installed kubeadm, run apt-get update && apt-get upgrade
or yum update
to get the latest version of kubeadm.
When you upgrade, the kubelet restarts every few seconds as it waits in a crashloop for kubeadm to tell it what to do. This crashloop is expected and normal. After you initialize your control-plane, the kubelet runs normally.
Preparing the required container images
This step is optional and only applies if you wish kubeadm init and kubeadm join to not download the default container images which are hosted at k8s.gcr.io.
Kubeadm has commands that can help you pre-pull the required images when creating a cluster without an internet connection on its nodes. See Running kubeadm without an internet connection for more details.
Kubeadm allows you to use a custom image repository for the required images. See Using custom images for more details.
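For reference, kubeadm's config subcommands can list and pre-pull those images; a brief hedged sketch:
# List the default images kubeadm would use for this cluster
kubeadm config images list
# Pre-pull them on a node that still has registry access
sudo kubeadm config images pull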
Initializing your control-plane node
The control-plane node is the machine where the control plane components run, including etcd (the cluster database) and the API Server (which the kubectl command line tool communicates with).
- (Recommended) If you have plans to upgrade this single control-plane kubeadm cluster to high availability you should specify the --control-plane-endpoint to set the shared endpoint for all control-plane nodes. Such an endpoint can be either a DNS name or an IP address of a load-balancer.
- Choose a Pod network add-on, and verify whether it requires any arguments to be passed to kubeadm init. Depending on which third-party provider you choose, you might need to set the --pod-network-cidr to a provider-specific value. See Installing a Pod network add-on.
- (Optional) Since version 1.14, kubeadm tries to detect the container runtime on Linux by using a list of well known domain socket paths. To use a different container runtime, or if there is more than one installed on the provisioned node, specify the --cri-socket argument to kubeadm init. See Installing a runtime.
- (Optional) Unless otherwise specified, kubeadm uses the network interface associated with the default gateway to set the advertise address for this particular control-plane node's API server. To use a different network interface, specify the --apiserver-advertise-address=<ip-address> argument to kubeadm init. To deploy an IPv6 Kubernetes cluster using IPv6 addressing, you must specify an IPv6 address, for example --apiserver-advertise-address=fd00::101
To initialize the control-plane node run:
kubeadm init <args>
Considerations about apiserver-advertise-address and ControlPlaneEndpoint
While --apiserver-advertise-address
can be used to set the advertise address for this particular
control-plane node's API server, --control-plane-endpoint
can be used to set the shared endpoint
for all control-plane nodes.
--control-plane-endpoint
allows both IP addresses and DNS names that can map to IP addresses.
Please contact your network administrator to evaluate possible solutions with respect to such mapping.
Here is an example mapping:
192.168.0.102 cluster-endpoint
Where 192.168.0.102
is the IP address of this node and cluster-endpoint
is a custom DNS name that maps to this IP.
This will allow you to pass --control-plane-endpoint=cluster-endpoint to kubeadm init and pass the same DNS name to kubeadm join. Later you can modify cluster-endpoint to point to the address of your load-balancer in a high availability scenario.
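Putting this together, a hedged end-to-end sketch reusing the example mapping above:
# /etc/hosts entry on the node (illustrative IP and name)
192.168.0.102 cluster-endpoint
# Initialize the control plane against the stable name
sudo kubeadm init --control-plane-endpoint=cluster-endpoint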
Turning a single control plane cluster created without --control-plane-endpoint
into a highly available cluster
is not supported by kubeadm.
More information
For more information about kubeadm init
arguments, see the kubeadm reference guide.
To configure kubeadm init
with a configuration file see
Using kubeadm init with a configuration file.
To customize control plane components, including optional IPv6 assignment to liveness probe for control plane components and etcd server, provide extra arguments to each component as documented in custom arguments.
To run kubeadm init
again, you must first tear down the cluster.
If you join a node with a different architecture to your cluster, make sure that your deployed DaemonSets have container image support for this architecture.
kubeadm init
first runs a series of prechecks to ensure that the machine
is ready to run Kubernetes. These prechecks expose warnings and exit on errors. kubeadm init
then downloads and installs the cluster control plane components. This may take several minutes.
After it finishes you should see:
Your Kubernetes control-plane has initialized successfully!
To start using your cluster, you need to run the following as a regular user:
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
You should now deploy a Pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
/docs/concepts/cluster-administration/addons/
You can now join any number of machines by running the following on each node
as root:
kubeadm join <control-plane-host>:<control-plane-port> --token <token> --discovery-token-ca-cert-hash sha256:<hash>
To make kubectl work for your non-root user, run these commands, which are
also part of the kubeadm init
output:
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
Alternatively, if you are the root
user, you can run:
export KUBECONFIG=/etc/kubernetes/admin.conf
kubeadm signs the certificate in the admin.conf to have Subject: O = system:masters, CN = kubernetes-admin. system:masters is a break-glass, super user group that bypasses the authorization layer (e.g. RBAC).
Do not share the admin.conf
file with anyone and instead grant users custom permissions by generating
them a kubeconfig file using the kubeadm kubeconfig user
command. For more details see
Generating kubeconfig files for additional users.
Make a record of the kubeadm join
command that kubeadm init
outputs. You
need this command to join nodes to your cluster.
The token is used for mutual authentication between the control-plane node and the joining
nodes. The token included here is secret. Keep it safe, because anyone with this
token can add authenticated nodes to your cluster. These tokens can be listed,
created, and deleted with the kubeadm token
command. See the
kubeadm reference guide.
Installing a Pod network add-on
This section contains important information about networking setup and deployment order. Read all of this advice carefully before proceeding.
You must deploy a Container Network Interface (CNI) based Pod network add-on so that your Pods can communicate with each other. Cluster DNS (CoreDNS) will not start up before a network is installed.
- Take care that your Pod network must not overlap with any of the host networks: you are likely to see problems if there is any overlap. (If you find a collision between your network plugin's preferred Pod network and some of your host networks, you should think of a suitable CIDR block to use instead, then use that during kubeadm init with --pod-network-cidr and as a replacement in your network plugin's YAML.)
- By default, kubeadm sets up your cluster to use and enforce use of RBAC (role based access control). Make sure that your Pod network plugin supports RBAC, and so do any manifests that you use to deploy it.
- If you want to use IPv6--either dual-stack, or single-stack IPv6 only networking--for your cluster, make sure that your Pod network plugin supports IPv6. IPv6 support was added to CNI in v0.6.0.
Several external projects provide Kubernetes Pod networks using CNI, some of which also support Network Policy.
See a list of add-ons that implement the Kubernetes networking model.
You can install a Pod network add-on with the following command on the control-plane node or a node that has the kubeconfig credentials:
kubectl apply -f <add-on.yaml>
You can install only one Pod network per cluster.
Once a Pod network has been installed, you can confirm that it is working by
checking that the CoreDNS Pod is Running
in the output of kubectl get pods --all-namespaces
.
And once the CoreDNS Pod is up and running, you can continue by joining your nodes.
If your network is not working or CoreDNS is not in the Running state, check out the troubleshooting guide for kubeadm.
Managed node labels
By default, kubeadm enables the NodeRestriction
admission controller that restricts what labels can be self-applied by kubelets on node registration.
The admission controller documentation covers what labels are permitted to be used with the kubelet --node-labels
option.
The node-role.kubernetes.io/control-plane label is such a restricted label and kubeadm manually applies it using a privileged client after a node has been created. You can also apply it manually by using kubectl label, making sure to use a privileged kubeconfig such as the kubeadm managed /etc/kubernetes/admin.conf; a hedged example follows.
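A hedged sketch (the node name is a placeholder):
kubectl --kubeconfig /etc/kubernetes/admin.conf \
  label node <node-name> node-role.kubernetes.io/control-plane=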
Control plane node isolation
By default, your cluster will not schedule Pods on the control-plane node for security reasons. If you want to be able to schedule Pods on the control-plane node, for example for a single-machine Kubernetes cluster for development, run:
kubectl taint nodes --all node-role.kubernetes.io/master-
With output looking something like:
node "test-01" untainted
taint "node-role.kubernetes.io/master:" not found
taint "node-role.kubernetes.io/master:" not found
This will remove the node-role.kubernetes.io/master
taint from any nodes that
have it, including the control-plane node, meaning that the scheduler will then be able
to schedule Pods everywhere.
Joining your nodes
The nodes are where your workloads (containers and Pods, etc) run. To add new nodes to your cluster do the following for each machine:
- SSH to the machine
- Become root (e.g. sudo su -)
- Install a runtime if needed
- Run the command that was output by kubeadm init. For example:
  kubeadm join --token <token> <control-plane-host>:<control-plane-port> --discovery-token-ca-cert-hash sha256:<hash>
If you do not have the token, you can get it by running the following command on the control-plane node:
kubeadm token list
The output is similar to this:
TOKEN TTL EXPIRES USAGES DESCRIPTION EXTRA GROUPS
8ewj1p.9r9hcjoqgajrj4gi 23h 2018-06-12T02:51:28Z authentication, The default bootstrap system:
signing token generated by bootstrappers:
'kubeadm init'. kubeadm:
default-node-token
By default, tokens expire after 24 hours. If you are joining a node to the cluster after the current token has expired, you can create a new token by running the following command on the control-plane node:
kubeadm token create
The output is similar to this:
5didvk.d09sbcov8ph2amjw
If you don't have the value of --discovery-token-ca-cert-hash
, you can get it by running the following command chain on the control-plane node:
openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt | openssl rsa -pubin -outform der 2>/dev/null | \
openssl dgst -sha256 -hex | sed 's/^.* //'
The output is similar to:
8cb2de97839780a412b93877f8507ad6c94f73add17d5d7058e91741c9d5ec78
To specify an IPv6 tuple for <control-plane-host>:<control-plane-port>, the IPv6 address must be enclosed in square brackets, for example: [fd00::101]:2073.
The output should look something like:
[preflight] Running pre-flight checks
... (log output of join workflow) ...
Node join complete:
* Certificate signing request sent to control-plane and response
received.
* Kubelet informed of new secure connection details.
Run 'kubectl get nodes' on control-plane to see this machine join.
A few seconds later, you should notice this node in the output from kubectl get nodes
when run on the control-plane node.
As the cluster nodes are usually initialized sequentially, the CoreDNS Pods are likely to all run on the first control-plane node. To provide higher availability, rebalance them with kubectl -n kube-system rollout restart deployment coredns after at least one new node is joined.
(Optional) Controlling your cluster from machines other than the control-plane node
In order to get a kubectl on some other computer (e.g. laptop) to talk to your cluster, you need to copy the administrator kubeconfig file from your control-plane node to your workstation like this:
scp root@<control-plane-host>:/etc/kubernetes/admin.conf .
kubectl --kubeconfig ./admin.conf get nodes
The example above assumes SSH access is enabled for root. If that is not the
case, you can copy the admin.conf
file to be accessible by some other user
and scp
using that other user instead.
The admin.conf
file gives the user superuser privileges over the cluster.
This file should be used sparingly. For normal users, it's recommended to
generate a unique credential to which you grant privileges. You can do
this with the kubeadm alpha kubeconfig user --client-name <CN>
command. That command will print out a KubeConfig file to STDOUT which you
should save to a file and distribute to your user. After that, grant
privileges by using kubectl create (cluster)rolebinding
.
(Optional) Proxying API Server to localhost
If you want to connect to the API Server from outside the cluster you can use
kubectl proxy
:
scp root@<control-plane-host>:/etc/kubernetes/admin.conf .
kubectl --kubeconfig ./admin.conf proxy
You can now access the API Server locally at http://localhost:8001/api/v1
Clean up
If you used disposable servers for your cluster, for testing, you can
switch those off and do no further clean up. You can use
kubectl config delete-cluster
to delete your local references to the
cluster.
However, if you want to deprovision your cluster more cleanly, you should first drain the node and make sure that the node is empty, then deconfigure the node.
Remove the node
Talking to the control-plane node with the appropriate credentials, run:
kubectl drain <node name> --delete-emptydir-data --force --ignore-daemonsets
Before removing the node, reset the state installed by kubeadm
:
kubeadm reset
The reset process does not reset or clean up iptables rules or IPVS tables. If you wish to reset iptables, you must do so manually:
iptables -F && iptables -t nat -F && iptables -t mangle -F && iptables -X
If you want to reset the IPVS tables, you must run the following command:
ipvsadm -C
Now remove the node:
kubectl delete node <node name>
If you wish to start over, run kubeadm init
or kubeadm join
with the
appropriate arguments.
Clean up the control plane
You can use kubeadm reset
on the control plane host to trigger a best-effort
clean up.
See the kubeadm reset
reference documentation for more information about this subcommand and its
options.
What's next
- Verify that your cluster is running properly with Sonobuoy
- See Upgrading kubeadm clusters
for details about upgrading your cluster using
kubeadm
. - Learn about advanced
kubeadm
usage in the kubeadm reference documentation - Learn more about Kubernetes concepts and
kubectl
. - See the Cluster Networking page for a bigger list of Pod network add-ons.
- See the list of add-ons to explore other add-ons, including tools for logging, monitoring, network policy, visualization & control of your Kubernetes cluster.
- Configure how your cluster handles logs for cluster events and from applications running in Pods. See Logging Architecture for an overview of what is involved.
Feedback
- For bugs, visit the kubeadm GitHub issue tracker
- For support, visit the #kubeadm Slack channel
- General SIG Cluster Lifecycle development Slack channel: #sig-cluster-lifecycle
- SIG Cluster Lifecycle SIG information
- SIG Cluster Lifecycle mailing list: kubernetes-sig-cluster-lifecycle
Version skew policy
While kubeadm allows version skew against some components that it manages, it is recommended that you match the kubeadm version with the versions of the control plane components, kube-proxy and kubelet.
kubeadm's skew against the Kubernetes version
kubeadm can be used with Kubernetes components that are the same version as kubeadm
or one version older. The Kubernetes version can be specified to kubeadm by using the
--kubernetes-version
flag of kubeadm init
or the
ClusterConfiguration.kubernetesVersion
field when using --config
. This option will control the versions
of kube-apiserver, kube-controller-manager, kube-scheduler and kube-proxy.
Example:
- kubeadm is at 1.23
- kubernetesVersion must be at 1.23 or 1.22
kubeadm's skew against the kubelet
Similarly to the Kubernetes version, kubeadm can be used with a kubelet version that is the same version as kubeadm or one version older.
Example:
- kubeadm is at 1.23
- kubelet on the host must be at 1.23 or 1.22
kubeadm's skew against kubeadm
There are certain limitations on how kubeadm commands can operate on existing nodes or whole clusters managed by kubeadm.
If new nodes are joined to the cluster, the kubeadm binary used for kubeadm join
must match
the last version of kubeadm used to either create the cluster with kubeadm init
or to upgrade
the same node with kubeadm upgrade
. Similar rules apply to the rest of the kubeadm commands
with the exception of kubeadm upgrade
.
Example for kubeadm join
:
- kubeadm version 1.23 was used to create a cluster with
kubeadm init
- Joining nodes must use a kubeadm binary that is at version 1.23
Nodes that are being upgraded must use a version of kubeadm that is the same MINOR version or one MINOR version newer than the version of kubeadm used for managing the node.
Example for kubeadm upgrade
:
- kubeadm version 1.22 was used to create or upgrade the node
- The version of kubeadm used for upgrading the node must be at 1.22 or 1.23
To learn more about the version skew between the different Kubernetes component see the Version Skew Policy.
Limitations
Cluster resilience
The cluster created here has a single control-plane node, with a single etcd database running on it. This means that if the control-plane node fails, your cluster may lose data and may need to be recreated from scratch.
Workarounds:
- Regularly back up etcd. The etcd data directory configured by kubeadm is at /var/lib/etcd on the control-plane node.
- Use multiple control-plane nodes. You can read Options for Highly Available topology to pick a cluster topology that provides high-availability.
Platform compatibility
kubeadm deb/rpm packages and binaries are built for amd64, arm (32-bit), arm64, ppc64le, and s390x following the multi-platform proposal.
Multiplatform container images for the control plane and addons are also supported since v1.12.
Only some of the network providers offer solutions for all platforms. Please consult the list of network providers above or the documentation from each provider to figure out whether the provider supports your chosen platform.
Troubleshooting
If you are running into difficulties with kubeadm, please consult our troubleshooting docs.
2.2.2.1.4 - Customizing components with the kubeadm API
This page covers how to customize the components that kubeadm deploys. For control plane components
you can use flags in the ClusterConfiguration
structure or patches per-node. For the kubelet
and kube-proxy you can use KubeletConfiguration
and KubeProxyConfiguration
, accordingly.
All of these options are possible via the kubeadm configuration API. For more details on each field in the configuration you can navigate to our API reference pages.
To customize the CoreDNS deployment you can edit the kube-system/coredns ConfigMap and recreate the CoreDNS Pods after that. Alternatively,
you can skip the default CoreDNS deployment and deploy your own variant.
For more details on that see Using init phases with kubeadm.
Kubernetes v1.12 [stable]
Customizing the control plane with flags in ClusterConfiguration
The kubeadm ClusterConfiguration
object exposes a way for users to override the default
flags passed to control plane components such as the APIServer, ControllerManager, Scheduler and Etcd.
The components are defined using the following structures:
- apiServer
- controllerManager
- scheduler
- etcd
These structures contain a common extraArgs field that consists of key: value pairs.
To override a flag for a control plane component:
- Add the appropriate extraArgs to your configuration.
- Add flags to the extraArgs field.
- Run kubeadm init with --config <YOUR CONFIG YAML>.
You can generate a ClusterConfiguration object with default values by running kubeadm config print init-defaults and saving the output to a file of your choice.
The ClusterConfiguration object is currently global in kubeadm clusters. This means that any flags that you add will apply to all instances of the same component on different nodes. To apply individual configuration per component on different nodes you can use patches.
Passing the same flag, such as --foo, multiple times is currently not supported. To work around that you must use patches.
APIServer flags
For details, see the reference documentation for kube-apiserver.
Example usage:
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.16.0
apiServer:
extraArgs:
anonymous-auth: "false"
enable-admission-plugins: AlwaysPullImages,DefaultStorageClass
audit-log-path: /home/johndoe/audit.log
ControllerManager flags
For details, see the reference documentation for kube-controller-manager.
Example usage:
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.16.0
controllerManager:
extraArgs:
cluster-signing-key-file: /home/johndoe/keys/ca.key
deployment-controller-sync-period: "50"
Scheduler flags
For details, see the reference documentation for kube-scheduler.
Example usage:
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.16.0
scheduler:
extraArgs:
config: /etc/kubernetes/scheduler-config.yaml
extraVolumes:
- name: schedulerconfig
hostPath: /home/johndoe/schedconfig.yaml
mountPath: /etc/kubernetes/scheduler-config.yaml
readOnly: true
pathType: "File"
Etcd flags
For details, see the etcd server documentation.
Example usage:
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
etcd:
local:
extraArgs:
election-timeout: 1000
Customizing the control plane with patches
Kubernetes v1.22 [beta]
Kubeadm allows you to pass a directory with patch files to InitConfiguration
and JoinConfiguration
on individual nodes. These patches can be used as the last customization step before the control
plane component manifests are written to disk.
You can pass this file to kubeadm init
with --config <YOUR CONFIG YAML>
:
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
patches:
directory: /home/user/somedir
For kubeadm init you can pass a file containing both a ClusterConfiguration and an InitConfiguration separated by ---.
You can pass this file to kubeadm join
with --config <YOUR CONFIG YAML>
:
apiVersion: kubeadm.k8s.io/v1beta3
kind: JoinConfiguration
patches:
directory: /home/user/somedir
The directory must contain files named target[suffix][+patchtype].extension
.
For example, kube-apiserver0+merge.yaml
or just etcd.json
.
- target can be one of kube-apiserver, kube-controller-manager, kube-scheduler and etcd.
- patchtype can be one of strategic, merge or json and these must match the patching formats supported by kubectl. The default patchtype is strategic.
- extension must be either json or yaml.
- suffix is an optional string that can be used to determine which patches are applied first alpha-numerically.
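As a hedged illustration (the file name and the patched field are hypothetical), a merge patch for the kube-apiserver static Pod manifest might look like:
# /home/user/somedir/kube-apiserver0+merge.yaml
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
spec:
  # Illustrative field to merge into the generated manifest
  dnsPolicy: ClusterFirstWithHostNet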
If you use kubeadm upgrade to upgrade your kubeadm nodes you must again provide the same patches, so that the customization is preserved after upgrade. To do that you can use the --patches flag, which must point to the same directory. kubeadm upgrade currently does not support a configuration API structure that can be used for the same purpose.
Customizing the kubelet
To customize the kubelet you can add a KubeletConfiguration
next to the ClusterConfiguration
or
InitConfiguration
separated by ---
within the same configuration file. This file can then be passed to kubeadm init
.
kubeadm applies the same KubeletConfiguration to all nodes in the cluster. To apply node specific settings you can use kubelet flags as overrides by passing them in the nodeRegistration.kubeletExtraArgs field supported by both InitConfiguration and JoinConfiguration. Some kubelet flags are deprecated, so check their status in the kubelet reference documentation before using them.
For more details see Configuring each kubelet in your cluster using kubeadm
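A minimal hedged sketch of such a combined file (the serverTLSBootstrap setting is illustrative):
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Illustrative: have the kubelet request a signed serving certificate
serverTLSBootstrap: true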
Customizing kube-proxy
To customize kube-proxy you can pass a KubeProxyConfiguration
next your ClusterConfiguration
or
InitConfiguration
to kubeadm init
separated by ---
.
For more details you can navigate to our API reference pages.
The KubeProxyConfiguration would apply to all instances of kube-proxy in the cluster.
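For example, a hedged sketch that switches kube-proxy to IPVS mode cluster-wide:
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
---
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: ipvs   # illustrative; the default mode is iptables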
2.2.2.1.5 - Options for Highly Available Topology
This page explains the two options for configuring the topology of your highly available (HA) Kubernetes clusters.
You can set up an HA cluster:
- With stacked control plane nodes, where etcd nodes are colocated with control plane nodes
- With external etcd nodes, where etcd runs on separate nodes from the control plane
You should carefully consider the advantages and disadvantages of each topology before setting up an HA cluster.
Stacked etcd topology
A stacked HA cluster is a topology where the distributed data storage cluster provided by etcd is stacked on top of the cluster formed by the nodes managed by kubeadm that run control plane components.
Each control plane node runs an instance of the kube-apiserver
, kube-scheduler
, and kube-controller-manager
.
The kube-apiserver
is exposed to worker nodes using a load balancer.
Each control plane node creates a local etcd member and this etcd member communicates only with
the kube-apiserver
of this node. The same applies to the local kube-controller-manager
and kube-scheduler
instances.
This topology couples the control planes and etcd members on the same nodes. It is simpler to set up than a cluster with external etcd nodes, and simpler to manage for replication.
However, a stacked cluster runs the risk of failed coupling. If one node goes down, both an etcd member and a control plane instance are lost, and redundancy is compromised. You can mitigate this risk by adding more control plane nodes.
You should therefore run a minimum of three stacked control plane nodes for an HA cluster.
This is the default topology in kubeadm. A local etcd member is created automatically
on control plane nodes when using kubeadm init
and kubeadm join --control-plane
.
External etcd topology
An HA cluster with external etcd is a topology where the distributed data storage cluster provided by etcd is external to the cluster formed by the nodes that run control plane components.
Like the stacked etcd topology, each control plane node in an external etcd topology runs an instance of the kube-apiserver
, kube-scheduler
, and kube-controller-manager
. And the kube-apiserver
is exposed to worker nodes using a load balancer. However, etcd members run on separate hosts, and each etcd host communicates with the kube-apiserver
of each control plane node.
This topology decouples the control plane and etcd member. It therefore provides an HA setup where losing a control plane instance or an etcd member has less impact and does not affect the cluster redundancy as much as the stacked HA topology.
However, this topology requires twice the number of hosts as the stacked HA topology. A minimum of three hosts for control plane nodes and three hosts for etcd nodes are required for an HA cluster with this topology.
What's next
2.2.2.1.6 - Creating Highly Available Clusters with kubeadm
This page explains two different approaches to setting up a highly available Kubernetes cluster using kubeadm:
- With stacked control plane nodes. This approach requires less infrastructure. The etcd members and control plane nodes are co-located.
- With an external etcd cluster. This approach requires more infrastructure. The control plane nodes and etcd members are separated.
Before proceeding, you should carefully consider which approach best meets the needs of your applications and environment. Options for Highly Available topology outlines the advantages and disadvantages of each.
If you encounter issues with setting up the HA cluster, please report these in the kubeadm issue tracker.
See also the upgrade documentation.
Before you begin
The prerequisites depend on which topology you have selected for your cluster's control plane:
You need:
- Three or more machines that meet kubeadm's minimum requirements for
the control-plane nodes. Having an odd number of control plane nodes can help
with leader selection in the case of machine or zone failure.
- including a container runtime, already set up and working
- Three or more machines that meet kubeadm's minimum
requirements for the workers
- including a container runtime, already set up and working
- Full network connectivity between all machines in the cluster (public or private network)
- Superuser privileges on all machines using sudo
  - You can use a different tool; this guide uses sudo in the examples.
- SSH access from one device to all nodes in the system
- kubeadm and kubelet already installed on all machines.
See Stacked etcd topology for context.
You need:
- Three or more machines that meet kubeadm's minimum requirements for
the control-plane nodes. Having an odd number of control plane nodes can help
with leader selection in the case of machine or zone failure.
- including a container runtime, already set up and working
- Three or more machines that meet kubeadm's minimum
requirements for the workers
- including a container runtime, already set up and working
- Full network connectivity between all machines in the cluster (public or private network)
- Superuser privileges on all machines using sudo
  - You can use a different tool; this guide uses sudo in the examples.
- SSH access from one device to all nodes in the system
- kubeadm and kubelet already installed on all machines.
And you also need:
- Three or more additional machines, that will become etcd cluster members. Having an odd number of members in the etcd cluster is a requirement for achieving optimal voting quorum.
  - These machines again need to have kubeadm and kubelet installed.
  - These machines also require a container runtime, that is already set up and working.
See External etcd topology for context.
Container images
Each host should have access to read and fetch images from the Kubernetes container image registry, k8s.gcr.io.
If you want to deploy a highly-available cluster where the hosts do not have access to pull images, this is possible. You must ensure by some other means that the correct container images are already available on the relevant hosts.
Command line interface
To manage Kubernetes once your cluster is set up, you should
install kubectl on your PC. It is also useful
to install the kubectl
tool on each control plane node, as this can be
helpful for troubleshooting.
First steps for both methods
Create load balancer for kube-apiserver
- Create a kube-apiserver load balancer with a name that resolves to DNS.

  - In a cloud environment you should place your control plane nodes behind a TCP forwarding load balancer. This load balancer distributes traffic to all healthy control plane nodes in its target list. The health check for an apiserver is a TCP check on the port the kube-apiserver listens on (default value :6443).
  - It is not recommended to use an IP address directly in a cloud environment.
  - The load balancer must be able to communicate with all control plane nodes on the apiserver port. It must also allow incoming traffic on its listening port.
  - Make sure the address of the load balancer always matches the address of kubeadm's ControlPlaneEndpoint.
  - Read the Options for Software Load Balancing guide for more details.

- Add the first control plane node to the load balancer, and test the connection:

  nc -v <LOAD_BALANCER_IP> <PORT>
A connection refused error is expected because the API server is not yet running. A timeout, however, means the load balancer cannot communicate with the control plane node. If a timeout occurs, reconfigure the load balancer to communicate with the control plane node.
- Add the remaining control plane nodes to the load balancer target group.
Stacked control plane and etcd nodes
Steps for the first control plane node
- Initialize the control plane:

  sudo kubeadm init --control-plane-endpoint "LOAD_BALANCER_DNS:LOAD_BALANCER_PORT" --upload-certs

  - You can use the --kubernetes-version flag to set the Kubernetes version to use. It is recommended that the versions of kubeadm, kubelet, kubectl and Kubernetes match.
  - The --control-plane-endpoint flag should be set to the address or DNS and port of the load balancer.
  - The --upload-certs flag is used to upload the certificates that should be shared across all the control-plane instances to the cluster. If instead you prefer to copy certs across control-plane nodes manually or using automation tools, please remove this flag and refer to the Manual certificate distribution section below.

  Note: The kubeadm init flags --config and --certificate-key cannot be mixed, therefore if you want to use the kubeadm configuration you must add the certificateKey field in the appropriate config locations (under InitConfiguration and JoinConfiguration: controlPlane).

  Note: Some CNI network plugins require additional configuration, for example specifying the pod IP CIDR, while others do not. See the CNI network documentation. To add a pod CIDR pass the flag --pod-network-cidr, or if you are using a kubeadm configuration file set the podSubnet field under the networking object of ClusterConfiguration.

  The output looks similar to:

  ...
  You can now join any number of control-plane nodes by running the following command on each as root:

    kubeadm join 192.168.0.200:6443 --token 9vr73a.a8uxyaju799qwdjv --discovery-token-ca-cert-hash sha256:7c2e69131a36ae2a042a339b33381c6d0d43887e2de83720eff5359e26aec866 --control-plane --certificate-key f8902e114ef118304e561c3ecd4d0b543adc226b7a07f675f56564185ffe0c07

  Please note that the certificate-key gives access to cluster sensitive data, keep it secret!
  As a safeguard, uploaded-certs will be deleted in two hours; if necessary, you can use
  kubeadm init phase upload-certs to reload certs afterward.

  Then you can join any number of worker nodes by running the following on each as root:

    kubeadm join 192.168.0.200:6443 --token 9vr73a.a8uxyaju799qwdjv --discovery-token-ca-cert-hash sha256:7c2e69131a36ae2a042a339b33381c6d0d43887e2de83720eff5359e26aec866
- Copy this output to a text file. You will need it later to join control plane and worker nodes to the cluster.

- When --upload-certs is used with kubeadm init, the certificates of the primary control plane are encrypted and uploaded in the kubeadm-certs Secret.

- To re-upload the certificates and generate a new decryption key, use the following command on a control plane node that is already joined to the cluster:

  sudo kubeadm init phase upload-certs --upload-certs

- You can also specify a custom --certificate-key during init that can later be used by join. To generate such a key you can use the following command:

  kubeadm certs certificate-key

  Note: The kubeadm-certs Secret and decryption key expire after two hours.

  Caution: As stated in the command output, the certificate key gives access to cluster sensitive data, keep it secret!
- Apply the CNI plugin of your choice: Follow these instructions to install the CNI provider. Make sure the configuration corresponds to the Pod CIDR specified in the kubeadm configuration file (if applicable).

  Note: You must pick a network plugin that suits your use case and deploy it before you move on to the next step. If you don't do this, you will not be able to launch your cluster properly.

- Type the following and watch the pods of the control plane components get started:

  kubectl get pod -n kube-system -w
Steps for the rest of the control plane nodes
For each additional control plane node you should:
- Execute the join command that was previously given to you by the kubeadm init output on the first node. It should look something like this:

  sudo kubeadm join 192.168.0.200:6443 --token 9vr73a.a8uxyaju799qwdjv --discovery-token-ca-cert-hash sha256:7c2e69131a36ae2a042a339b33381c6d0d43887e2de83720eff5359e26aec866 --control-plane --certificate-key f8902e114ef118304e561c3ecd4d0b543adc226b7a07f675f56564185ffe0c07

  - The --control-plane flag tells kubeadm join to create a new control plane.
  - The --certificate-key ... will cause the control plane certificates to be downloaded from the kubeadm-certs Secret in the cluster and be decrypted using the given key.
You can join multiple control-plane nodes in parallel.
External etcd nodes
Setting up a cluster with external etcd nodes is similar to the procedure used for stacked etcd, with the exception that you should set up etcd first, and you should pass the etcd information in the kubeadm config file.
Set up the etcd cluster
- Follow these instructions to set up the etcd cluster.

- Set up SSH as described here.

- Copy the following files from any etcd node in the cluster to the first control plane node:

  export CONTROL_PLANE="ubuntu@10.0.0.7"
  scp /etc/kubernetes/pki/etcd/ca.crt "${CONTROL_PLANE}":
  scp /etc/kubernetes/pki/apiserver-etcd-client.crt "${CONTROL_PLANE}":
  scp /etc/kubernetes/pki/apiserver-etcd-client.key "${CONTROL_PLANE}":

  - Replace the value of CONTROL_PLANE with the user@host of the first control-plane node.
Set up the first control plane node
- Create a file called kubeadm-config.yaml with the following contents:

  ---
  apiVersion: kubeadm.k8s.io/v1beta3
  kind: ClusterConfiguration
  kubernetesVersion: stable
  controlPlaneEndpoint: "LOAD_BALANCER_DNS:LOAD_BALANCER_PORT" # change this (see below)
  etcd:
    external:
      endpoints:
        - https://ETCD_0_IP:2379 # change ETCD_0_IP appropriately
        - https://ETCD_1_IP:2379 # change ETCD_1_IP appropriately
        - https://ETCD_2_IP:2379 # change ETCD_2_IP appropriately
      caFile: /etc/kubernetes/pki/etcd/ca.crt
      certFile: /etc/kubernetes/pki/apiserver-etcd-client.crt
      keyFile: /etc/kubernetes/pki/apiserver-etcd-client.key

  Note: The difference between stacked etcd and external etcd here is that the external etcd setup requires a configuration file with the etcd endpoints under the external object for etcd. In the case of the stacked etcd topology, this is managed automatically.

- Replace the following variables in the config template with the appropriate values for your cluster (one way to do this is sketched after the list):

  - LOAD_BALANCER_DNS
  - LOAD_BALANCER_PORT
  - ETCD_0_IP
  - ETCD_1_IP
  - ETCD_2_IP
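As a sketch, one way to do the substitution is with sed; the replacement values below are assumptions for illustration only:

# Hypothetical values; adjust for your environment
sed -i 's/LOAD_BALANCER_DNS:LOAD_BALANCER_PORT/lb.example.com:6443/' kubeadm-config.yaml
sed -i 's/ETCD_0_IP/10.0.0.6/; s/ETCD_1_IP/10.0.0.7/; s/ETCD_2_IP/10.0.0.8/' kubeadm-config.yaml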
The following steps are similar to the stacked etcd setup:

- Run sudo kubeadm init --config kubeadm-config.yaml --upload-certs on this node.

- Write the output join commands that are returned to a text file for later use.

- Apply the CNI plugin of your choice.

  Note: You must pick a network plugin that suits your use case and deploy it before you move on to the next step. If you don't do this, you will not be able to launch your cluster properly.
Steps for the rest of the control plane nodes
The steps are the same as for the stacked etcd setup:
- Make sure the first control plane node is fully initialized.
- Join each control plane node with the join command you saved to a text file. It's recommended to join the control plane nodes one at a time.
- Don't forget that the decryption key from --certificate-key expires after two hours, by default.
Common tasks after bootstrapping control plane
Install workers
Worker nodes can be joined to the cluster with the command you stored previously
as the output from the kubeadm init
command:
sudo kubeadm join 192.168.0.200:6443 --token 9vr73a.a8uxyaju799qwdjv --discovery-token-ca-cert-hash sha256:7c2e69131a36ae2a042a339b33381c6d0d43887e2de83720eff5359e26aec866
Manual certificate distribution
If you choose not to use kubeadm init with the --upload-certs flag, this means that you are going to have to manually copy the certificates from the primary control plane node to the joining control plane nodes.

There are many ways to do this. The following example uses ssh and scp:
SSH is required if you want to control all nodes from a single machine.
- Enable ssh-agent on your main device that has access to all other nodes in the system:

  eval $(ssh-agent)

- Add your SSH identity to the session:

  ssh-add ~/.ssh/path_to_private_key

- SSH between nodes to check that the connection is working correctly.

  - When you SSH to any node, add the -A flag. This flag allows the node that you have logged into via SSH to access the SSH agent on your PC. Consider alternative methods if you do not fully trust the security of your user session on the node.

    ssh -A 10.0.0.7

  - When using sudo on any node, make sure to preserve the environment so SSH forwarding works:

    sudo -E -s
- After configuring SSH on all the nodes you should run the following script on the first control plane node after running kubeadm init. This script will copy the certificates from the first control plane node to the other control plane nodes.

  In the following example, replace CONTROL_PLANE_IPS with the IP addresses of the other control plane nodes.

  USER=ubuntu # customizable
  CONTROL_PLANE_IPS="10.0.0.7 10.0.0.8"
  for host in ${CONTROL_PLANE_IPS}; do
      scp /etc/kubernetes/pki/ca.crt "${USER}"@$host:
      scp /etc/kubernetes/pki/ca.key "${USER}"@$host:
      scp /etc/kubernetes/pki/sa.key "${USER}"@$host:
      scp /etc/kubernetes/pki/sa.pub "${USER}"@$host:
      scp /etc/kubernetes/pki/front-proxy-ca.crt "${USER}"@$host:
      scp /etc/kubernetes/pki/front-proxy-ca.key "${USER}"@$host:
      scp /etc/kubernetes/pki/etcd/ca.crt "${USER}"@$host:etcd-ca.crt
      # Skip the next line if you are using external etcd
      scp /etc/kubernetes/pki/etcd/ca.key "${USER}"@$host:etcd-ca.key
  done

  Caution: Copy only the certificates in the above list. kubeadm will take care of generating the rest of the certificates with the required SANs for the joining control-plane instances. If you copy all the certificates by mistake, the creation of additional nodes could fail due to a lack of required SANs.
- Then on each joining control plane node you have to run the following script before running kubeadm join. This script will move the previously copied certificates from the home directory to /etc/kubernetes/pki:

  USER=ubuntu # customizable
  mkdir -p /etc/kubernetes/pki/etcd
  mv /home/${USER}/ca.crt /etc/kubernetes/pki/
  mv /home/${USER}/ca.key /etc/kubernetes/pki/
  mv /home/${USER}/sa.pub /etc/kubernetes/pki/
  mv /home/${USER}/sa.key /etc/kubernetes/pki/
  mv /home/${USER}/front-proxy-ca.crt /etc/kubernetes/pki/
  mv /home/${USER}/front-proxy-ca.key /etc/kubernetes/pki/
  mv /home/${USER}/etcd-ca.crt /etc/kubernetes/pki/etcd/ca.crt
  # Skip the next line if you are using external etcd
  mv /home/${USER}/etcd-ca.key /etc/kubernetes/pki/etcd/ca.key
2.2.2.1.7 - Set up a High Availability etcd Cluster with kubeadm
By default, kubeadm runs a local etcd instance on each control plane node. It is also possible to treat the etcd cluster as external and provision etcd instances on separate hosts. The differences between the two approaches are covered in the Options for Highly Available topology page.
This task walks through the process of creating a high availability external etcd cluster of three members that can be used by kubeadm during cluster creation.
Before you begin
- Three hosts that can talk to each other over TCP ports 2379 and 2380. This document assumes these default ports. However, they are configurable through the kubeadm config file.
- Each host must have systemd and a bash compatible shell installed.
- Each host must have a container runtime, kubelet, and kubeadm installed.
- Each host should have access to the Kubernetes container image registry (k8s.gcr.io), or be able to list/pull the required etcd image using kubeadm config images list/pull. This guide will set up etcd instances as static pods managed by a kubelet.
- Some infrastructure to copy files between hosts. For example, ssh and scp can satisfy this requirement.
Setting up the cluster
The general approach is to generate all certs on one node and only distribute the necessary files to the other nodes.
1. Configure the kubelet to be a service manager for etcd.

Note: You must do this on every host where etcd should be running.

Since etcd was created first, you must override the service priority by creating a new unit file that has higher precedence than the kubeadm-provided kubelet unit file.

cat << EOF > /etc/systemd/system/kubelet.service.d/20-etcd-service-manager.conf
[Service]
ExecStart=
# Replace "systemd" with the cgroup driver of your container runtime. The default value in the kubelet is "cgroupfs".
# Replace the value of "--container-runtime-endpoint" for a different container runtime if needed.
ExecStart=/usr/bin/kubelet --address=127.0.0.1 --pod-manifest-path=/etc/kubernetes/manifests --cgroup-driver=systemd --container-runtime=remote --container-runtime-endpoint=unix:///var/run/containerd/containerd.sock
Restart=always
EOF

systemctl daemon-reload
systemctl restart kubelet

Check the kubelet status to ensure it is running.

systemctl status kubelet
2. Create configuration files for kubeadm.

Generate one kubeadm configuration file for each host that will have an etcd member running on it using the following script.

# Update HOST0, HOST1 and HOST2 with the IPs of your hosts
export HOST0=10.0.0.6
export HOST1=10.0.0.7
export HOST2=10.0.0.8

# Update NAME0, NAME1 and NAME2 with the hostnames of your hosts
export NAME0="infra0"
export NAME1="infra1"
export NAME2="infra2"

# Create temp directories to store files that will end up on other hosts.
mkdir -p /tmp/${HOST0}/ /tmp/${HOST1}/ /tmp/${HOST2}/

HOSTS=(${HOST0} ${HOST1} ${HOST2})
NAMES=(${NAME0} ${NAME1} ${NAME2})

for i in "${!HOSTS[@]}"; do
HOST=${HOSTS[$i]}
NAME=${NAMES[$i]}
cat << EOF > /tmp/${HOST}/kubeadmcfg.yaml
---
apiVersion: "kubeadm.k8s.io/v1beta3"
kind: InitConfiguration
nodeRegistration:
    name: ${NAME}
localAPIEndpoint:
    advertiseAddress: ${HOST}
---
apiVersion: "kubeadm.k8s.io/v1beta3"
kind: ClusterConfiguration
etcd:
    local:
        serverCertSANs:
        - "${HOST}"
        peerCertSANs:
        - "${HOST}"
        extraArgs:
            initial-cluster: ${NAMES[0]}=https://${HOSTS[0]}:2380,${NAMES[1]}=https://${HOSTS[1]}:2380,${NAMES[2]}=https://${HOSTS[2]}:2380
            initial-cluster-state: new
            name: ${NAME}
            listen-peer-urls: https://${HOST}:2380
            listen-client-urls: https://${HOST}:2379
            advertise-client-urls: https://${HOST}:2379
            initial-advertise-peer-urls: https://${HOST}:2380
EOF
done
3. Generate the certificate authority.

If you already have a CA then the only action required is copying the CA's crt and key file to /etc/kubernetes/pki/etcd/ca.crt and /etc/kubernetes/pki/etcd/ca.key. After those files have been copied, proceed to the next step, "Create certificates for each member".

If you do not already have a CA then run this command on $HOST0 (where you generated the configuration files for kubeadm):

kubeadm init phase certs etcd-ca

This creates two files:

- /etc/kubernetes/pki/etcd/ca.crt
- /etc/kubernetes/pki/etcd/ca.key
4. Create certificates for each member.

kubeadm init phase certs etcd-server --config=/tmp/${HOST2}/kubeadmcfg.yaml
kubeadm init phase certs etcd-peer --config=/tmp/${HOST2}/kubeadmcfg.yaml
kubeadm init phase certs etcd-healthcheck-client --config=/tmp/${HOST2}/kubeadmcfg.yaml
kubeadm init phase certs apiserver-etcd-client --config=/tmp/${HOST2}/kubeadmcfg.yaml
cp -R /etc/kubernetes/pki /tmp/${HOST2}/
# cleanup non-reusable certificates
find /etc/kubernetes/pki -not -name ca.crt -not -name ca.key -type f -delete

kubeadm init phase certs etcd-server --config=/tmp/${HOST1}/kubeadmcfg.yaml
kubeadm init phase certs etcd-peer --config=/tmp/${HOST1}/kubeadmcfg.yaml
kubeadm init phase certs etcd-healthcheck-client --config=/tmp/${HOST1}/kubeadmcfg.yaml
kubeadm init phase certs apiserver-etcd-client --config=/tmp/${HOST1}/kubeadmcfg.yaml
cp -R /etc/kubernetes/pki /tmp/${HOST1}/
find /etc/kubernetes/pki -not -name ca.crt -not -name ca.key -type f -delete

kubeadm init phase certs etcd-server --config=/tmp/${HOST0}/kubeadmcfg.yaml
kubeadm init phase certs etcd-peer --config=/tmp/${HOST0}/kubeadmcfg.yaml
kubeadm init phase certs etcd-healthcheck-client --config=/tmp/${HOST0}/kubeadmcfg.yaml
kubeadm init phase certs apiserver-etcd-client --config=/tmp/${HOST0}/kubeadmcfg.yaml
# No need to move the certs because they are for HOST0

# clean up certs that should not be copied off this host
find /tmp/${HOST2} -name ca.key -type f -delete
find /tmp/${HOST1} -name ca.key -type f -delete
5. Copy certificates and kubeadm configs.

The certificates have been generated and now they must be moved to their respective hosts.

USER=ubuntu
HOST=${HOST1}
scp -r /tmp/${HOST}/* ${USER}@${HOST}:
ssh ${USER}@${HOST}
USER@HOST $ sudo -Es
root@HOST $ chown -R root:root pki
root@HOST $ mv pki /etc/kubernetes/
6. Ensure all expected files exist.

The complete list of required files on $HOST0 is:

/tmp/${HOST0}
└── kubeadmcfg.yaml
---
/etc/kubernetes/pki
├── apiserver-etcd-client.crt
├── apiserver-etcd-client.key
└── etcd
    ├── ca.crt
    ├── ca.key
    ├── healthcheck-client.crt
    ├── healthcheck-client.key
    ├── peer.crt
    ├── peer.key
    ├── server.crt
    └── server.key

On $HOST1:

$HOME
└── kubeadmcfg.yaml
---
/etc/kubernetes/pki
├── apiserver-etcd-client.crt
├── apiserver-etcd-client.key
└── etcd
    ├── ca.crt
    ├── healthcheck-client.crt
    ├── healthcheck-client.key
    ├── peer.crt
    ├── peer.key
    ├── server.crt
    └── server.key

On $HOST2:

$HOME
└── kubeadmcfg.yaml
---
/etc/kubernetes/pki
├── apiserver-etcd-client.crt
├── apiserver-etcd-client.key
└── etcd
    ├── ca.crt
    ├── healthcheck-client.crt
    ├── healthcheck-client.key
    ├── peer.crt
    ├── peer.key
    ├── server.crt
    └── server.key
7. Create the static pod manifests.

Now that the certificates and configs are in place it's time to create the manifests. On each host run the kubeadm command to generate a static manifest for etcd.

root@HOST0 $ kubeadm init phase etcd local --config=/tmp/${HOST0}/kubeadmcfg.yaml
root@HOST1 $ kubeadm init phase etcd local --config=$HOME/kubeadmcfg.yaml
root@HOST2 $ kubeadm init phase etcd local --config=$HOME/kubeadmcfg.yaml
8. Optional: Check the cluster health.

docker run --rm -it \
--net host \
-v /etc/kubernetes:/etc/kubernetes k8s.gcr.io/etcd:${ETCD_TAG} etcdctl \
--cert /etc/kubernetes/pki/etcd/peer.crt \
--key /etc/kubernetes/pki/etcd/peer.key \
--cacert /etc/kubernetes/pki/etcd/ca.crt \
--endpoints https://${HOST0}:2379 endpoint health --cluster
...
https://[HOST0 IP]:2379 is healthy: successfully committed proposal: took = 16.283339ms
https://[HOST1 IP]:2379 is healthy: successfully committed proposal: took = 19.44402ms
https://[HOST2 IP]:2379 is healthy: successfully committed proposal: took = 35.926451ms

- Set ${ETCD_TAG} to the version tag of your etcd image, for example 3.4.3-0. To see the etcd image and tag that kubeadm uses, execute kubeadm config images list --kubernetes-version ${K8S_VERSION}, where ${K8S_VERSION} is for example v1.17.0.
- Set ${HOST0} to the IP address of the host you are testing. A sketch of setting these variables follows this list.
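For example, you might set those variables as follows; the version and address are assumptions for illustration, and the grep/cut pipeline simply extracts the tag from the image name that kubeadm prints:

export K8S_VERSION=v1.17.0
export ETCD_TAG=$(kubeadm config images list --kubernetes-version ${K8S_VERSION} | grep etcd | cut -d ':' -f 2)  # e.g. 3.4.3-0
export HOST0=10.0.0.6  # IP of the etcd member you are probing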
What's next
Once you have a working 3 member etcd cluster, you can continue setting up a highly available control plane using the external etcd method with kubeadm.
2.2.2.1.8 - Configuring each kubelet in your cluster using kubeadm
Kubernetes v1.11 [stable]
The lifecycle of the kubeadm CLI tool is decoupled from the kubelet, which is a daemon that runs on each node within the Kubernetes cluster. The kubeadm CLI tool is executed by the user when Kubernetes is initialized or upgraded, whereas the kubelet is always running in the background.
Since the kubelet is a daemon, it needs to be maintained by some kind of an init system or service manager. When the kubelet is installed using DEBs or RPMs, systemd is configured to manage the kubelet. You can use a different service manager instead, but you need to configure it manually.
Some kubelet configuration details need to be the same across all kubelets involved in the cluster, while
other configuration aspects need to be set on a per-kubelet basis to accommodate the different
characteristics of a given machine (such as OS, storage, and networking). You can manage the configuration
of your kubelets manually, but kubeadm now provides a KubeletConfiguration
API type for
managing your kubelet configurations centrally.
Kubelet configuration patterns
The following sections describe patterns for kubelet configuration that are simplified by using kubeadm, rather than managing the kubelet configuration for each Node manually.
Propagating cluster-level configuration to each kubelet
You can provide the kubelet with default values to be used by kubeadm init
and kubeadm join
commands. Interesting examples include using a different CRI runtime or setting the default subnet
used by services.
If you want your services to use the subnet 10.96.0.0/12
as the default for services, you can pass
the --service-cidr
parameter to kubeadm:
kubeadm init --service-cidr 10.96.0.0/12
Virtual IPs for services are now allocated from this subnet. You also need to set the DNS address used
by the kubelet, using the --cluster-dns
flag. This setting needs to be the same for every kubelet
on every manager and Node in the cluster. The kubelet provides a versioned, structured API object
that can configure most parameters in the kubelet and push out this configuration to each running
kubelet in the cluster. This object is called
KubeletConfiguration
.
The KubeletConfiguration
allows the user to specify flags such as the cluster DNS IP addresses expressed as
a list of values to a camelCased key, illustrated by the following example:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
clusterDNS:
- 10.96.0.10
For more details on the KubeletConfiguration
have a look at this section.
Providing instance-specific configuration details
Some hosts require specific kubelet configurations due to differences in hardware, operating system, networking, or other host-specific parameters. The following list provides a few examples.
- The path to the DNS resolution file, as specified by the --resolv-conf kubelet configuration flag, may differ among operating systems, or depending on whether you are using systemd-resolved. If this path is wrong, DNS resolution will fail on the Node whose kubelet is configured incorrectly.
- The Node API object .metadata.name is set to the machine's hostname by default, unless you are using a cloud provider. You can use the --hostname-override flag to override the default behavior if you need to specify a Node name different from the machine's hostname.
- Currently, the kubelet cannot automatically detect the cgroup driver used by the CRI runtime, but the value of --cgroup-driver must match the cgroup driver used by the CRI runtime to ensure the health of the kubelet.
- Depending on the CRI runtime your cluster uses, you may need to specify different flags to the kubelet. For instance, when using Docker, you need to specify flags such as --network-plugin=cni, but if you are using an external runtime, you need to specify --container-runtime=remote and specify the CRI endpoint using --container-runtime-endpoint=<path>.
You can specify these flags by configuring an individual kubelet's configuration in your service manager, such as systemd.
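As a minimal sketch, an instance-specific systemd drop-in might look like the following; the file name and flag values are hypothetical and must match your host's actual setup:

# Hypothetical file: /etc/systemd/system/kubelet.service.d/30-local-overrides.conf
[Service]
Environment="KUBELET_EXTRA_ARGS=--hostname-override=node-1.example.internal --resolv-conf=/run/systemd/resolve/resolv.conf"

After adding or changing a drop-in like this, run systemctl daemon-reload && systemctl restart kubelet so the new flags take effect.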
Configure kubelets using kubeadm
It is possible to configure the kubelet that kubeadm will start if a custom KubeletConfiguration
API object is passed with a configuration file like so kubeadm ... --config some-config-file.yaml
.
By calling kubeadm config print init-defaults --component-configs KubeletConfiguration
you can
see all the default values for this structure.
Also have a look at the reference for the KubeletConfiguration for more information on the individual fields.
Workflow when using kubeadm init
When you call kubeadm init
, the kubelet configuration is marshalled to disk
at /var/lib/kubelet/config.yaml
, and also uploaded to a ConfigMap in the cluster. The ConfigMap
is named kubelet-config-1.X
, where X
is the minor version of the Kubernetes version you are
initializing. A kubelet configuration file is also written to /etc/kubernetes/kubelet.conf
with the
baseline cluster-wide configuration for all kubelets in the cluster. This configuration file
points to the client certificates that allow the kubelet to communicate with the API server. This
addresses the need to
propagate cluster-level configuration to each kubelet.
To address the second pattern of
providing instance-specific configuration details,
kubeadm writes an environment file to /var/lib/kubelet/kubeadm-flags.env
, which contains a list of
flags to pass to the kubelet when it starts. The flags are presented in the file like this:
KUBELET_KUBEADM_ARGS="--flag1=value1 --flag2=value2 ..."
In addition to the flags used when starting the kubelet, the file also contains dynamic
parameters such as the cgroup driver and whether to use a different CRI runtime socket
(--cri-socket
).
After marshalling these two files to disk, kubeadm attempts to run the following two commands, if you are using systemd:
systemctl daemon-reload && systemctl restart kubelet
If the reload and restart are successful, the normal kubeadm init
workflow continues.
Workflow when using kubeadm join
When you run kubeadm join
, kubeadm uses the Bootstrap Token credential to perform
a TLS bootstrap, which fetches the credential needed to download the
kubelet-config-1.X
ConfigMap and writes it to /var/lib/kubelet/config.yaml
. The dynamic
environment file is generated in exactly the same way as kubeadm init
.
Next, kubeadm
runs the following two commands to load the new configuration into the kubelet:
systemctl daemon-reload && systemctl restart kubelet
After the kubelet loads the new configuration, kubeadm writes the
/etc/kubernetes/bootstrap-kubelet.conf
KubeConfig file, which contains a CA certificate and Bootstrap
Token. These are used by the kubelet to perform the TLS Bootstrap and obtain a unique
credential, which is stored in /etc/kubernetes/kubelet.conf
.
When the /etc/kubernetes/kubelet.conf
file is written, the kubelet has finished performing the TLS Bootstrap.
Kubeadm deletes the /etc/kubernetes/bootstrap-kubelet.conf
file after completing the TLS Bootstrap.
The kubelet drop-in file for systemd
kubeadm
ships with configuration for how systemd should run the kubelet.
Note that the kubeadm CLI command never touches this drop-in file.
This configuration file installed by the kubeadm
DEB or
RPM package is written to
/etc/systemd/system/kubelet.service.d/10-kubeadm.conf
and is used by systemd.
It augments the basic kubelet.service for RPM or DEB:
[Service]
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf"
Environment="KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml"
# This is a file that "kubeadm init" and "kubeadm join" generate at runtime, populating
# the KUBELET_KUBEADM_ARGS variable dynamically
EnvironmentFile=-/var/lib/kubelet/kubeadm-flags.env
# This is a file that the user can use for overrides of the kubelet args as a last resort. Preferably,
# the user should use the .NodeRegistration.KubeletExtraArgs object in the configuration files instead.
# KUBELET_EXTRA_ARGS should be sourced from this file.
EnvironmentFile=-/etc/default/kubelet
ExecStart=
ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS
This file specifies the default locations for all of the files managed by kubeadm for the kubelet.
- The KubeConfig file to use for the TLS Bootstrap is /etc/kubernetes/bootstrap-kubelet.conf, but it is only used if /etc/kubernetes/kubelet.conf does not exist.
- The KubeConfig file with the unique kubelet identity is /etc/kubernetes/kubelet.conf.
- The file containing the kubelet's ComponentConfig is /var/lib/kubelet/config.yaml.
- The dynamic environment file that contains KUBELET_KUBEADM_ARGS is sourced from /var/lib/kubelet/kubeadm-flags.env.
- The file that can contain user-specified flag overrides with KUBELET_EXTRA_ARGS is sourced from /etc/default/kubelet (for DEBs), or /etc/sysconfig/kubelet (for RPMs). KUBELET_EXTRA_ARGS is last in the flag chain and has the highest priority in the event of conflicting settings. A hypothetical example of such an override file follows this list.
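For example, on a DEB-based host, /etc/default/kubelet could contain an override like this; the label value is an assumption for illustration:

# /etc/default/kubelet (hypothetical content)
KUBELET_EXTRA_ARGS="--node-labels=disktype=ssd"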
Kubernetes binaries and package contents
The DEB and RPM packages shipped with the Kubernetes releases are:
Package name | Description
---|---
kubeadm | Installs the /usr/bin/kubeadm CLI tool and the kubelet drop-in file for the kubelet.
kubelet | Installs the /usr/bin/kubelet binary.
kubectl | Installs the /usr/bin/kubectl binary.
cri-tools | Installs the /usr/bin/crictl binary from the cri-tools git repository.
kubernetes-cni | Installs the /opt/cni/bin binaries from the plugins git repository.
2.2.2.1.9 - Dual-stack support with kubeadm
Kubernetes v1.23 [stable]
Your Kubernetes cluster includes dual-stack networking, which means that cluster networking lets you use either address family. In a cluster, the control plane can assign both an IPv4 address and an IPv6 address to a single Pod or a Service.
Before you begin
You need to have installed the kubeadm tool, following the steps from Installing kubeadm.
For each server that you want to use as a node, make sure it allows IPv6 forwarding. On Linux, you can set this by running sysctl -w net.ipv6.conf.all.forwarding=1 as the root user on each server.
You need to have an IPv4 and an IPv6 address range to use. Cluster operators typically
use private address ranges for IPv4. For IPv6, a cluster operator typically chooses a global
unicast address block from within 2000::/3
, using a range that is assigned to the operator.
You don't have to route the cluster's IP address ranges to the public internet.
The size of the IP address allocations should be suitable for the number of Pods and Services that you are planning to run.
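For a sense of scale, here is the arithmetic for the example ranges used later on this page; these figures are illustrative, not a recommendation:

# 10.244.0.0/16        -> 2^(32-16)   = 65,536 IPv4 pod addresses
# 2001:db8:42:0::/56   -> 2^(64-56)   = 256 possible /64 per-node pod subnets
# 10.96.0.0/16         -> 2^(32-16)   = 65,536 IPv4 Service ClusterIPs
# 2001:db8:42:1::/112  -> 2^(128-112) = 65,536 IPv6 Service ClusterIPs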
Note: When you use the kubeadm upgrade command, kubeadm does not support making modifications to the pod IP address range (“cluster CIDR”) nor to the cluster's Service address range (“Service CIDR”).
Create a dual-stack cluster
To create a dual-stack cluster with kubeadm init
you can pass command line arguments
similar to the following example:
# These address ranges are examples
kubeadm init --pod-network-cidr=10.244.0.0/16,2001:db8:42:0::/56 --service-cidr=10.96.0.0/16,2001:db8:42:1::/112
To make things clearer, here is an example kubeadm
configuration file
kubeadm-config.yaml
for the primary dual-stack control plane node.
---
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
networking:
podSubnet: 10.244.0.0/16,2001:db8:42:0::/56
serviceSubnet: 10.96.0.0/16,2001:db8:42:1::/112
---
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
localAPIEndpoint:
advertiseAddress: "10.100.0.1"
bindPort: 6443
nodeRegistration:
kubeletExtraArgs:
node-ip: 10.100.0.2,fd00:1:2:3::2
Note: advertiseAddress in InitConfiguration specifies the IP address that the API Server will advertise it is listening on. The value of advertiseAddress equals the --apiserver-advertise-address flag of kubeadm init.
Run kubeadm to initiate the dual-stack control plane node:
kubeadm init --config=kubeadm-config.yaml
The kube-controller-manager flags --node-cidr-mask-size-ipv4|--node-cidr-mask-size-ipv6
are set with default values. See configure IPv4/IPv6 dual stack.
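For reference, a sketch of those defaults follows; the values are the assumed upstream defaults, so verify them against your Kubernetes release:

# Assumed kube-controller-manager defaults in dual-stack mode: each node receives
# a /24 slice of the IPv4 pod CIDR and a /64 slice of the IPv6 pod CIDR.
--node-cidr-mask-size-ipv4=24
--node-cidr-mask-size-ipv6=64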
Note: The --apiserver-advertise-address flag does not support dual-stack.
Join a node to dual-stack cluster
Before joining a node, make sure that the node has an IPv6-routable network interface and allows IPv6 forwarding.
Here is an example kubeadm configuration file
kubeadm-config.yaml
for joining a worker node to the cluster.
apiVersion: kubeadm.k8s.io/v1beta3
kind: JoinConfiguration
discovery:
bootstrapToken:
apiServerEndpoint: 10.100.0.1:6443
token: "clvldh.vjjwg16ucnhp94qr"
caCertHashes:
- "sha256:a4863cde706cfc580a439f842cc65d5ef112b7b2be31628513a9881cf0d9fe0e"
# change auth info above to match the actual token and CA certificate hash for your cluster
nodeRegistration:
kubeletExtraArgs:
node-ip: 10.100.0.3,fd00:1:2:3::3
Also, here is an example kubeadm configuration file
kubeadm-config.yaml
for joining another control plane node to the cluster.
apiVersion: kubeadm.k8s.io/v1beta3
kind: JoinConfiguration
controlPlane:
localAPIEndpoint:
advertiseAddress: "10.100.0.2"
bindPort: 6443
discovery:
bootstrapToken:
apiServerEndpoint: 10.100.0.1:6443
token: "clvldh.vjjwg16ucnhp94qr"
caCertHashes:
- "sha256:a4863cde706cfc580a439f842cc65d5ef112b7b2be31628513a9881cf0d9fe0e"
# change auth info above to match the actual token and CA certificate hash for your cluster
nodeRegistration:
kubeletExtraArgs:
node-ip: 10.100.0.4,fd00:1:2:3::4
Note: advertiseAddress in JoinConfiguration.controlPlane specifies the IP address that the API Server will advertise it is listening on. The value of advertiseAddress equals the --apiserver-advertise-address flag of kubeadm join.
kubeadm join --config=kubeadm-config.yaml
Create a single-stack cluster
To make things clearer, here is an example kubeadm configuration file kubeadm-config.yaml for the single-stack control plane node.
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
networking:
podSubnet: 10.244.0.0/16
serviceSubnet: 10.96.0.0/16
What's next
- Validate IPv4/IPv6 dual-stack networking
- Read about Dual-stack cluster networking
- Learn more about the kubeadm configuration format
2.2.2.2 - Installing Kubernetes with kops
This quickstart shows you how to easily install a Kubernetes cluster on AWS.
It uses a tool called kops
.
kops is an automated provisioning system:
- Fully automated installation
- Uses DNS to identify clusters
- Self-healing: everything runs in Auto-Scaling Groups
- Multiple OS support (Debian, Ubuntu 16.04 supported, CentOS & RHEL, Amazon Linux and CoreOS) - see the images.md
- High-Availability support - see the high_availability.md
- Can directly provision, or generate terraform manifests - see the terraform.md
Before you begin
-
You must have kubectl installed.
-
You must install
kops
on a 64-bit (AMD64 and Intel 64) device architecture. -
You must have an AWS account, generate IAM keys and configure them. The IAM user will need adequate permissions.
Creating a cluster
(1/5) Install kops
Installation
Download kops from the releases page (it is also convenient to build from source):
On macOS, download the latest release with the command:
curl -LO https://github.com/kubernetes/kops/releases/download/$(curl -s https://api.github.com/repos/kubernetes/kops/releases/latest | grep tag_name | cut -d '"' -f 4)/kops-darwin-amd64
To download a specific version, replace the following portion of the command with the specific kops version.
$(curl -s https://api.github.com/repos/kubernetes/kops/releases/latest | grep tag_name | cut -d '"' -f 4)
For example, to download kops version v1.20.0 type:
curl -LO https://github.com/kubernetes/kops/releases/download/v1.20.0/kops-darwin-amd64
Make the kops binary executable.
chmod +x kops-darwin-amd64
Move the kops binary in to your PATH.
sudo mv kops-darwin-amd64 /usr/local/bin/kops
You can also install kops using Homebrew.
brew update && brew install kops
On Linux, download the latest release with the command:
curl -LO https://github.com/kubernetes/kops/releases/download/$(curl -s https://api.github.com/repos/kubernetes/kops/releases/latest | grep tag_name | cut -d '"' -f 4)/kops-linux-amd64
To download a specific version of kops, replace the following portion of the command with the specific kops version.
$(curl -s https://api.github.com/repos/kubernetes/kops/releases/latest | grep tag_name | cut -d '"' -f 4)
For example, to download kops version v1.20.0 type:
curl -LO https://github.com/kubernetes/kops/releases/download/v1.20.0/kops-linux-amd64
Make the kops binary executable
chmod +x kops-linux-amd64
Move the kops binary in to your PATH.
sudo mv kops-linux-amd64 /usr/local/bin/kops
You can also install kops using Homebrew.
brew update && brew install kops
(2/5) Create a route53 domain for your cluster
kops uses DNS for discovery, both inside the cluster and outside, so that you can reach the kubernetes API server from clients.
kops has a strong opinion on the cluster name: it should be a valid DNS name. By doing so you will no longer get your clusters confused, you can share clusters with your colleagues unambiguously, and you can reach them without relying on remembering an IP address.
You can, and probably should, use subdomains to divide your clusters. As our example we will use
useast1.dev.example.com
. The API server endpoint will then be api.useast1.dev.example.com
.
A Route53 hosted zone can serve subdomains. Your hosted zone could be useast1.dev.example.com
,
but also dev.example.com
or even example.com
. kops works with any of these, so typically
you choose for organization reasons (e.g. you are allowed to create records under dev.example.com
,
but not under example.com
).
Let's assume you're using dev.example.com
as your hosted zone. You create that hosted zone using
the normal process, or
with a command such as aws route53 create-hosted-zone --name dev.example.com --caller-reference 1
.
You must then set up your NS records in the parent domain, so that records in the domain will resolve. Here,
you would create NS records in example.com
for dev
. If it is a root domain name you would configure the NS
records at your domain registrar (e.g. example.com
would need to be configured where you bought example.com
).
Verify your route53 domain setup (it is the #1 cause of problems!). You can double-check that your cluster is configured correctly if you have the dig tool by running:
dig NS dev.example.com
You should see the 4 NS records that Route53 assigned your hosted zone.
(3/5) Create an S3 bucket to store your clusters state
kops lets you manage your clusters even after installation. To do this, it must keep track of the clusters that you have created, along with their configuration, the keys they are using etc. This information is stored in an S3 bucket. S3 permissions are used to control access to the bucket.
Multiple clusters can use the same S3 bucket, and you can share an S3 bucket between your colleagues that administer the same clusters - this is much easier than passing around kubecfg files. But anyone with access to the S3 bucket will have administrative access to all your clusters, so you don't want to share it beyond the operations team.
So typically you have one S3 bucket for each ops team (and often the name will correspond to the name of the hosted zone above!)
In our example, we chose dev.example.com
as our hosted zone, so let's pick clusters.dev.example.com
as
the S3 bucket name.
-
Export
AWS_PROFILE
(if you need to select a profile for the AWS CLI to work) -
Create the S3 bucket using
aws s3 mb s3://clusters.dev.example.com
-
You can
export KOPS_STATE_STORE=s3://clusters.dev.example.com
and then kops will use this location by default. We suggest putting this in your bash profile or similar.
(4/5) Build your cluster configuration
Run kops create cluster
to create your cluster configuration:
kops create cluster --zones=us-east-1c useast1.dev.example.com
kops will create the configuration for your cluster. Note that it only creates the configuration, it does
not actually create the cloud resources - you'll do that in the next step with a kops update cluster
. This
gives you an opportunity to review the configuration or change it.
It prints commands you can use to explore further:
- List your clusters with:
kops get cluster
- Edit this cluster with:
kops edit cluster useast1.dev.example.com
- Edit your node instance group:
kops edit ig --name=useast1.dev.example.com nodes
- Edit your master instance group:
kops edit ig --name=useast1.dev.example.com master-us-east-1c
If this is your first time using kops, do spend a few minutes to try those out! An instance group is a set of instances, which will be registered as kubernetes nodes. On AWS this is implemented via auto-scaling-groups. You can have several instance groups, for example if you wanted nodes that are a mix of spot and on-demand instances, or GPU and non-GPU instances.
(5/5) Create the cluster in AWS
Run "kops update cluster" to create your cluster in AWS:
kops update cluster useast1.dev.example.com --yes
That takes a few seconds to run, but then your cluster will likely take a few minutes to actually be ready.
kops update cluster
will be the tool you'll use whenever you change the configuration of your cluster; it
applies the changes you have made to the configuration to your cluster - reconfiguring AWS or kubernetes as needed.
For example, after you run kops edit ig nodes, apply your configuration with kops update cluster --yes, and sometimes you will also have to run kops rolling-update cluster to roll out the configuration immediately.
Without --yes
, kops update cluster
will show you a preview of what it is going to do. This is handy
for production clusters!
Explore other add-ons
See the list of add-ons to explore other add-ons, including tools for logging, monitoring, network policy, visualization, and control of your Kubernetes cluster.
Cleanup
- To delete your cluster:
kops delete cluster useast1.dev.example.com --yes
What's next
- Learn more about Kubernetes concepts and kubectl.
- Learn more about kops advanced usage for tutorials, best practices and advanced configuration options.
- Follow kops community discussions on Slack: community discussions.
- Contribute to kops by addressing or raising an issue: GitHub Issues.
2.2.2.3 - Installing Kubernetes with Kubespray
This quickstart helps to install a Kubernetes cluster hosted on GCE, Azure, OpenStack, AWS, vSphere, Packet (bare metal), Oracle Cloud Infrastructure (Experimental) or Baremetal with Kubespray.
Kubespray is a composition of Ansible playbooks, inventory, provisioning tools, and domain knowledge for generic OS/Kubernetes clusters configuration management tasks. Kubespray provides:
- a highly available cluster
- composable attributes
- support for most popular Linux distributions
- Ubuntu 16.04, 18.04, 20.04
- CentOS/RHEL/Oracle Linux 7, 8
- Debian Buster, Jessie, Stretch, Wheezy
- Fedora 31, 32
- Fedora CoreOS
- openSUSE Leap 15
- Flatcar Container Linux by Kinvolk
- continuous integration tests
To choose a tool which best fits your use case, read this comparison to kubeadm and kops.
Creating a cluster
(1/5) Meet the underlay requirements
Provision servers with the following requirements:
- Ansible v2.9 and python-netaddr are installed on the machine that will run Ansible commands
- Jinja 2.11 (or newer) is required to run the Ansible Playbooks
- The target servers must have access to the Internet in order to pull docker images. Otherwise, additional configuration is required (See Offline Environment)
- The target servers are configured to allow IPv4 forwarding
- Your ssh key must be copied to all the servers in your inventory
- Firewalls are not managed by kubespray. You'll need to implement appropriate rules as needed. You should disable your firewall in order to avoid any issues during deployment
- If kubespray is run from a non-root user account, the correct privilege escalation method should be configured in the target servers and the ansible_become flag or command parameters --become or -b should be specified
Kubespray provides the following utilities to help provision your environment:
(2/5) Compose an inventory file
After you provision your servers, create an inventory file for Ansible. You can do this manually or via a dynamic inventory script. For more information, see "Building your own inventory".
(3/5) Plan your cluster deployment
Kubespray provides the ability to customize many aspects of the deployment:
- Choice of deployment mode: kubeadm or non-kubeadm
- CNI (networking) plugins
- DNS configuration
- Choice of control plane: native/binary or containerized
- Component versions
- Calico route reflectors
- Component runtime options
- Certificate generation methods
Kubespray customizations can be made to a variable file. If you are getting started with Kubespray, consider using the Kubespray defaults to deploy your cluster and explore Kubernetes.
(4/5) Deploy a Cluster
Next, deploy your cluster:
Cluster deployment using ansible-playbook.
ansible-playbook -i your/inventory/inventory.ini cluster.yml -b -v \
--private-key=~/.ssh/private_key
Large deployments (100+ nodes) may require specific adjustments for best results.
(5/5) Verify the deployment
Kubespray provides a way to verify inter-pod connectivity and DNS resolution with Netchecker. Netchecker ensures the netchecker-agents pods can resolve DNS requests and ping each other within the default namespace. Those pods mimic similar behavior as the rest of the workloads and serve as cluster health indicators.
Cluster operations
Kubespray provides additional playbooks to manage your cluster: scale and upgrade.
Scale your cluster
You can add worker nodes to your cluster by running the scale playbook. For more information, see "Adding nodes". You can remove worker nodes from your cluster by running the remove-node playbook. For more information, see "Remove nodes".
Upgrade your cluster
You can upgrade your cluster by running the upgrade-cluster playbook. For more information, see "Upgrades".
Cleanup
You can reset your nodes and wipe out all components installed with Kubespray via the reset playbook.
Feedback
- Slack Channel: #kubespray (You can get your invite here)
- GitHub Issues
What's next
Check out planned work on Kubespray's roadmap.
2.2.3 - Turnkey Cloud Solutions
This page provides a list of Kubernetes certified solution providers. From each provider page, you can learn how to install and set up production-ready clusters.
2.2.4 - Windows in Kubernetes
2.2.4.1 - Windows containers in Kubernetes
Windows applications constitute a large portion of the services and applications that run in many organizations. Windows containers provide a way to encapsulate processes and package dependencies, making it easier to use DevOps practices and follow cloud native patterns for Windows applications.
Organizations with investments in Windows-based applications and Linux-based applications don't have to look for separate orchestrators to manage their workloads, leading to increased operational efficiencies across their deployments, regardless of operating system.
Windows nodes in Kubernetes
To enable the orchestration of Windows containers in Kubernetes, include Windows nodes in your existing Linux cluster. Scheduling Windows containers in Pods on Kubernetes is similar to scheduling Linux-based containers.
In order to run Windows containers, your Kubernetes cluster must include multiple operating systems. While you can only run the control plane on Linux, you can deploy worker nodes running either Windows or Linux depending on your workload needs.
Windows nodes are supported provided that the operating system is Windows Server 2019.
This document uses the term Windows containers to mean Windows containers with process isolation. Kubernetes does not support running Windows containers with Hyper-V isolation.
Resource management
On Linux nodes, cgroups are used as a pod boundary for resource control. Containers are created within that boundary for network, process and file system isolation. The Linux cgroup APIs can be used to gather CPU, I/O, and memory use statistics.
In contrast, Windows uses a job object per container with a system namespace filter to contain all processes in a container and provide logical isolation from the host. (Job objects are a Windows process isolation mechanism and are different from what Kubernetes refers to as a Job).
There is no way to run a Windows container without the namespace filtering in place. This means that system privileges cannot be asserted in the context of the host, and thus privileged containers are not available on Windows. Containers cannot assume an identity from the host because the Security Account Manager (SAM) is separate.
Memory reservations
Windows does not have an out-of-memory process killer as Linux does. Windows always treats all user-mode memory allocations as virtual, and pagefiles are mandatory (on Linux, the kubelet will by default not start with swap space enabled).
Windows nodes do not overcommit memory for processes running in containers. The net effect is that Windows won't reach out of memory conditions the same way Linux does, and processes page to disk instead of being subject to out of memory (OOM) termination. If memory is over-provisioned and all physical memory is exhausted, then paging can slow down performance.
You can place bounds on memory use for workloads using the kubelet parameters --kube-reserved and/or --system-reserved; these account for memory usage on the node (outside of containers), and reduce NodeAllocatable.
As you deploy workloads, set resource limits on containers. This also subtracts from
NodeAllocatable
and prevents the scheduler from adding more pods once a node is full.
On Windows, a good practice to avoid over-provisioning is to configure the kubelet with a system reserved memory of at least 2GiB to account for Windows, Kubernetes, and container runtime overheads; a minimal configuration sketch follows.
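A minimal KubeletConfiguration sketch of such a reservation; the values are example assumptions and should be tuned for your nodes:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
systemReserved:
  memory: 2Gi   # reserve for Windows, Kubernetes, and container runtime overheads
  cpu: 500m     # optional CPU reservation; see the CPU reservations section below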
CPU reservations
To account for CPU use by the operating system, the container runtime, and by Kubernetes host processes such as the kubelet, you can (and should) reserve a percentage of total CPU. You should determine this CPU reservation taking into account the number of CPU cores available on the node. To decide on the CPU percentage to reserve, identify the maximum pod density for each node and monitor the CPU usage of the system services running there, then choose a value that meets your workload needs.
You can place bounds on CPU usage for workloads using the kubelet parameters --kube-reserved and/or --system-reserved to account for CPU usage on the node (outside of containers). This reduces NodeAllocatable.
The cluster-wide scheduler then takes this reservation into account when determining
pod placement.
On Windows, the kubelet supports a command-line flag to set the priority of the
kubelet process: --windows-priorityclass
. This flag allows the kubelet process to get
more CPU time slices when compared to other processes running on the Windows host.
More information on the allowable values and their meaning is available at
Windows Priority Classes.
To ensure that running Pods do not starve the kubelet of CPU cycles, set this flag to ABOVE_NORMAL_PRIORITY_CLASS
or above.
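For example, the flag might be placed like this; the surrounding flags are elided, so this is an illustrative fragment rather than a complete kubelet command line:

kubelet <existing flags> --windows-priorityclass=ABOVE_NORMAL_PRIORITY_CLASS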
Compatibility and limitations
Some node features are only available if you use a specific container runtime; others are not available on Windows nodes, including:
- HugePages: not supported for Windows containers
- Privileged containers: not supported for Windows containers
- TerminationGracePeriod: requires containerd
Not all features of shared namespaces are supported. See API compatibility for more details.
See Windows OS version compatibility for details on the Windows versions that Kubernetes is tested against.
From an API and kubectl perspective, Windows containers behave in much the same way as Linux-based containers. However, there are some notable differences in key functionality which are outlined in this section.
Comparison with Linux
Key Kubernetes elements work the same way in Windows as they do in Linux. This section refers to several key workload enablers and how they map to Windows.
- A Pod is the basic building block of Kubernetes: the smallest and simplest unit in the Kubernetes object model that you create or deploy. You may not deploy Windows and Linux containers in the same Pod. All containers in a Pod are scheduled onto a single Node where each Node represents a specific platform and architecture. The following Pod capabilities, properties and events are supported with Windows containers:

  - Single or multiple containers per Pod with process isolation and volume sharing
  - Pod status fields
  - Readiness and Liveness probes
  - postStart & preStop container lifecycle events
  - ConfigMap, Secrets: as environment variables or volumes
  - emptyDir volumes
  - Named pipe host mounts
  - Resource limits
  - OS field (FEATURE STATE: Kubernetes v1.23 [alpha]): .spec.os.name should be set to windows to indicate that the current Pod uses Windows containers. The IdentifyPodOS feature gate needs to be enabled for this field to be recognized and used by control plane components and the kubelet.

    Note: If the IdentifyPodOS feature gate is enabled and you set the .spec.os.name field to windows, you must not set the following fields in the .spec of that Pod:

    - spec.hostPID
    - spec.hostIPC
    - spec.securityContext.seLinuxOptions
    - spec.securityContext.seccompProfile
    - spec.securityContext.fsGroup
    - spec.securityContext.fsGroupChangePolicy
    - spec.securityContext.sysctls
    - spec.shareProcessNamespace
    - spec.securityContext.runAsUser
    - spec.securityContext.runAsGroup
    - spec.securityContext.supplementalGroups
    - spec.containers[*].securityContext.seLinuxOptions
    - spec.containers[*].securityContext.seccompProfile
    - spec.containers[*].securityContext.capabilities
    - spec.containers[*].securityContext.readOnlyRootFilesystem
    - spec.containers[*].securityContext.privileged
    - spec.containers[*].securityContext.allowPrivilegeEscalation
    - spec.containers[*].securityContext.procMount
    - spec.containers[*].securityContext.runAsUser
    - spec.containers[*].securityContext.runAsGroup

    If any of these fields are set, Pod API validation fails, causing admission failures.

    Note: In this list, wildcards (*) indicate all elements in a list. For example, spec.containers[*].securityContext refers to the Security Context object for all defined containers.
- Workload resources including:
  - ReplicaSet
  - Deployments
  - StatefulSets
  - DaemonSet
  - Job
  - CronJob
  - ReplicationController
- Services. See Load balancing and Services for more details.
Pods, workload resources, and Services are critical elements to managing Windows workloads on Kubernetes. However, on their own they are not enough to enable the proper lifecycle management of Windows workloads in a dynamic cloud native environment. Kubernetes also supports:
kubectl exec
- Pod and container metrics
- Horizontal pod autoscaling
- Resource quotas
- Scheduler preemption
Networking on Windows nodes
Networking for Windows containers is exposed through CNI plugins. Windows containers function similarly to virtual machines in regards to networking. Each container has a virtual network adapter (vNIC) which is connected to a Hyper-V virtual switch (vSwitch). The Host Networking Service (HNS) and the Host Compute Service (HCS) work together to create containers and attach container vNICs to networks. HCS is responsible for the management of containers whereas HNS is responsible for the management of networking resources such as:
- Virtual networks (including creation of vSwitches)
- Endpoints / vNICs
- Namespaces
- Policies including packet encapsulations, load-balancing rules, ACLs, and NAT rules.
Container networking
The Windows HNS and vSwitch implement namespacing and can
create virtual NICs as needed for a pod or container. However, many configurations such
as DNS, routes, and metrics are stored in the Windows registry database rather than as
files inside /etc
, which is how Linux stores those configurations. The Windows registry for the container
is separate from that of the host, so concepts like mapping /etc/resolv.conf
from
the host into a container don't have the same effect they would on Linux. These must
be configured using Windows APIs run in the context of that container. Therefore
CNI implementations need to call the HNS instead of relying on file mappings to pass
network details into the pod or container.
The following networking functionality is not supported on Windows nodes:
- Host networking mode
- Local NodePort access from the node itself (works for other nodes or external clients)
- More than 64 backend pods (or unique destination addresses) for a single Service
- IPv6 communication between Windows pods connected to overlay networks
- Local Traffic Policy in non-DSR mode
- Outbound communication using the ICMP protocol via the win-overlay, win-bridge, or Azure-CNI plugin. Specifically, the Windows data plane (VFP) doesn't support ICMP packet transpositions, and this means:
  - ICMP packets directed to destinations within the same network (such as pod to pod communication via ping) work as expected and without any limitations;
  - TCP/UDP packets work as expected and without any limitations;
  - ICMP packets directed to pass through a remote network (e.g. pod to external internet communication via ping) cannot be transposed and thus will not be routed back to their source;
  - Since TCP/UDP packets can still be transposed, you can substitute ping <destination> with curl <destination> to get some debugging insight into connectivity with the outside world (see the example after this list).
Overlay networking support in kube-proxy is a beta feature. In addition, it requires KB4482887 to be installed on Windows Server 2019.
Network modes
Windows supports five different networking drivers/modes: L2bridge, L2tunnel, Overlay (beta), Transparent, and NAT. In a heterogeneous cluster with Windows and Linux worker nodes, you need to select a networking solution that is compatible on both Windows and Linux. The following out-of-tree plugins are supported on Windows, with recommendations on when to use each CNI:
Network Driver | Description | Container Packet Modifications | Network Plugins | Network Plugin Characteristics |
---|---|---|---|---|
L2bridge | Containers are attached to an external vSwitch. Containers are attached to the underlay network, although the physical network doesn't need to learn the container MACs because they are rewritten on ingress/egress. | MAC is rewritten to host MAC, IP may be rewritten to host IP using HNS OutboundNAT policy. | win-bridge, Azure-CNI, Flannel host-gateway uses win-bridge | win-bridge uses L2bridge network mode, connects containers to the underlay of hosts, offering best performance. Requires user-defined routes (UDR) for inter-node connectivity. |
L2Tunnel | This is a special case of l2bridge, but only used on Azure. All packets are sent to the virtualization host where SDN policy is applied. | MAC rewritten, IP visible on the underlay network | Azure-CNI | Azure-CNI allows integration of containers with Azure vNET, and allows them to leverage the set of capabilities that Azure Virtual Network provides. For example, securely connect to Azure services or use Azure NSGs. See azure-cni for some examples |
Overlay (Overlay networking for Windows in Kubernetes is in beta stage) | Containers are given a vNIC connected to an external vSwitch. Each overlay network gets its own IP subnet, defined by a custom IP prefix. The overlay network driver uses VXLAN encapsulation. | Encapsulated with an outer header. | win-overlay, Flannel VXLAN (uses win-overlay) | win-overlay should be used when virtual container networks are desired to be isolated from underlay of hosts (e.g. for security reasons). Allows for IPs to be re-used for different overlay networks (which have different VNID tags) if you are restricted on IPs in your datacenter. This option requires KB4489899 on Windows Server 2019. |
Transparent (special use case for ovn-kubernetes) | Requires an external vSwitch. Containers are attached to an external vSwitch which enables intra-pod communication via logical networks (logical switches and routers). | Packet is encapsulated either via GENEVE or STT tunneling to reach pods which are not on the same host. Packets are forwarded or dropped via the tunnel metadata information supplied by the ovn network controller. NAT is done for north-south communication. | ovn-kubernetes | Deploy via ansible. Distributed ACLs can be applied via Kubernetes policies. IPAM support. Load-balancing can be achieved without kube-proxy. NATing is done without using iptables/netsh. |
NAT (not used in Kubernetes) | Containers are given a vNIC connected to an internal vSwitch. DNS/DHCP is provided using an internal component called WinNAT. | MAC and IP is rewritten to host MAC/IP. | nat | Included here for completeness |
As outlined above, the Flannel CNI meta plugin is also supported on Windows via the VXLAN network backend (alpha support; delegates to win-overlay) and host-gateway network backend (stable support; delegates to win-bridge).
This plugin supports delegating to one of the reference CNI plugins (win-overlay, win-bridge), to work in conjunction with the Flannel daemon on Windows (Flanneld) for automatic node subnet lease assignment and HNS network creation. This plugin reads in its own configuration file (cni.conf), and aggregates it with the environment variables from the FlannelD generated subnet.env file. It then delegates to one of the reference CNI plugins for network plumbing, and sends the correct configuration containing the node-assigned subnet to the IPAM plugin (for example: host-local).
For Node, Pod, and Service objects, the following network flows are supported for TCP/UDP traffic:
- Pod → Pod (IP)
- Pod → Pod (Name)
- Pod → Service (Cluster IP)
- Pod → Service (PQDN, but only if there are no ".")
- Pod → Service (FQDN)
- Pod → external (IP)
- Pod → external (DNS)
- Node → Pod
- Pod → Node
CNI plugin limitations
- Windows reference network plugins win-bridge and win-overlay do not implement CNI spec v0.4.0, due to a missing CHECK implementation.
- The Flannel VXLAN CNI plugin has the following limitations on Windows:
  - Node-pod connectivity isn't possible by design. It's only possible for local pods with Flannel v0.12.0 (or higher).
  - Flannel is restricted to using VNI 4096 and UDP port 4789. See the official Flannel VXLAN backend docs for more details on these parameters.
IP address management (IPAM)
The following IPAM options are supported on Windows:
- host-local
- HNS IPAM (Inbox platform IPAM, this is a fallback when no IPAM is set)
- azure-vnet-ipam (for azure-cni only)
Load balancing and Services
A Kubernetes Service is an abstraction that defines a logical set of Pods and a means to access them over a network. In a cluster that includes Windows nodes, you can use the following types of Service:
- NodePort
- ClusterIP
- LoadBalancer
- ExternalName
There is a known issue with NodePort Services on overlay networking, if the target destination node is running Windows Server 2022. To avoid the issue entirely, you can configure the service with externalTrafficPolicy: Local (see the sketch below).
There are known issues with pod to pod connectivity on l2bridge networks on Windows Server 2022 with KB5005619 or higher installed. To work around the issue and restore pod-to-pod connectivity, you can disable the WinDSR feature in kube-proxy.
These issues require OS fixes. Please follow https://github.com/microsoft/Windows-Containers/issues/204 for updates.
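As a minimal sketch of the workaround described above (the Service name, labels, and ports here are placeholders, not part of this page's examples):
apiVersion: v1
kind: Service
metadata:
  name: win-webserver          # placeholder name
spec:
  type: NodePort
  externalTrafficPolicy: Local # avoids the Windows Server 2022 overlay NodePort issue
  selector:
    app: win-webserver
  ports:
  - port: 80
    targetPort: 80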
Windows container networking differs in some important ways from Linux networking. The Microsoft documentation for Windows Container Networking provides additional details and background.
On Windows, you can use the following settings to configure Services and load balancing behavior:
Feature | Description | Supported Kubernetes version | Supported Windows OS build | How to enable |
---|---|---|---|---|
Session affinity | Ensures that connections from a particular client are passed to the same Pod each time. | v1.20+ | Windows Server vNext Insider Preview Build 19551 (or higher) | Set service.spec.sessionAffinity to "ClientIP" |
Direct Server Return (DSR) | Load balancing mode where the IP address fixups and the LBNAT occurs at the container vSwitch port directly; service traffic arrives with the source IP set as the originating pod IP. | v1.20+ | Windows Server 2019 | Set the following flags in kube-proxy: --feature-gates="WinDSR=true" --enable-dsr=true |
Preserve-Destination | Skips DNAT of service traffic, thereby preserving the virtual IP of the target service in packets reaching the backend Pod. Also disables node-node forwarding. | v1.20+ | Windows Server, version 1903 (or higher) | Set "preserve-destination": "true" in service annotations and enable DSR in kube-proxy. |
IPv4/IPv6 dual-stack networking | Native IPv4-to-IPv4 in parallel with IPv6-to-IPv6 communications to, from, and within a cluster | v1.19+ | Windows Server, version 2019 | See IPv4/IPv6 dual-stack |
Client IP preservation | Ensures that source IP of incoming ingress traffic gets preserved. Also disables node-node forwarding. | v1.20+ | Windows Server, version 2019 | Set service.spec.externalTrafficPolicy to "Local" and enable DSR in kube-proxy |
Session affinity
Setting the maximum session sticky time for Windows services using service.spec.sessionAffinityConfig.clientIP.timeoutSeconds is not supported.
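For reference, a minimal sketch of enabling session affinity (names are placeholders); per the table above, only the sessionAffinity setting itself is configurable on Windows:
apiVersion: v1
kind: Service
metadata:
  name: example-svc            # placeholder name
spec:
  selector:
    app: example
  sessionAffinity: ClientIP    # supported on Windows (see table above)
  # sessionAffinityConfig.clientIP.timeoutSeconds is not supported on Windows
  ports:
  - port: 80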
DNS
- ClusterFirstWithHostNet is not supported for DNS. Windows treats all names with a . as a FQDN and skips PQDN resolution.
- On Linux, you have a DNS suffix list, which is used when trying to resolve PQDNs. On Windows, you can only have 1 DNS suffix, which is the DNS suffix associated with that pod's namespace (mydns.svc.cluster.local for example). Windows can resolve FQDNs and services or names resolvable with just that suffix. For example, a pod spawned in the default namespace will have the DNS suffix default.svc.cluster.local. Inside a Windows pod, you can resolve both kubernetes.default.svc.cluster.local and kubernetes, but not the in-betweens, like kubernetes.default or kubernetes.default.svc.
- On Windows, there are multiple DNS resolvers that can be used. As these come with slightly different behaviors, using the Resolve-DNSName utility for name query resolutions is recommended.
IPv6 networking
Kubernetes on Windows does not support single-stack "IPv6-only" networking. However, dual-stack IPv4/IPv6 networking for pods and nodes with single-family services is supported.
You can use IPv4/IPv6 dual-stack networking with l2bridge networks. See configure IPv4/IPv6 dual stack for more details.
Persistent storage
Windows has a layered filesystem driver to mount container layers and create a copy filesystem based on NTFS. All file paths in the container are resolved only within the context of that container.
- With Docker, volume mounts can only target a directory in the container, and not an individual file. This limitation does not exist with CRI-containerD runtime.
- Volume mounts cannot project files or directories back to the host filesystem.
- Read-only filesystems are not supported because write access is always required for the Windows registry and SAM database. However, read-only volumes are supported.
- Volume user-masks and permissions are not available. Because the SAM is not shared between the host & container, there's no mapping between them. All permissions are resolved within the context of the container.
As a result, the following storage functionality is not supported on Windows nodes:
- Volume subpath mounts: only the entire volume can be mounted in a Windows container
- Subpath volume mounting for Secrets
- Host mount projection
- Read-only root filesystem (mapped volumes still support readOnly)
- Block device mapping
- Memory as the storage medium (for example, emptyDir.medium set to Memory)
- File system features like uid/gid; per-user Linux filesystem permissions
- DefaultMode (due to UID/GID dependency)
- NFS based storage/volume support
- Expanding the mounted volume (resizefs)
Kubernetes volumes enable complex applications, with data persistence and Pod volume sharing requirements, to be deployed on Kubernetes. Management of persistent volumes associated with a specific storage back-end or protocol includes actions such as provisioning/de-provisioning/resizing of volumes, attaching/detaching a volume to/from a Kubernetes node and mounting/dismounting a volume to/from individual containers in a pod that needs to persist data.
The code implementing these volume management actions for a specific storage back-end or protocol is shipped in the form of a Kubernetes volume plugin. The following broad classes of Kubernetes volume plugins are supported on Windows:
In-tree volume plugins
Code associated with in-tree volume plugins ships as part of the core Kubernetes code base. Deployment of in-tree volume plugins does not require installation of additional scripts or deployment of separate containerized plugin components. These plugins can handle provisioning/de-provisioning and resizing of volumes in the storage backend, attaching/detaching of volumes to/from a Kubernetes node, and mounting/dismounting a volume to/from individual containers in a pod. Several in-tree plugins support persistent storage on Windows nodes.
FlexVolume plugins
Code associated with FlexVolume plugins ships as out-of-tree scripts or binaries that need to be deployed directly on the host. FlexVolume plugins handle attaching/detaching of volumes to/from a Kubernetes node and mounting/dismounting a volume to/from individual containers in a pod. Provisioning/de-provisioning of persistent volumes associated with FlexVolume plugins may be handled through an external provisioner that is typically separate from the FlexVolume plugins. FlexVolume plugins deployed as PowerShell scripts on the host can support Windows nodes.
CSI plugins
Kubernetes v1.19 [beta]
Code associated with CSI plugins ships as out-of-tree scripts and binaries that are typically distributed as container images and deployed using standard Kubernetes constructs like DaemonSets and StatefulSets. CSI plugins handle a wide range of volume management actions in Kubernetes: provisioning/de-provisioning/resizing of volumes, attaching/detaching of volumes to/from a Kubernetes node, mounting/dismounting a volume to/from individual containers in a pod, and backup/restore of persistent data using snapshots and cloning. CSI plugins typically consist of node plugins (that run on each node as a DaemonSet) and controller plugins.
CSI node plugins (especially those associated with persistent volumes exposed as either block devices or over a shared file-system) need to perform various privileged operations like scanning of disk devices, mounting of file systems, etc. These operations differ for each host operating system. For Linux worker nodes, containerized CSI node plugins are typically deployed as privileged containers. For Windows worker nodes, privileged operations for containerized CSI node plugins are supported using csi-proxy, a community-managed, stand-alone binary that needs to be pre-installed on each Windows node.
For more details, refer to the deployment guide of the CSI plugin you wish to deploy.
Command line options for the kubelet
Some kubelet command line options behave differently on Windows, as described below:
- The --windows-priorityclass flag lets you set the scheduling priority of the kubelet process (see CPU resource management)
- The --kube-reserved, --system-reserved, and --eviction-hard flags update NodeAllocatable
- Eviction by using --enforce-node-allocatable is not implemented
- Eviction by using --eviction-hard and --eviction-soft is not implemented
- A kubelet running on a Windows node does not have memory restrictions. --kube-reserved and --system-reserved do not set limits on kubelet or processes running on the host. This means kubelet or a process on the host could cause memory resource starvation outside the node-allocatable and scheduler.
- The MemoryPressure Condition is not implemented
- The kubelet does not take OOM eviction actions
API compatibility
There are no differences in how most of the Kubernetes APIs work for Windows. The subtleties around what's different come down to differences in the OS and container runtime. In certain situations, some properties on workload resources were designed under the assumption that they would be implemented on Linux, and fail to run on Windows.
At a high level, these OS concepts are different:
- Identity - Linux uses userID (UID) and groupID (GID) which are represented as integer types. User and group names are not canonical - they are just an alias in /etc/group or /etc/passwd back to UID+GID. Windows uses a larger binary security identifier (SID) which is stored in the Windows Security Access Manager (SAM) database. This database is not shared between the host and containers, or between containers.
- File permissions - Windows uses an access control list based on SIDs, whereas POSIX systems such as Linux use a bitmask based on object permissions and UID+GID, plus optional access control lists.
- File paths - the convention on Windows is to use \ instead of /. The Go IO libraries typically accept both and just make it work, but when you're setting a path or command line that's interpreted inside a container, \ may be needed.
may be needed. - Signals - Windows interactive apps handle termination differently, and can
implement one or more of these:
- A UI thread handles well-defined messages including
WM_CLOSE
. - Console apps handle Ctrl-C or Ctrl-break using a Control Handler.
- Services register a Service Control Handler function that can accept
SERVICE_CONTROL_STOP
control codes.
- A UI thread handles well-defined messages including
Container exit codes follow the same convention where 0 is success, and nonzero is failure. The specific error codes may differ across Windows and Linux. However, exit codes passed from the Kubernetes components (kubelet, kube-proxy) are unchanged.
Field compatibility for container specifications
The following list documents differences between how Pod container specifications work between Windows and Linux:
- Huge pages are not implemented in the Windows container runtime, and are not available. They require asserting a user privilege that's not configurable for containers.
- requests.cpu and requests.memory - requests are subtracted from node available resources, so they can be used to avoid overprovisioning a node. However, they cannot be used to guarantee resources in an overprovisioned node. They should be applied to all containers as a best practice if the operator wants to avoid overprovisioning entirely.
- securityContext.allowPrivilegeEscalation - not possible on Windows; none of the capabilities are hooked up
- securityContext.capabilities - POSIX capabilities are not implemented on Windows
- securityContext.privileged - Windows doesn't support privileged containers
- securityContext.procMount - Windows doesn't have a /proc filesystem
- securityContext.readOnlyRootFilesystem - not possible on Windows; write access is required for registry & system processes to run inside the container
- securityContext.runAsGroup - not possible on Windows as there is no GID support
- securityContext.runAsNonRoot - this setting will prevent containers from running as ContainerAdministrator, which is the closest equivalent to a root user on Windows.
- securityContext.runAsUser - use runAsUserName instead (see the sketch after this list)
- securityContext.seLinuxOptions - not possible on Windows as SELinux is Linux-specific
- terminationMessagePath - this has some limitations in that Windows doesn't support mapping single files. The default value is /dev/termination-log, which does work because it does not exist on Windows by default.
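To illustrate the runAsUserName substitution from the list above, here is a minimal Pod sketch (the Pod name, user name, and sleep command are illustrative placeholders):
apiVersion: v1
kind: Pod
metadata:
  name: run-as-username-demo   # placeholder name
spec:
  securityContext:
    windowsOptions:
      runAsUserName: "ContainerUser"   # use this instead of runAsUser on Windows
  containers:
  - name: demo
    image: mcr.microsoft.com/windows/servercore:ltsc2019
    command: ["powershell.exe", "-command", "whoami; Start-Sleep -Seconds 3600"]
  nodeSelector:
    kubernetes.io/os: windows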
Field compatibility for Pod specifications
The following list documents differences between how Pod specifications work between Windows and Linux:
- hostIPC and hostPID - host namespace sharing is not possible on Windows
- hostNetwork - There is no Windows OS support to share the host network
- dnsPolicy - setting the Pod dnsPolicy to ClusterFirstWithHostNet is not supported on Windows because host networking is not provided. Pods always run with a container network.
- podSecurityContext (see below)
- shareProcessNamespace - this is a beta feature, and depends on Linux namespaces which are not implemented on Windows. Windows cannot share process namespaces or the container's root filesystem. Only the network can be shared.
- terminationGracePeriodSeconds - this is not fully implemented in Docker on Windows, see the GitHub issue. The behavior today is that the ENTRYPOINT process is sent CTRL_SHUTDOWN_EVENT, then Windows waits 5 seconds by default, and finally shuts down all processes using the normal Windows shutdown behavior. The 5 second default is actually in the Windows registry inside the container, so it can be overridden when the container is built.
- volumeDevices - this is a beta feature, and is not implemented on Windows. Windows cannot attach raw block devices to pods.
- volumes
  - If you define an emptyDir volume, you cannot set its volume source to memory (see the sketch after this list).
  - You cannot enable mountPropagation for volume mounts as this is not supported on Windows.
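As a sketch of the emptyDir restriction above, a Windows-compatible volume definition simply omits medium: Memory (Pod name, container name, and mount path are placeholders):
apiVersion: v1
kind: Pod
metadata:
  name: empty-dir-demo         # placeholder name
spec:
  nodeSelector:
    kubernetes.io/os: windows
  containers:
  - name: demo
    image: mcr.microsoft.com/windows/servercore:ltsc2019
    volumeMounts:
    - name: cache
      mountPath: "C:\\cache"   # placeholder mount path
  volumes:
  - name: cache
    emptyDir: {}               # on Windows, do not set medium: Memory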
Field compatibility for Pod security context
None of the Pod securityContext fields work on Windows.
Node problem detector
The node problem detector (see Monitor Node Health) is not compatible with Windows.
Pause container
In a Kubernetes Pod, an infrastructure or “pause” container is first created to host the container. In Linux, the cgroups and namespaces that make up a pod need a process to maintain their continued existence; the pause process provides this. Containers that belong to the same pod, including infrastructure and worker containers, share a common network endpoint (same IPv4 and / or IPv6 address, same network port spaces). Kubernetes uses pause containers to allow for worker containers crashing or restarting without losing any of the networking configuration.
Kubernetes maintains a multi-architecture image that includes support for Windows.
For Kubernetes v1.23 the recommended pause image is k8s.gcr.io/pause:3.6. The source code is available on GitHub.
Microsoft maintains a different multi-architecture image, with Linux and Windows amd64 support, that you can find as mcr.microsoft.com/oss/kubernetes/pause:3.6.
This image is built from the same source as the Kubernetes maintained image but
all of the Windows binaries are authenticode signed by Microsoft.
The Kubernetes project recommends using the Microsoft maintained image if you are
deploying to a production or production-like environment that requires signed
binaries.
Container runtimes
You need to install a container runtime into each node in the cluster so that Pods can run there.
The following container runtimes work with Windows:
cri-containerd
Kubernetes v1.20 [stable]
You can use ContainerD 1.4.0+ as the container runtime for Kubernetes nodes that run Windows.
Learn how to install ContainerD on a Windows node.
Mirantis Container Runtime
Mirantis Container Runtime (MCR) is available as a container runtime for all Windows Server 2019 and later versions.
See Install MCR on Windows Servers for more information.
Windows OS version compatibility
On Windows nodes, strict compatibility rules apply where the host OS version must match the container base image OS version. Only Windows containers with a container operating system of Windows Server 2019 are fully supported.
For Kubernetes v1.23, operating system compatibility for Windows nodes (and Pods) is as follows:
- Windows Server LTSC release
- Windows Server 2019
- Windows Server 2022
- Windows Server SAC release
- Windows Server version 20H2
The Kubernetes version-skew policy also applies.
Security for Windows nodes
On Windows, data from Secrets are written out in clear text onto the node's local storage (as compared to using tmpfs / in-memory filesystems on Linux). As a cluster operator, you should take both of the following additional measures:
- Use file ACLs to secure the Secrets' file location.
- Apply volume-level encryption using BitLocker.
runAsUserName can be specified for Windows Pods or containers to execute the container processes as a specific user. This is roughly equivalent to runAsUser.
Linux-specific pod security context privileges such as SELinux, AppArmor, Seccomp, or capabilities (POSIX capabilities), and others are not supported.
Privileged containers are not supported on Windows.
Getting help and troubleshooting
When troubleshooting your Kubernetes cluster, start with the Troubleshooting page.
Some additional, Windows-specific troubleshooting help is included in this section. Logs are an important element of troubleshooting issues in Kubernetes. Make sure to include them any time you seek troubleshooting assistance from other contributors. Follow the instructions in the SIG Windows contributing guide on gathering logs.
Node-level troubleshooting
- How do I know start.ps1 completed successfully?
  You should see kubelet, kube-proxy, and (if you chose Flannel as your networking solution) flanneld host-agent processes running on your node, with running logs being displayed in separate PowerShell windows. In addition to this, your Windows node should be listed as "Ready" in your Kubernetes cluster.
- Can I configure the Kubernetes node processes to run in the background as services?
  The kubelet and kube-proxy are already configured to run as native Windows Services, offering resiliency by restarting the services automatically in the event of failure (for example a process crash). You have two options for configuring these node components as services.
- As native Windows Services
  You can run the kubelet and kube-proxy as native Windows Services using sc.exe.
  # Create the services for kubelet and kube-proxy in two separate commands
  sc.exe create <component_name> binPath= "<path_to_binary> --service <other_args>"
  # Please note that if the arguments contain spaces, they must be escaped.
  sc.exe create kubelet binPath= "C:\kubelet.exe --service --hostname-override 'minion' <other_args>"
  # Start the services
  Start-Service kubelet
  Start-Service kube-proxy
  # Stop the service
  Stop-Service kubelet (-Force)
  Stop-Service kube-proxy (-Force)
  # Query the service status
  Get-Service kubelet
  Get-Service kube-proxy
- Using nssm.exe
  You can also always use alternative service managers like nssm.exe to run these processes (flanneld, kubelet & kube-proxy) in the background for you. You can use this sample script, leveraging nssm.exe to register kubelet, kube-proxy, and flanneld.exe to run as Windows services in the background.
  register-svc.ps1 -NetworkMode <Network mode> -ManagementIP <Windows Node IP> -ClusterCIDR <Cluster subnet> -KubeDnsServiceIP <Kube-dns Service IP> -LogDir <Directory to place logs>
  # NetworkMode = The network mode l2bridge (flannel host-gw, also the default value) or overlay (flannel vxlan) chosen as a network solution
  # ManagementIP = The IP address assigned to the Windows node. You can use ipconfig to find this
  # ClusterCIDR = The cluster subnet range. (Default value 10.244.0.0/16)
  # KubeDnsServiceIP = The Kubernetes DNS service IP (Default value 10.96.0.10)
  # LogDir = The directory where kubelet and kube-proxy logs are redirected into their respective output files (Default value C:\k)
If the above referenced script is not suitable, you can manually configure nssm.exe using the following examples.
# Register flanneld.exe
nssm install flanneld C:\flannel\flanneld.exe
nssm set flanneld AppParameters --kubeconfig-file=c:\k\config --iface=<ManagementIP> --ip-masq=1 --kube-subnet-mgr=1
nssm set flanneld AppEnvironmentExtra NODE_NAME=<hostname>
nssm set flanneld AppDirectory C:\flannel
nssm start flanneld
# Register kubelet.exe
# Microsoft releases the pause infrastructure container at mcr.microsoft.com/oss/kubernetes/pause:3.6
nssm install kubelet C:\k\kubelet.exe
nssm set kubelet AppParameters --hostname-override=<hostname> --v=6 --pod-infra-container-image=mcr.microsoft.com/oss/kubernetes/pause:3.6 --resolv-conf="" --allow-privileged=true --enable-debugging-handlers --cluster-dns=<DNS-service-IP> --cluster-domain=cluster.local --kubeconfig=c:\k\config --hairpin-mode=promiscuous-bridge --image-pull-progress-deadline=20m --cgroups-per-qos=false --log-dir=<log directory> --logtostderr=false --enforce-node-allocatable="" --network-plugin=cni --cni-bin-dir=c:\k\cni --cni-conf-dir=c:\k\cni\config
nssm set kubelet AppDirectory C:\k
nssm start kubelet
# Register kube-proxy.exe (l2bridge / host-gw)
nssm install kube-proxy C:\k\kube-proxy.exe
nssm set kube-proxy AppDirectory c:\k
nssm set kube-proxy AppParameters --v=4 --proxy-mode=kernelspace --hostname-override=<hostname> --kubeconfig=c:\k\config --enable-dsr=false --log-dir=<log directory> --logtostderr=false
nssm.exe set kube-proxy AppEnvironmentExtra KUBE_NETWORK=cbr0
nssm set kube-proxy DependOnService kubelet
nssm start kube-proxy
# Register kube-proxy.exe (overlay / vxlan)
nssm install kube-proxy C:\k\kube-proxy.exe
nssm set kube-proxy AppDirectory c:\k
nssm set kube-proxy AppParameters --v=4 --proxy-mode=kernelspace --feature-gates="WinOverlay=true" --hostname-override=<hostname> --kubeconfig=c:\k\config --network-name=vxlan0 --source-vip=<source-vip> --enable-dsr=false --log-dir=<log directory> --logtostderr=false
nssm set kube-proxy DependOnService kubelet
nssm start kube-proxy
For initial troubleshooting, you can use the following flags in nssm.exe to redirect stdout and stderr to an output file:
nssm set <Service Name> AppStdout C:\k\mysvc.log
nssm set <Service Name> AppStderr C:\k\mysvc.log
For additional details, see NSSM - the Non-Sucking Service Manager.
- My Pods are stuck at "Container Creating" or restarting over and over
  Check that your pause image is compatible with your OS version. The instructions assume that both the OS and the containers are version 1803. If you have a later version of Windows, such as an Insider build, you need to adjust the images accordingly. See Pause container for more details.
Network troubleshooting
- My Windows Pods do not have network connectivity
  If you are using virtual machines, ensure that MAC spoofing is enabled on all the VM network adapter(s).
- My Windows Pods cannot ping external resources
  Windows Pods do not have outbound rules programmed for the ICMP protocol. However, TCP/UDP is supported. When trying to demonstrate connectivity to resources outside of the cluster, substitute ping <IP> with corresponding curl <IP> commands.
  If you are still facing problems, most likely your network configuration in cni.conf deserves some extra attention. You can always edit this static file. The configuration update will apply to any new Kubernetes resources.
One of the Kubernetes networking requirements (see Kubernetes model) is for cluster communication to occur without NAT internally. To honor this requirement, there is an ExceptionList for all the communication where you do not want outbound NAT to occur. However, this also means that you need to exclude the external IP you are trying to query from the ExceptionList. Only then will the traffic originating from your Windows pods be SNAT'ed correctly to receive a response from the outside world. In this regard, your ExceptionList in cni.conf should look as follows:
"ExceptionList": [
  "10.244.0.0/16",   # Cluster subnet
  "10.96.0.0/12",    # Service subnet
  "10.127.130.0/24"  # Management (host) subnet
]
- My Windows node cannot access NodePort type Services
  Local NodePort access from the node itself fails. This is a known limitation. NodePort access works from other nodes or external clients.
- vNICs and HNS endpoints of containers are being deleted
  This issue can be caused when the hostname-override parameter is not passed to kube-proxy. To resolve it, users need to pass the hostname to kube-proxy as follows:
  C:\k\kube-proxy.exe --hostname-override=$(hostname)
- With flannel, my nodes are having issues after rejoining a cluster
  Whenever a previously deleted node is being re-joined to the cluster, flannelD tries to assign a new pod subnet to the node. Users should remove the old pod subnet configuration files in the following paths:
  Remove-Item C:\k\SourceVip.json
  Remove-Item C:\k\SourceVipRequest.json
- After launching start.ps1, flanneld is stuck in "Waiting for the Network to be created"
  There are numerous reports of this issue; most likely it is a timing issue for when the management IP of the flannel network is set. A workaround is to relaunch start.ps1 or relaunch it manually as follows:
  [Environment]::SetEnvironmentVariable("NODE_NAME", "<Windows_Worker_Hostname>")
  C:\flannel\flanneld.exe --kubeconfig-file=c:\k\config --iface=<Windows_Worker_Node_IP> --ip-masq=1 --kube-subnet-mgr=1
- My Windows Pods cannot launch because of missing /run/flannel/subnet.env
  This indicates that Flannel didn't launch correctly. You can either try to restart flanneld.exe or you can copy the files over manually from /run/flannel/subnet.env on the Kubernetes master to C:\run\flannel\subnet.env on the Windows worker node and modify the FLANNEL_SUBNET row to a different number. For example, if node subnet 10.244.4.1/24 is desired:
  FLANNEL_NETWORK=10.244.0.0/16
  FLANNEL_SUBNET=10.244.4.1/24
  FLANNEL_MTU=1500
  FLANNEL_IPMASQ=true
- My Windows node cannot access my services using the service IP
  This is a known limitation of the networking stack on Windows. However, Windows Pods can access the Service IP.
- No network adapter is found when starting the kubelet
  The Windows networking stack needs a virtual adapter for Kubernetes networking to work. If the following commands return no results (in an admin shell), virtual network creation, a necessary prerequisite for the kubelet to work, has failed:
  Get-HnsNetwork | ? Name -ieq "cbr0"
  Get-NetAdapter | ? Name -Like "vEthernet (Ethernet*"
  Often it is worthwhile to modify the InterfaceName parameter of the start.ps1 script, in cases where the host's network adapter isn't "Ethernet". Otherwise, consult the output of the start-kubelet.ps1 script to see if there are errors during virtual network creation.
- DNS resolution is not properly working
  Check the DNS limitations for Windows in this section.
- kubectl port-forward fails with "unable to do port forwarding: wincat not found"
  This was implemented in Kubernetes 1.15 by including wincat.exe in the pause infrastructure container mcr.microsoft.com/oss/kubernetes/pause:3.6. Be sure to use a supported version of Kubernetes. If you would like to build your own pause infrastructure container, be sure to include wincat.
My Kubernetes installation is failing because my Windows Server node is behind a proxy
If you are behind a proxy, the following PowerShell environment variables must be defined:
[Environment]::SetEnvironmentVariable("HTTP_PROXY", "http://proxy.example.com:80/", [EnvironmentVariableTarget]::Machine) [Environment]::SetEnvironmentVariable("HTTPS_PROXY", "http://proxy.example.com:443/", [EnvironmentVariableTarget]::Machine)
Further investigation
If these steps don't resolve your problem, you can get help running Windows containers on Windows nodes in Kubernetes through:
- StackOverflow Windows Server Container topic
- Kubernetes Official Forum discuss.kubernetes.io
- Kubernetes Slack #SIG-Windows Channel
Reporting issues and feature requests
If you have what looks like a bug, or you would like to make a feature request, please use the GitHub issue tracking system. You can open issues on GitHub and assign them to SIG-Windows. First search the list of issues in case it was reported previously; if it was, comment with your experience and add additional logs. SIG-Windows Slack is also a great avenue to get some initial support and troubleshooting ideas prior to creating a ticket.
If filing a bug, please include detailed information about how to reproduce the problem, such as:
- Kubernetes version: output from
kubectl version
- Environment details: Cloud provider, OS distro, networking choice and configuration, and Docker version
- Detailed steps to reproduce the problem
- Relevant logs
It helps if you tag the issue as sig/windows by commenting on the issue with /sig windows. This helps to bring the issue to a SIG Windows member's attention.
What's next
Deployment tools
The kubeadm tool helps you to deploy a Kubernetes cluster, providing the control plane to manage the cluster, and nodes to run your workloads. Adding Windows nodes explains how to deploy Windows nodes to your cluster using kubeadm.
The Kubernetes cluster API project also provides means to automate deployment of Windows nodes.
Windows distribution channels
For a detailed explanation of Windows distribution channels see the Microsoft documentation.
Information on the different Windows Server servicing channels including their support models can be found at Windows Server servicing channels.
2.2.4.2 - Guide for scheduling Windows containers in Kubernetes
Windows applications constitute a large portion of the services and applications that run in many organizations. This guide walks you through the steps to configure and deploy a Windows container in Kubernetes.
Objectives
- Configure an example deployment to run Windows containers on the Windows node
- (Optional) Configure an Active Directory Identity for your Pod using Group Managed Service Accounts (GMSA)
Before you begin
- Create a Kubernetes cluster that includes a control plane and a worker node running Windows Server
- It is important to note that creating and deploying services and workloads on Kubernetes behaves in much the same way for Linux and Windows containers. Kubectl commands to interface with the cluster are identical. The example in the section below is provided to jumpstart your experience with Windows containers.
Getting Started: Deploying a Windows container
To deploy a Windows container on Kubernetes, you must first create an example application.
The example YAML file below creates a simple webserver application.
Create a service spec named win-webserver.yaml with the contents below:
apiVersion: v1
kind: Service
metadata:
name: win-webserver
labels:
app: win-webserver
spec:
ports:
# the port that this service should serve on
- port: 80
targetPort: 80
selector:
app: win-webserver
type: NodePort
---
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
app: win-webserver
name: win-webserver
spec:
replicas: 2
selector:
matchLabels:
app: win-webserver
template:
metadata:
labels:
app: win-webserver
name: win-webserver
spec:
containers:
- name: windowswebserver
image: mcr.microsoft.com/windows/servercore:ltsc2019
command:
- powershell.exe
- -command
- "<#code used from https://gist.github.com/19WAS85/5424431#> ; $$listener = New-Object System.Net.HttpListener ; $$listener.Prefixes.Add('http://*:80/') ; $$listener.Start() ; $$callerCounts = @{} ; Write-Host('Listening at http://*:80/') ; while ($$listener.IsListening) { ;$$context = $$listener.GetContext() ;$$requestUrl = $$context.Request.Url ;$$clientIP = $$context.Request.RemoteEndPoint.Address ;$$response = $$context.Response ;Write-Host '' ;Write-Host('> {0}' -f $$requestUrl) ; ;$$count = 1 ;$$k=$$callerCounts.Get_Item($$clientIP) ;if ($$k -ne $$null) { $$count += $$k } ;$$callerCounts.Set_Item($$clientIP, $$count) ;$$ip=(Get-NetAdapter | Get-NetIpAddress); $$header='<html><body><H1>Windows Container Web Server</H1>' ;$$callerCountsString='' ;$$callerCounts.Keys | % { $$callerCountsString+='<p>IP {0} callerCount {1} ' -f $$ip[1].IPAddress,$$callerCounts.Item($$_) } ;$$footer='</body></html>' ;$$content='{0}{1}{2}' -f $$header,$$callerCountsString,$$footer ;Write-Output $$content ;$$buffer = [System.Text.Encoding]::UTF8.GetBytes($$content) ;$$response.ContentLength64 = $$buffer.Length ;$$response.OutputStream.Write($$buffer, 0, $$buffer.Length) ;$$response.Close() ;$$responseStatus = $$response.StatusCode ;Write-Host('< {0}' -f $$responseStatus) } ; "
nodeSelector:
kubernetes.io/os: windows
- Check that all nodes are healthy:
  kubectl get nodes
- Deploy the service and watch for pod updates:
  kubectl apply -f win-webserver.yaml
  kubectl get pods -o wide -w
  When the service is deployed correctly both Pods are marked as Ready. To exit the watch command, press Ctrl+C.
- Check that the deployment succeeded. To verify:
  - Two containers per pod on the Windows node, use docker ps
  - Two pods listed from the Linux control plane node, use kubectl get pods
  - Node-to-pod communication across the network, curl port 80 of your pod IPs from the Linux control plane node to check for a web server response
  - Pod-to-pod communication, ping between pods (and across hosts, if you have more than one Windows node) using docker exec or kubectl exec
  - Service-to-pod communication, curl the virtual service IP (seen under kubectl get services) from the Linux control plane node and from individual pods
  - Service discovery, curl the service name with the Kubernetes default DNS suffix
  - Inbound connectivity, curl the NodePort from the Linux control plane node or machines outside of the cluster
  - Outbound connectivity, curl external IPs from inside the pod using kubectl exec
Observability
Capturing logs from workloads
Logs are an important element of observability; they enable users to gain insights
into the operational aspect of workloads and are a key ingredient to troubleshooting issues.
Because Windows containers and workloads inside Windows containers behave differently from Linux containers,
users had a hard time collecting logs, limiting operational visibility.
Windows workloads for example are usually configured to log to ETW (Event Tracing for Windows)
or push entries to the application event log.
LogMonitor, an open source tool by Microsoft,
is the recommended way to monitor configured log sources inside a Windows container.
LogMonitor supports monitoring event logs, ETW providers, and custom application logs, piping them to STDOUT for consumption by kubectl logs <pod>.
Follow the instructions in the LogMonitor GitHub page to copy its binaries and configuration files to all your containers and add the necessary entrypoints for LogMonitor to push your logs to STDOUT.
Using configurable Container usernames
Starting with Kubernetes v1.16, Windows containers can be configured to run their entrypoints and processes with different usernames than the image defaults. The way this is achieved is a bit different from the way it is done for Linux containers. Learn more about it here.
Managing Workload Identity with Group Managed Service Accounts
Starting with Kubernetes v1.14, Windows container workloads can be configured to use Group Managed Service Accounts (GMSA). Group Managed Service Accounts are a specific type of Active Directory account that provides automatic password management, simplified service principal name (SPN) management, and the ability to delegate the management to other administrators across multiple servers. Containers configured with a GMSA can access external Active Directory Domain resources while carrying the identity configured with the GMSA. Learn more about configuring and using GMSA for Windows containers here.
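As a hedged sketch of selecting a GMSA for a Pod (the Pod name and the credential spec name gmsa-webapp1 are hypothetical; the name must reference a GMSACredentialSpec resource already configured in your cluster):
apiVersion: v1
kind: Pod
metadata:
  name: gmsa-demo              # placeholder name
spec:
  securityContext:
    windowsOptions:
      gmsaCredentialSpecName: gmsa-webapp1   # hypothetical GMSACredentialSpec name
  containers:
  - name: demo
    image: mcr.microsoft.com/windows/servercore:ltsc2019
  nodeSelector:
    kubernetes.io/os: windows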
Taints and Tolerations
Users today need to use some combination of taints and node selectors in order to keep Linux and Windows workloads on their respective OS-specific nodes. This likely imposes a burden only on Windows users. The recommended approach is outlined below, with one of its main goals being that this approach should not break compatibility for existing Linux workloads.
If the IdentifyPodOS feature gate is enabled, you can (and should) set .spec.os.name for a Pod to indicate the operating system that the containers in that Pod are designed for. For Pods that run Linux containers, set .spec.os.name to linux. For Pods that run Windows containers, set .spec.os.name to windows.
The scheduler does not use the value of .spec.os.name when assigning Pods to nodes. You should use normal Kubernetes mechanisms for assigning pods to nodes to ensure that the control plane for your cluster places pods onto nodes that are running the appropriate operating system. The .spec.os.name value has no effect on the scheduling of the Windows pods, so taints and tolerations and node selectors are still required to ensure that the Windows pods land onto appropriate Windows nodes.
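Assuming the IdentifyPodOS feature gate is enabled, a minimal sketch combining the OS declaration with the still-required nodeSelector (the Pod name is a placeholder):
apiVersion: v1
kind: Pod
metadata:
  name: windows-pod-demo       # placeholder name
spec:
  os:
    name: windows              # informational; does not influence scheduling
  nodeSelector:
    kubernetes.io/os: windows  # still required to land on a Windows node
  containers:
  - name: demo
    image: mcr.microsoft.com/windows/servercore:ltsc2019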
Ensuring OS-specific workloads land on the appropriate container host
Users can ensure Windows containers can be scheduled on the appropriate host using Taints and Tolerations. All Kubernetes nodes today have the following default labels:
- kubernetes.io/os = [windows|linux]
- kubernetes.io/arch = [amd64|arm64|...]
If a Pod specification does not specify a nodeSelector like "kubernetes.io/os": windows, it is possible the Pod can be scheduled on any host, Windows or Linux.
This can be problematic since a Windows container can only run on Windows and a Linux container can only run on Linux.
The best practice is to use a nodeSelector.
However, we understand that in many cases users have a pre-existing large number of deployments for Linux containers, as well as an ecosystem of off-the-shelf configurations, such as community Helm charts, and programmatic Pod generation cases, such as with Operators. In those situations, you may be hesitant to make the configuration change to add nodeSelectors. The alternative is to use Taints. Because the kubelet can set Taints during registration, it could easily be modified to automatically add a taint when running on Windows only.
For example: --register-with-taints='os=windows:NoSchedule'
By adding a taint to all Windows nodes, nothing will be scheduled on them (that includes existing Linux Pods). In order for a Windows Pod to be scheduled on a Windows node, it would need both the nodeSelector and the appropriate matching toleration to choose Windows.
nodeSelector:
kubernetes.io/os: windows
node.kubernetes.io/windows-build: '10.0.17763'
tolerations:
- key: "os"
operator: "Equal"
value: "windows"
effect: "NoSchedule"
Handling multiple Windows versions in the same cluster
The Windows Server version used by each pod must match that of the node. If you want to use multiple Windows Server versions in the same cluster, then you should set additional node labels and nodeSelectors.
Kubernetes 1.17 automatically adds a new label node.kubernetes.io/windows-build to simplify this. If you're running an older version, then it's recommended to add this label manually to Windows nodes.
This label reflects the Windows major, minor, and build number that need to match for compatibility. Here are values used today for each Windows Server version.
Product Name | Build Number(s) |
---|---|
Windows Server 2019 | 10.0.17763 |
Windows Server version 1809 | 10.0.17763 |
Windows Server version 1903 | 10.0.18362 |
Simplifying with RuntimeClass
RuntimeClass can be used to simplify the process of using taints and tolerations.
A cluster administrator can create a RuntimeClass object which is used to encapsulate these taints and tolerations.
- Save this file to runtimeClasses.yml. It includes the appropriate nodeSelector for the Windows OS, architecture, and version.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
name: windows-2019
handler: 'docker'
scheduling:
nodeSelector:
kubernetes.io/os: 'windows'
kubernetes.io/arch: 'amd64'
node.kubernetes.io/windows-build: '10.0.17763'
tolerations:
- effect: NoSchedule
key: os
operator: Equal
value: "windows"
- Run kubectl create -f runtimeClasses.yml as a cluster administrator
- Add runtimeClassName: windows-2019 as appropriate to Pod specs
For example:
apiVersion: apps/v1
kind: Deployment
metadata:
name: iis-2019
labels:
app: iis-2019
spec:
replicas: 1
template:
metadata:
name: iis-2019
labels:
app: iis-2019
spec:
runtimeClassName: windows-2019
containers:
- name: iis
image: mcr.microsoft.com/windows/servercore/iis:windowsservercore-ltsc2019
resources:
limits:
cpu: 1
memory: 800Mi
requests:
cpu: .1
memory: 300Mi
ports:
- containerPort: 80
selector:
matchLabels:
app: iis-2019
---
apiVersion: v1
kind: Service
metadata:
name: iis
spec:
type: LoadBalancer
ports:
- protocol: TCP
port: 80
selector:
app: iis-2019
2.3 - Best practices
2.3.1 - Considerations for large clusters
A cluster is a set of nodes (physical or virtual machines) running Kubernetes agents, managed by the control plane. Kubernetes v1.23 supports clusters with up to 5000 nodes. More specifically, Kubernetes is designed to accommodate configurations that meet all of the following criteria:
- No more than 110 pods per node
- No more than 5000 nodes
- No more than 150000 total pods
- No more than 300000 total containers
You can scale your cluster by adding or removing nodes. The way you do this depends on how your cluster is deployed.
Cloud provider resource quotas
To avoid running into cloud provider quota issues, when creating a cluster with many nodes, consider:
- Requesting a quota increase for cloud resources such as:
- Compute instances
- CPUs
- Storage volumes
- In-use IP addresses
- Packet filtering rule sets
- Number of load balancers
- Network subnets
- Log streams
- Gating the cluster scaling actions to bring up new nodes in batches, with a pause between batches, because some cloud providers rate limit the creation of new instances.
Control plane components
For a large cluster, you need a control plane with sufficient compute and other resources.
Typically you would run one or two control plane instances per failure zone, scaling those instances vertically first and then scaling horizontally after reaching the point of diminishing returns to (vertical) scaling.
You should run at least one instance per failure zone to provide fault-tolerance. Kubernetes nodes do not automatically steer traffic towards control-plane endpoints that are in the same failure zone; however, your cloud provider might have its own mechanisms to do this.
For example, using a managed load balancer, you configure the load balancer to send traffic that originates from the kubelet and Pods in failure zone A, and direct that traffic only to the control plane hosts that are also in zone A. If a single control-plane host, or endpoint in failure zone A, goes offline, that means that all the control-plane traffic for nodes in zone A is now being sent between zones. Running multiple control plane hosts in each zone makes that outcome less likely.
etcd storage
To improve performance of large clusters, you can store Event objects in a separate dedicated etcd instance.
When creating a cluster, you can (using custom tooling):
- start and configure an additional etcd instance
- configure the API server to use it for storing events
See Operating etcd clusters for Kubernetes and Set up a High Availability etcd cluster with kubeadm for details on configuring and managing etcd for a large cluster.
Addon resources
Kubernetes resource limits help to minimize the impact of memory leaks and other ways that pods and containers can impact other components. These resource limits apply to addon resources just as they apply to application workloads.
For example, you can set CPU and memory limits for a logging component:
...
containers:
- name: fluentd-cloud-logging
image: fluent/fluentd-kubernetes-daemonset:v1
resources:
limits:
cpu: 100m
memory: 200Mi
Addons' default limits are typically based on data collected from experience running each addon on small or medium Kubernetes clusters. When running on large clusters, addons often consume more of some resources than their default limits. If a large cluster is deployed without adjusting these values, the addon(s) may continuously get killed because they keep hitting the memory limit. Alternatively, the addon may run but with poor performance due to CPU time slice restrictions.
To avoid running into cluster addon resource issues, when creating a cluster with many nodes, consider the following:
- Some addons scale vertically - there is one replica of the addon for the cluster or serving a whole failure zone. For these addons, increase requests and limits as you scale out your cluster.
- Many addons scale horizontally - you add capacity by running more pods - but with a very large cluster you may also need to raise CPU or memory limits slightly. The VerticalPodAutoscaler can run in recommender mode to provide suggested figures for requests and limits.
- Some addons run as one copy per node, controlled by a DaemonSet: for example, a node-level log aggregator. Similar to the case with horizontally-scaled addons, you may also need to raise CPU or memory limits slightly.
What's next
VerticalPodAutoscaler is a custom resource that you can deploy into your cluster to help you manage resource requests and limits for pods. Visit Vertical Pod Autoscaler to learn more about VerticalPodAutoscaler and how you can use it to scale cluster components, including cluster-critical addons.
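As a hedged sketch of the recommendation-only mode mentioned earlier (this assumes the VPA custom resources and components are installed in your cluster; my-addon is a placeholder Deployment name):
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-addon-vpa           # placeholder name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-addon             # placeholder addon Deployment
  updatePolicy:
    updateMode: "Off"          # recommend only; do not evict or update Pods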
The cluster autoscaler integrates with a number of cloud providers to help you run the right number of nodes for the level of resource demand in your cluster.
The addon resizer helps you in resizing the addons automatically as your cluster's scale changes.
2.3.2 - Running in multiple zones
This page describes running Kubernetes across multiple zones.
Background
Kubernetes is designed so that a single Kubernetes cluster can run across multiple failure zones, typically where these zones fit within a logical grouping called a region. Major cloud providers define a region as a set of failure zones (also called availability zones) that provide a consistent set of features: within a region, each zone offers the same APIs and services.
Typical cloud architectures aim to minimize the chance that a failure in one zone also impairs services in another zone.
Control plane behavior
All control plane components support running as a pool of interchangeable resources, replicated per component.
When you deploy a cluster control plane, place replicas of control plane components across multiple failure zones. If availability is an important concern, select at least three failure zones and replicate each individual control plane component (API server, scheduler, etcd, cluster controller manager) across at least three failure zones. If you are running a cloud controller manager then you should also replicate this across all the failure zones you selected.
Node behavior
Kubernetes automatically spreads the Pods for workload resources (such as Deployment or StatefulSet) across different nodes in a cluster. This spreading helps reduce the impact of failures.
When nodes start up, the kubelet on each node automatically adds labels to the Node object that represents that specific kubelet in the Kubernetes API. These labels can include zone information.
If your cluster spans multiple zones or regions, you can use node labels in conjunction with Pod topology spread constraints to control how Pods are spread across your cluster among fault domains: regions, zones, and even specific nodes. These hints enable the scheduler to place Pods for better expected availability, reducing the risk that a correlated failure affects your whole workload.
For example, you can set a constraint to make sure that the 3 replicas of a StatefulSet are all running in different zones to each other, whenever that is feasible. You can define this declaratively without explicitly defining which availability zones are in use for each workload.
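A minimal sketch of such a constraint (the Pod name and app label are placeholders) that spreads replicas across zones where feasible:
apiVersion: v1
kind: Pod
metadata:
  name: spread-demo            # placeholder name
  labels:
    app: spread-demo
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway   # soft; use DoNotSchedule for a hard constraint
    labelSelector:
      matchLabels:
        app: spread-demo
  containers:
  - name: pause
    image: k8s.gcr.io/pause:3.6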
Distributing nodes across zones
Kubernetes' core does not create nodes for you; you need to do that yourself, or use a tool such as the Cluster API to manage nodes on your behalf.
Using tools such as the Cluster API you can define sets of machines to run as worker nodes for your cluster across multiple failure domains, and rules to automatically heal the cluster in case of whole-zone service disruption.
Manual zone assignment for Pods
You can apply node selector constraints to Pods that you create, as well as to Pod templates in workload resources such as Deployment, StatefulSet, or Job.
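For illustration (the zone value us-east-1a is hypothetical; use the zone label values your nodes actually carry), a node selector pinning a Pod to one zone:
apiVersion: v1
kind: Pod
metadata:
  name: zone-pinned-demo       # placeholder name
spec:
  nodeSelector:
    topology.kubernetes.io/zone: us-east-1a   # hypothetical zone label value
  containers:
  - name: pause
    image: k8s.gcr.io/pause:3.6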
Storage access for zones
When persistent volumes are created, the PersistentVolumeLabel admission controller automatically adds zone labels to any PersistentVolumes that are linked to a specific zone. The scheduler then ensures, through its NoVolumeZoneConflict predicate, that pods which claim a given PersistentVolume are only placed into the same zone as that volume.
You can specify a StorageClass for PersistentVolumeClaims that specifies the failure domains (zones) that the storage in that class may use. To learn about configuring a StorageClass that is aware of failure domains or zones, see Allowed topologies.
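A hedged sketch of a zone-restricted StorageClass (the class name, provisioner, and zone values are placeholders for your environment; check your provisioner's documentation for the topology keys it supports):
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: zoned-standard          # placeholder name
provisioner: kubernetes.io/gce-pd   # example in-tree provisioner; substitute your own
volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
- matchLabelExpressions:
  - key: topology.kubernetes.io/zone
    values:
    - us-central1-a             # hypothetical zones
    - us-central1-b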
Networking
By itself, Kubernetes does not include zone-aware networking. You can use a network plugin to configure cluster networking, and that network solution might have zone-specific elements. For example, if your cloud provider supports Services with type=LoadBalancer, the load balancer might only send traffic to Pods running in the same zone as the load balancer element processing a given connection. Check your cloud provider's documentation for details.
For custom or on-premises deployments, similar considerations apply. Service and Ingress behavior, including handling of different failure zones, does vary depending on exactly how your cluster is set up.
Fault recovery
When you set up your cluster, you might also need to consider whether and how
your setup can restore service if all the failure zones in a region go
off-line at the same time. For example, do you rely on there being at least
one node able to run Pods in a zone?
Make sure that any cluster-critical repair work does not rely
on there being at least one healthy node in your cluster. For example: if all nodes
are unhealthy, you might need to run a repair Job with a special
toleration so that the repair
can complete enough to bring at least one node into service.
Kubernetes doesn't come with an answer for this challenge; however, it's something to consider.
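As a sketch under stated assumptions (the Job name, repair image, and the choice of taint to tolerate are all hypothetical; adapt them to the taints your unhealthy nodes actually carry), a repair Job with a special toleration might look like:
apiVersion: batch/v1
kind: Job
metadata:
  name: cluster-repair          # placeholder name
spec:
  template:
    spec:
      tolerations:
      - key: "node.kubernetes.io/not-ready"
        operator: "Exists"      # no effect specified: tolerates NoSchedule and NoExecute
      containers:
      - name: repair
        image: example.com/repair:latest   # hypothetical repair image
      restartPolicy: Never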
What's next
To learn how the scheduler places Pods in a cluster, honoring the configured constraints, visit Scheduling and Eviction.
2.3.3 - Validate node setup
Node Conformance Test
Node conformance test is a containerized test framework that provides a system verification and functionality test for a node. The test validates whether the node meets the minimum requirements for Kubernetes; a node that passes the test is qualified to join a Kubernetes cluster.
Node Prerequisite
To run node conformance test, a node must satisfy the same prerequisites as a standard Kubernetes node. At a minimum, the node should have the following daemons installed:
- Container Runtime (Docker)
- Kubelet
Running Node Conformance Test
To run the node conformance test, perform the following steps:
- Work out the value of the --kubeconfig option for the kubelet; for example: --kubeconfig=/var/lib/kubelet/config.yaml. Because the test framework starts a local control plane to test the kubelet, use http://localhost:8080 as the URL of the API server. There are some other kubelet command line parameters you may want to use:
  - --pod-cidr: If you are using kubenet, you should specify an arbitrary CIDR to the kubelet, for example --pod-cidr=10.180.0.0/24.
  - --cloud-provider: If you are using --cloud-provider=gce, you should remove the flag to run the test.
- Run the node conformance test with the following command:
# $CONFIG_DIR is the pod manifest path of your Kubelet.
# $LOG_DIR is the test output path.
sudo docker run -it --rm --privileged --net=host \
-v /:/rootfs -v $CONFIG_DIR:$CONFIG_DIR -v $LOG_DIR:/var/result \
k8s.gcr.io/node-test:0.2
Running Node Conformance Test for Other Architectures
Kubernetes also provides node conformance test docker images for other architectures:
Arch | Image |
---|---|
amd64 | node-test-amd64 |
arm | node-test-arm |
arm64 | node-test-arm64 |
Running Selected Test
To run specific tests, overwrite the environment variable FOCUS with the regular expression of tests you want to run.
# Only run the MirrorPod test
sudo docker run -it --rm --privileged --net=host \
  -v /:/rootfs:ro -v $CONFIG_DIR:$CONFIG_DIR -v $LOG_DIR:/var/result \
  -e FOCUS=MirrorPod \
  k8s.gcr.io/node-test:0.2
To skip specific tests, overwrite the environment variable SKIP with the regular expression of tests you want to skip.
# Run all conformance tests but skip the MirrorPod test
sudo docker run -it --rm --privileged --net=host \
  -v /:/rootfs:ro -v $CONFIG_DIR:$CONFIG_DIR -v $LOG_DIR:/var/result \
  -e SKIP=MirrorPod \
  k8s.gcr.io/node-test:0.2
Node conformance test is a containerized version of the node e2e test. By default, it runs all conformance tests.
Theoretically, you can run any node e2e test if you configure the container and mount the required volumes properly. However, it is strongly recommended to run only the conformance tests, because running non-conformance tests requires much more complex configuration.
Caveats
- The test leaves some docker images on the node, including the node conformance test image and images of containers used in the functionality test.
- The test leaves dead containers on the node. These containers are created during the functionality test.
2.3.4 - Enforcing Pod Security Standards
This page provides an overview of best practices when it comes to enforcing Pod Security Standards.
Using the built-in Pod Security Admission Controller
Kubernetes v1.23 [beta]
The Pod Security Admission Controller intends to replace the deprecated PodSecurityPolicies.
Configure all cluster namespaces
Namespaces that lack any configuration at all should be considered significant gaps in your cluster security model. We recommend taking the time to analyze the types of workloads occurring in each namespace, and by referencing the Pod Security Standards, decide on an appropriate level for each of them. Unlabeled namespaces should only indicate that they've yet to be evaluated.
In the scenario that all workloads in all namespaces have the same security requirements, we provide an example that illustrates how the PodSecurity labels can be applied in bulk.
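For instance, a sketch that applies the same warn level to every namespace at once (the level here is illustrative; pick one appropriate for your cluster):
kubectl label --overwrite ns --all \
  pod-security.kubernetes.io/warn=baseline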
Embrace the principle of least privilege
In an ideal world, every pod in every namespace would meet the requirements of the restricted policy. However, this is neither possible nor practical, as some workloads will require elevated privileges for legitimate reasons.
- Namespaces allowing privileged workloads should establish and enforce appropriate access controls.
- For workloads running in those permissive namespaces, maintain documentation about their unique security requirements. If at all possible, consider how those requirements could be further constrained.
Adopt a multi-mode strategy
The audit and warn modes of the Pod Security admission controller make it easy to collect important security insights about your pods without breaking existing workloads.
It is good practice to enable these modes for all namespaces, setting them to the desired level and version you would eventually like to enforce. The warnings and audit annotations generated in this phase can guide you toward that state. If you expect workload authors to make changes to fit within the desired level, enable the warn mode. If you expect to use audit logs to monitor/drive changes to fit within the desired level, enable the audit mode.
When you have the enforce mode set to your desired value, these modes can still be useful in a few different ways:
- By setting warn to the same level as enforce, clients will receive warnings when attempting to create Pods (or resources that have Pod templates) that do not pass validation. This will help them update those resources to become compliant.
- In Namespaces that pin enforce to a specific non-latest version, setting the audit and warn modes to the same level as enforce, but to the latest version, gives visibility into settings that were allowed by previous versions but are not allowed per current best practices (a sketch of such a namespace follows below).
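A sketch of a namespace configured this way (the namespace name, level, and pinned version are illustrative):
apiVersion: v1
kind: Namespace
metadata:
  name: my-namespace
  labels:
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/enforce-version: v1.23
    pod-security.kubernetes.io/warn: baseline
    pod-security.kubernetes.io/warn-version: latest
    pod-security.kubernetes.io/audit: baseline
    pod-security.kubernetes.io/audit-version: latest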
Third-party alternatives
Other alternatives for enforcing security profiles are being developed in the Kubernetes ecosystem.
The decision to go with a built-in solution (e.g. the PodSecurity admission controller) versus a third-party tool depends entirely on your own situation. When evaluating any solution, trust of your supply chain is crucial. Ultimately, using any of the aforementioned approaches will be better than doing nothing.
2.3.5 - PKI certificates and requirements
Kubernetes requires PKI certificates for authentication over TLS. If you install Kubernetes with kubeadm, the certificates that your cluster requires are automatically generated. You can also generate your own certificates -- for example, to keep your private keys more secure by not storing them on the API server. This page explains the certificates that your cluster requires.
How certificates are used by your cluster
Kubernetes requires PKI for the following operations:
- Client certificates for the kubelet to authenticate to the API server
- Kubelet server certificates for the API server to talk to the kubelets
- Server certificate for the API server endpoint
- Client certificates for administrators of the cluster to authenticate to the API server
- Client certificates for the API server to talk to the kubelets
- Client certificate for the API server to talk to etcd
- Client certificate/kubeconfig for the controller manager to talk to the API server
- Client certificate/kubeconfig for the scheduler to talk to the API server
- Client and server certificates for the front-proxy
Note: front-proxy certificates are required only if you run kube-proxy to support an extension API server.
etcd also implements mutual TLS to authenticate clients and peers.
Where certificates are stored
If you install Kubernetes with kubeadm, most certificates are stored in /etc/kubernetes/pki. All paths in this documentation are relative to that directory, with the exception of user account certificates, which kubeadm places in /etc/kubernetes.
Configure certificates manually
If you don't want kubeadm to generate the required certificates, you can create them using a single root CA or by providing all certificates. See Certificates for details on creating your own certificate authority. See Certificate Management with kubeadm for more on managing certificates.
Single root CA
You can create a single root CA, controlled by an administrator. This root CA can then create multiple intermediate CAs, and delegate all further creation to Kubernetes itself.
Required CAs:
path | Default CN | description |
---|---|---|
ca.crt,key | kubernetes-ca | Kubernetes general CA |
etcd/ca.crt,key | etcd-ca | For all etcd-related functions |
front-proxy-ca.crt,key | kubernetes-front-proxy-ca | For the front-end proxy |
On top of the above CAs, it is also necessary to get a public/private key pair for service account management, sa.key and sa.pub.
The following example illustrates the CA key and certificate files shown in the previous table:
/etc/kubernetes/pki/ca.crt
/etc/kubernetes/pki/ca.key
/etc/kubernetes/pki/etcd/ca.crt
/etc/kubernetes/pki/etcd/ca.key
/etc/kubernetes/pki/front-proxy-ca.crt
/etc/kubernetes/pki/front-proxy-ca.key
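One possible way to generate the sa.key/sa.pub pair mentioned above (a sketch, assuming openssl is available on the host):
openssl genrsa -out /etc/kubernetes/pki/sa.key 2048
openssl rsa -in /etc/kubernetes/pki/sa.key -pubout -out /etc/kubernetes/pki/sa.pub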
All certificates
If you don't wish to copy the CA private keys to your cluster, you can generate all certificates yourself.
Required certificates:
Default CN | Parent CA | O (in Subject) | kind | hosts (SAN) |
---|---|---|---|---|
kube-etcd | etcd-ca | | server, client | <hostname>, <Host_IP>, localhost, 127.0.0.1 |
kube-etcd-peer | etcd-ca | | server, client | <hostname>, <Host_IP>, localhost, 127.0.0.1 |
kube-etcd-healthcheck-client | etcd-ca | | client | |
kube-apiserver-etcd-client | etcd-ca | system:masters | client | |
kube-apiserver | kubernetes-ca | | server | <hostname>, <Host_IP>, <advertise_IP>, [1] |
kube-apiserver-kubelet-client | kubernetes-ca | system:masters | client | |
front-proxy-client | kubernetes-front-proxy-ca | | client | |
[1]: any other IP or DNS name you contact your cluster on (as used by kubeadm: the load balancer stable IP and/or DNS name, kubernetes, kubernetes.default, kubernetes.default.svc, kubernetes.default.svc.cluster, kubernetes.default.svc.cluster.local)
where kind maps to one or more of the x509 key usage types:
kind | Key usage |
---|---|
server | digital signature, key encipherment, server auth |
client | digital signature, key encipherment, client auth |
For kubeadm users only:
- The scenario where you are copying to your cluster CA certificates without private keys is referred to as external CA in the kubeadm documentation.
- If you are comparing the above list with a kubeadm generated PKI, please be aware that kube-etcd, kube-etcd-peer and kube-etcd-healthcheck-client certificates are not generated in the case of external etcd.
Certificate paths
Certificates should be placed in a recommended path (as used by kubeadm). Paths should be specified using the given argument regardless of location.
Default CN | recommended key path | recommended cert path | command | key argument | cert argument |
---|---|---|---|---|---|
etcd-ca | etcd/ca.key | etcd/ca.crt | kube-apiserver | | --etcd-cafile |
kube-apiserver-etcd-client | apiserver-etcd-client.key | apiserver-etcd-client.crt | kube-apiserver | --etcd-keyfile | --etcd-certfile |
kubernetes-ca | ca.key | ca.crt | kube-apiserver | | --client-ca-file |
kubernetes-ca | ca.key | ca.crt | kube-controller-manager | --cluster-signing-key-file | --client-ca-file, --root-ca-file, --cluster-signing-cert-file |
kube-apiserver | apiserver.key | apiserver.crt | kube-apiserver | --tls-private-key-file | --tls-cert-file |
kube-apiserver-kubelet-client | apiserver-kubelet-client.key | apiserver-kubelet-client.crt | kube-apiserver | --kubelet-client-key | --kubelet-client-certificate |
front-proxy-ca | front-proxy-ca.key | front-proxy-ca.crt | kube-apiserver | | --requestheader-client-ca-file |
front-proxy-ca | front-proxy-ca.key | front-proxy-ca.crt | kube-controller-manager | | --requestheader-client-ca-file |
front-proxy-client | front-proxy-client.key | front-proxy-client.crt | kube-apiserver | --proxy-client-key-file | --proxy-client-cert-file |
etcd-ca | etcd/ca.key | etcd/ca.crt | etcd | | --trusted-ca-file, --peer-trusted-ca-file |
kube-etcd | etcd/server.key | etcd/server.crt | etcd | --key-file | --cert-file |
kube-etcd-peer | etcd/peer.key | etcd/peer.crt | etcd | --peer-key-file | --peer-cert-file |
etcd-ca | | etcd/ca.crt | etcdctl | | --cacert |
kube-etcd-healthcheck-client | etcd/healthcheck-client.key | etcd/healthcheck-client.crt | etcdctl | --key | --cert |
Same considerations apply for the service account key pair:
private key path | public key path | command | argument |
---|---|---|---|
sa.key | | kube-controller-manager | --service-account-private-key-file |
 | sa.pub | kube-apiserver | --service-account-key-file |
The following example illustrates the file paths from the previous tables you need to provide if you are generating all of your own keys and certificates:
/etc/kubernetes/pki/etcd/ca.key
/etc/kubernetes/pki/etcd/ca.crt
/etc/kubernetes/pki/apiserver-etcd-client.key
/etc/kubernetes/pki/apiserver-etcd-client.crt
/etc/kubernetes/pki/ca.key
/etc/kubernetes/pki/ca.crt
/etc/kubernetes/pki/apiserver.key
/etc/kubernetes/pki/apiserver.crt
/etc/kubernetes/pki/apiserver-kubelet-client.key
/etc/kubernetes/pki/apiserver-kubelet-client.crt
/etc/kubernetes/pki/front-proxy-ca.key
/etc/kubernetes/pki/front-proxy-ca.crt
/etc/kubernetes/pki/front-proxy-client.key
/etc/kubernetes/pki/front-proxy-client.crt
/etc/kubernetes/pki/etcd/server.key
/etc/kubernetes/pki/etcd/server.crt
/etc/kubernetes/pki/etcd/peer.key
/etc/kubernetes/pki/etcd/peer.crt
/etc/kubernetes/pki/etcd/healthcheck-client.key
/etc/kubernetes/pki/etcd/healthcheck-client.crt
/etc/kubernetes/pki/sa.key
/etc/kubernetes/pki/sa.pub
Configure certificates for user accounts
You must manually configure this administrator account and these service accounts:
filename | credential name | Default CN | O (in Subject) |
---|---|---|---|
admin.conf | default-admin | kubernetes-admin | system:masters |
kubelet.conf | default-auth | system:node:<nodeName> (see note) | system:nodes |
controller-manager.conf | default-controller-manager | system:kube-controller-manager | |
scheduler.conf | default-scheduler | system:kube-scheduler | |
Note: the value of <nodeName> for kubelet.conf must match precisely the value of the node name provided by the kubelet as it registers with the apiserver. For further details, read Node Authorization.
- For each config, generate an x509 cert/key pair with the given CN and O (a sketch using openssl appears after the commands below).
- Run kubectl as follows for each config:
KUBECONFIG=<filename> kubectl config set-cluster default-cluster --server=https://<host ip>:6443 --certificate-authority <path-to-kubernetes-ca> --embed-certs
KUBECONFIG=<filename> kubectl config set-credentials <credential-name> --client-key <path-to-key>.pem --client-certificate <path-to-cert>.pem --embed-certs
KUBECONFIG=<filename> kubectl config set-context default-system --cluster default-cluster --user <credential-name>
KUBECONFIG=<filename> kubectl config use-context default-system
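For the first step, one possible approach is openssl (a sketch; the file names, key size, and validity period are illustrative, shown here for the admin credential from the table above):
openssl genrsa -out admin.key 2048
openssl req -new -key admin.key -subj "/CN=kubernetes-admin/O=system:masters" -out admin.csr
openssl x509 -req -in admin.csr -CA ca.crt -CAkey ca.key -CAcreateserial -days 365 -out admin.crt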
These files are used as follows:
filename | command | comment |
---|---|---|
admin.conf | kubectl | Configures administrator user for the cluster |
kubelet.conf | kubelet | One required for each node in the cluster. |
controller-manager.conf | kube-controller-manager | Must be added to manifest in manifests/kube-controller-manager.yaml |
scheduler.conf | kube-scheduler | Must be added to manifest in manifests/kube-scheduler.yaml |
The following files illustrate full paths to the files listed in the previous table:
/etc/kubernetes/admin.conf
/etc/kubernetes/kubelet.conf
/etc/kubernetes/controller-manager.conf
/etc/kubernetes/scheduler.conf
3 - Concepts
The Concepts section helps you learn about the parts of the Kubernetes system and the abstractions Kubernetes uses to represent your cluster, and helps you obtain a deeper understanding of how Kubernetes works.
3.1 - Overview
3.1.1 - What is Kubernetes?
This page is an overview of Kubernetes.
Kubernetes is a portable, extensible, open-source platform for managing containerized workloads and services, that facilitates both declarative configuration and automation. It has a large, rapidly growing ecosystem. Kubernetes services, support, and tools are widely available.
The name Kubernetes originates from Greek, meaning helmsman or pilot. K8s as an abbreviation results from counting the eight letters between the "K" and the "s". Google open-sourced the Kubernetes project in 2014. Kubernetes combines over 15 years of Google's experience running production workloads at scale with best-of-breed ideas and practices from the community.
Going back in time
Let's take a look at why Kubernetes is so useful by going back in time.
Traditional deployment era: Early on, organizations ran applications on physical servers. There was no way to define resource boundaries for applications in a physical server, and this caused resource allocation issues. For example, if multiple applications run on a physical server, there can be instances where one application would take up most of the resources, and as a result, the other applications would underperform. A solution for this would be to run each application on a different physical server. But this did not scale as resources were underutilized, and it was expensive for organizations to maintain many physical servers.
Virtualized deployment era: As a solution, virtualization was introduced. It allows you to run multiple Virtual Machines (VMs) on a single physical server's CPU. Virtualization allows applications to be isolated between VMs and provides a level of security as the information of one application cannot be freely accessed by another application.
Virtualization allows better utilization of resources in a physical server and allows better scalability because an application can be added or updated easily, reduces hardware costs, and much more. With virtualization you can present a set of physical resources as a cluster of disposable virtual machines.
Each VM is a full machine running all the components, including its own operating system, on top of the virtualized hardware.
Container deployment era: Containers are similar to VMs, but they have relaxed isolation properties to share the Operating System (OS) among the applications. Therefore, containers are considered lightweight. Similar to a VM, a container has its own filesystem, share of CPU, memory, process space, and more. As they are decoupled from the underlying infrastructure, they are portable across clouds and OS distributions.
Containers have become popular because they provide extra benefits, such as:
- Agile application creation and deployment: increased ease and efficiency of container image creation compared to VM image use.
- Continuous development, integration, and deployment: provides for reliable and frequent container image build and deployment with quick and efficient rollbacks (due to image immutability).
- Dev and Ops separation of concerns: create application container images at build/release time rather than deployment time, thereby decoupling applications from infrastructure.
- Observability: not only surfaces OS-level information and metrics, but also application health and other signals.
- Environmental consistency across development, testing, and production: Runs the same on a laptop as it does in the cloud.
- Cloud and OS distribution portability: Runs on Ubuntu, RHEL, CoreOS, on-premises, on major public clouds, and anywhere else.
- Application-centric management: Raises the level of abstraction from running an OS on virtual hardware to running an application on an OS using logical resources.
- Loosely coupled, distributed, elastic, liberated micro-services: applications are broken into smaller, independent pieces and can be deployed and managed dynamically – not a monolithic stack running on one big single-purpose machine.
- Resource isolation: predictable application performance.
- Resource utilization: high efficiency and density.
Why you need Kubernetes and what it can do
Containers are a good way to bundle and run your applications. In a production environment, you need to manage the containers that run the applications and ensure that there is no downtime. For example, if a container goes down, another container needs to start. Wouldn't it be easier if this behavior was handled by a system?
That's how Kubernetes comes to the rescue! Kubernetes provides you with a framework to run distributed systems resiliently. It takes care of scaling and failover for your application, provides deployment patterns, and more. For example, Kubernetes can easily manage a canary deployment for your system.
Kubernetes provides you with:
- Service discovery and load balancing Kubernetes can expose a container using the DNS name or using their own IP address. If traffic to a container is high, Kubernetes is able to load balance and distribute the network traffic so that the deployment is stable.
- Storage orchestration Kubernetes allows you to automatically mount a storage system of your choice, such as local storages, public cloud providers, and more.
- Automated rollouts and rollbacks You can describe the desired state for your deployed containers using Kubernetes, and it can change the actual state to the desired state at a controlled rate. For example, you can automate Kubernetes to create new containers for your deployment, remove existing containers and adopt all their resources to the new container.
- Automatic bin packing You provide Kubernetes with a cluster of nodes that it can use to run containerized tasks. You tell Kubernetes how much CPU and memory (RAM) each container needs. Kubernetes can fit containers onto your nodes to make the best use of your resources.
- Self-healing Kubernetes restarts containers that fail, replaces containers, kills containers that don't respond to your user-defined health check, and doesn't advertise them to clients until they are ready to serve.
- Secret and configuration management Kubernetes lets you store and manage sensitive information, such as passwords, OAuth tokens, and SSH keys. You can deploy and update secrets and application configuration without rebuilding your container images, and without exposing secrets in your stack configuration.
What Kubernetes is not
Kubernetes is not a traditional, all-inclusive PaaS (Platform as a Service) system. Since Kubernetes operates at the container level rather than at the hardware level, it provides some generally applicable features common to PaaS offerings, such as deployment, scaling, load balancing, and lets users integrate their logging, monitoring, and alerting solutions. However, Kubernetes is not monolithic, and these default solutions are optional and pluggable. Kubernetes provides the building blocks for building developer platforms, but preserves user choice and flexibility where it is important.
Kubernetes:
- Does not limit the types of applications supported. Kubernetes aims to support an extremely diverse variety of workloads, including stateless, stateful, and data-processing workloads. If an application can run in a container, it should run great on Kubernetes.
- Does not deploy source code and does not build your application. Continuous Integration, Delivery, and Deployment (CI/CD) workflows are determined by organization cultures and preferences as well as technical requirements.
- Does not provide application-level services, such as middleware (for example, message buses), data-processing frameworks (for example, Spark), databases (for example, MySQL), caches, nor cluster storage systems (for example, Ceph) as built-in services. Such components can run on Kubernetes, and/or can be accessed by applications running on Kubernetes through portable mechanisms, such as the Open Service Broker.
- Does not dictate logging, monitoring, or alerting solutions. It provides some integrations as proof of concept, and mechanisms to collect and export metrics.
- Does not provide nor mandate a configuration language/system (for example, Jsonnet). It provides a declarative API that may be targeted by arbitrary forms of declarative specifications.
- Does not provide nor adopt any comprehensive machine configuration, maintenance, management, or self-healing systems.
- Additionally, Kubernetes is not a mere orchestration system. In fact, it eliminates the need for orchestration. The technical definition of orchestration is execution of a defined workflow: first do A, then B, then C. In contrast, Kubernetes comprises a set of independent, composable control processes that continuously drive the current state towards the provided desired state. It shouldn't matter how you get from A to C. Centralized control is also not required. This results in a system that is easier to use and more powerful, robust, resilient, and extensible.
What's next
- Take a look at the Kubernetes Components
- Ready to Get Started?
3.1.2 - Kubernetes Components
When you deploy Kubernetes, you get a cluster.
A Kubernetes cluster consists of a set of worker machines, called nodes, that run containerized applications. Every cluster has at least one worker node.
The worker node(s) host the Pods that are the components of the application workload. The control plane manages the worker nodes and the Pods in the cluster. In production environments, the control plane usually runs across multiple computers and a cluster usually runs multiple nodes, providing fault-tolerance and high availability.
This document outlines the various components you need to have for a complete and working Kubernetes cluster.
Control Plane Components
The control plane's components make global decisions about the cluster (for example, scheduling), as well as detecting and responding to cluster events (for example, starting up a new pod when a deployment's replicas field is unsatisfied).
Control plane components can be run on any machine in the cluster. However, for simplicity, setup scripts typically start all control plane components on the same machine, and do not run user containers on this machine. See Creating Highly Available clusters with kubeadm for an example control plane setup that runs across multiple machines.
kube-apiserver
The API server is a component of the Kubernetes control plane that exposes the Kubernetes API. The API server is the front end for the Kubernetes control plane.
The main implementation of a Kubernetes API server is kube-apiserver. kube-apiserver is designed to scale horizontally—that is, it scales by deploying more instances. You can run several instances of kube-apiserver and balance traffic between those instances.
etcd
Consistent and highly-available key value store used as Kubernetes' backing store for all cluster data.
If your Kubernetes cluster uses etcd as its backing store, make sure you have a backup plan for that data.
You can find in-depth information about etcd in the official documentation.
kube-scheduler
Control plane component that watches for newly created Pods with no assigned node, and selects a node for them to run on.
Factors taken into account for scheduling decisions include: individual and collective resource requirements, hardware/software/policy constraints, affinity and anti-affinity specifications, data locality, inter-workload interference, and deadlines.
kube-controller-manager
Control plane component that runs controller processes.
Logically, each controller is a separate process, but to reduce complexity, they are all compiled into a single binary and run in a single process.
Some types of these controllers are:
- Node controller: Responsible for noticing and responding when nodes go down.
- Job controller: Watches for Job objects that represent one-off tasks, then creates Pods to run those tasks to completion.
- Endpoints controller: Populates the Endpoints object (that is, joins Services & Pods).
- Service Account & Token controllers: Create default accounts and API access tokens for new namespaces.
cloud-controller-manager
A Kubernetes control plane component that embeds cloud-specific control logic. The cloud controller manager lets you link your cluster into your cloud provider's API, and separates out the components that interact with that cloud platform from components that only interact with your cluster. The cloud-controller-manager only runs controllers that are specific to your cloud provider. If you are running Kubernetes on your own premises, or in a learning environment inside your own PC, the cluster does not have a cloud controller manager.
As with the kube-controller-manager, the cloud-controller-manager combines several logically independent control loops into a single binary that you run as a single process. You can scale horizontally (run more than one copy) to improve performance or to help tolerate failures.
The following controllers can have cloud provider dependencies:
- Node controller: For checking the cloud provider to determine if a node has been deleted in the cloud after it stops responding
- Route controller: For setting up routes in the underlying cloud infrastructure
- Service controller: For creating, updating and deleting cloud provider load balancers
Node Components
Node components run on every node, maintaining running pods and providing the Kubernetes runtime environment.
kubelet
An agent that runs on each node in the cluster. It makes sure that containers are running in a Pod.
The kubelet takes a set of PodSpecs that are provided through various mechanisms and ensures that the containers described in those PodSpecs are running and healthy. The kubelet doesn't manage containers which were not created by Kubernetes.
kube-proxy
kube-proxy is a network proxy that runs on each node in your cluster, implementing part of the Kubernetes Service concept.
kube-proxy maintains network rules on nodes. These network rules allow network communication to your Pods from network sessions inside or outside of your cluster.
kube-proxy uses the operating system packet filtering layer if there is one and it's available. Otherwise, kube-proxy forwards the traffic itself.
Container runtime
The container runtime is the software that is responsible for running containers.
Kubernetes supports container runtimes such as containerd, CRI-O, and any other implementation of the Kubernetes CRI (Container Runtime Interface).
Addons
Addons use Kubernetes resources (DaemonSet, Deployment, etc) to implement cluster features. Because these are providing cluster-level features, namespaced resources for addons belong within the kube-system namespace.
Selected addons are described below; for an extended list of available addons, please see Addons.
DNS
While the other addons are not strictly required, all Kubernetes clusters should have cluster DNS, as many examples rely on it.
Cluster DNS is a DNS server, in addition to the other DNS server(s) in your environment, which serves DNS records for Kubernetes services.
Containers started by Kubernetes automatically include this DNS server in their DNS searches.
Web UI (Dashboard)
Dashboard is a general purpose, web-based UI for Kubernetes clusters. It allows users to manage and troubleshoot applications running in the cluster, as well as the cluster itself.
Container Resource Monitoring
Container Resource Monitoring records generic time-series metrics about containers in a central database, and provides a UI for browsing that data.
Cluster-level Logging
A cluster-level logging mechanism is responsible for saving container logs to a central log store with search/browsing interface.
What's next
- Learn about Nodes
- Learn about Controllers
- Learn about kube-scheduler
- Read etcd's official documentation
3.1.3 - The Kubernetes API
The core of Kubernetes' control plane is the API server. The API server exposes an HTTP API that lets end users, different parts of your cluster, and external components communicate with one another.
The Kubernetes API lets you query and manipulate the state of API objects in Kubernetes (for example: Pods, Namespaces, ConfigMaps, and Events).
Most operations can be performed through the kubectl command-line interface or other command-line tools, such as kubeadm, which in turn use the API. However, you can also access the API directly using REST calls.
Consider using one of the client libraries if you are writing an application using the Kubernetes API.
OpenAPI specification
Complete API details are documented using OpenAPI.
OpenAPI V2
The Kubernetes API server serves an aggregated OpenAPI v2 spec via the /openapi/v2 endpoint. You can request the response format using request headers as follows:
Header | Possible values | Notes |
---|---|---|
Accept-Encoding | gzip | not supplying this header is also acceptable |
Accept | application/com.github.proto-openapi.spec.v2@v1.0+protobuf | mainly for intra-cluster use |
Accept | application/json | default |
Accept | * | serves application/json |
Kubernetes implements an alternative Protobuf based serialization format that is primarily intended for intra-cluster communication. For more information about this format, see the Kubernetes Protobuf serialization design proposal and the Interface Definition Language (IDL) files for each schema located in the Go packages that define the API objects.
OpenAPI V3
Kubernetes v1.23 [alpha]
Kubernetes v1.23 offers initial support for publishing its APIs as OpenAPI v3; this is an alpha feature that is disabled by default. You can enable the alpha feature by turning on the feature gate named OpenAPIV3 for the kube-apiserver component.
With the feature enabled, the Kubernetes API server serves an aggregated OpenAPI v3 spec per Kubernetes group version at the /openapi/v3/apis/<group>/<version> endpoint. Please refer to the table below for accepted request headers.
Header | Possible values | Notes |
---|---|---|
Accept-Encoding | gzip | not supplying this header is also acceptable |
Accept | application/com.github.proto-openapi.spec.v3@v1.0+protobuf | mainly for intra-cluster use |
Accept | application/json | default |
Accept | * | serves application/json |
A discovery endpoint /openapi/v3 is provided to see a list of all group/versions available. This endpoint only returns JSON.
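For example, assuming the feature gate is enabled, you could inspect the discovery endpoint with a raw API request:
kubectl get --raw /openapi/v3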
Persistence
Kubernetes stores the serialized state of objects by writing them into etcd.
API groups and versioning
To make it easier to eliminate fields or restructure resource representations, Kubernetes supports multiple API versions, each at a different API path, such as /api/v1 or /apis/rbac.authorization.k8s.io/v1alpha1.
Versioning is done at the API level rather than at the resource or field level to ensure that the API presents a clear, consistent view of system resources and behavior, and to enable controlling access to end-of-life and/or experimental APIs.
To make it easier to evolve and to extend its API, Kubernetes implements API groups that can be enabled or disabled.
API resources are distinguished by their API group, resource type, namespace (for namespaced resources), and name. The API server handles the conversion between API versions transparently: all the different versions are actually representations of the same persisted data. The API server may serve the same underlying data through multiple API versions.
For example, suppose there are two API versions, v1 and v1beta1, for the same resource. If you originally created an object using the v1beta1 version of its API, you can later read, update, or delete that object using either the v1beta1 or the v1 API version.
API changes
Any system that is successful needs to grow and change as new use cases emerge or existing ones change. Therefore, Kubernetes has designed the Kubernetes API to continuously change and grow. The Kubernetes project aims to not break compatibility with existing clients, and to maintain that compatibility for a length of time so that other projects have an opportunity to adapt.
In general, new API resources and new resource fields can be added often and frequently. Elimination of resources or fields requires following the API deprecation policy.
Kubernetes makes a strong commitment to maintain compatibility for official Kubernetes APIs once they reach general availability (GA), typically at API version v1. Additionally, Kubernetes keeps compatibility even for beta API versions wherever feasible: if you adopt a beta API you can continue to interact with your cluster using that API, even after the feature goes stable.
Refer to API versions reference for more details on the API version level definitions.
API Extension
The Kubernetes API can be extended in one of two ways:
- Custom resources let you declaratively define how the API server should provide your chosen resource API.
- You can also extend the Kubernetes API by implementing an aggregation layer.
What's next
- Learn how to extend the Kubernetes API by adding your own CustomResourceDefinition.
- Controlling Access To The Kubernetes API describes how the cluster manages authentication and authorization for API access.
- Learn about API endpoints, resource types and samples by reading API Reference.
- Learn about what constitutes a compatible change, and how to change the API, from API changes.
3.1.4 - Working with Kubernetes Objects
3.1.4.1 - Understanding Kubernetes Objects
This page explains how Kubernetes objects are represented in the Kubernetes API, and how you can express them in .yaml format.
Understanding Kubernetes objects
Kubernetes objects are persistent entities in the Kubernetes system. Kubernetes uses these entities to represent the state of your cluster. Specifically, they can describe:
- What containerized applications are running (and on which nodes)
- The resources available to those applications
- The policies around how those applications behave, such as restart policies, upgrades, and fault-tolerance
A Kubernetes object is a "record of intent"--once you create the object, the Kubernetes system will constantly work to ensure that object exists. By creating an object, you're effectively telling the Kubernetes system what you want your cluster's workload to look like; this is your cluster's desired state.
To work with Kubernetes objects--whether to create, modify, or delete them--you'll need to use the Kubernetes API. When you use the kubectl command-line interface, for example, the CLI makes the necessary Kubernetes API calls for you. You can also use the Kubernetes API directly in your own programs using one of the Client Libraries.
Object Spec and Status
Almost every Kubernetes object includes two nested object fields that govern the object's configuration: the object spec and the object status. For objects that have a spec, you have to set this when you create the object, providing a description of the characteristics you want the resource to have: its desired state.
The status describes the current state of the object, supplied and updated by the Kubernetes system and its components. The Kubernetes control plane continually and actively manages every object's actual state to match the desired state you supplied.
For example: in Kubernetes, a Deployment is an object that can represent an application running on your cluster. When you create the Deployment, you might set the Deployment spec to specify that you want three replicas of the application to be running. The Kubernetes system reads the Deployment spec and starts three instances of your desired application--updating the status to match your spec. If any of those instances should fail (a status change), the Kubernetes system responds to the difference between spec and status by making a correction--in this case, starting a replacement instance.
For more information on the object spec, status, and metadata, see the Kubernetes API Conventions.
Describing a Kubernetes object
When you create an object in Kubernetes, you must provide the object spec that describes its desired state, as well as some basic information about the object (such as a name). When you use the Kubernetes API to create the object (either directly or via kubectl), that API request must include that information as JSON in the request body. Most often, you provide the information to kubectl in a .yaml file. kubectl converts the information to JSON when making the API request.
Here's an example .yaml file that shows the required fields and object spec for a Kubernetes Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
name: nginx-deployment
spec:
selector:
matchLabels:
app: nginx
replicas: 2 # tells deployment to run 2 pods matching the template
template:
metadata:
labels:
app: nginx
spec:
containers:
- name: nginx
image: nginx:1.14.2
ports:
- containerPort: 80
One way to create a Deployment using a .yaml file like the one above is to use the kubectl apply command in the kubectl command-line interface, passing the .yaml file as an argument. Here's an example:
kubectl apply -f https://k8s.io/examples/application/deployment.yaml
The output is similar to this:
deployment.apps/nginx-deployment created
Required Fields
In the .yaml file for the Kubernetes object you want to create, you'll need to set values for the following fields:
- apiVersion - Which version of the Kubernetes API you're using to create this object
- kind - What kind of object you want to create
- metadata - Data that helps uniquely identify the object, including a name string, UID, and optional namespace
- spec - What state you desire for the object
The precise format of the object spec is different for every Kubernetes object, and contains nested fields specific to that object. The Kubernetes API Reference can help you find the spec format for all of the objects you can create using Kubernetes.
For example, see the spec field for the Pod API reference. For each Pod, the .spec field specifies the pod and its desired state (such as the container image name for each container within that pod).
Another example of an object specification is the spec field for the StatefulSet API. For StatefulSet, the .spec field specifies the StatefulSet and its desired state. Within the .spec of a StatefulSet is a template for Pod objects. That template describes Pods that the StatefulSet controller will create in order to satisfy the StatefulSet specification.
Different kinds of object can also have different .status; again, the API reference pages detail the structure of that .status field, and its content for each different type of object.
What's next
- Learn about the most important basic Kubernetes objects, such as Pod.
- Learn about controllers in Kubernetes.
- Using the Kubernetes API explains some more API concepts.
3.1.4.2 - Kubernetes Object Management
The kubectl command-line tool supports several different ways to create and manage Kubernetes objects. This document provides an overview of the different approaches. Read the Kubectl book for details of managing objects by Kubectl.
Management techniques
Management technique | Operates on | Recommended environment | Supported writers | Learning curve |
---|---|---|---|---|
Imperative commands | Live objects | Development projects | 1+ | Lowest |
Imperative object configuration | Individual files | Production projects | 1 | Moderate |
Declarative object configuration | Directories of files | Production projects | 1+ | Highest |
Imperative commands
When using imperative commands, a user operates directly on live objects in a cluster. The user provides operations to the kubectl command as arguments or flags.
This is the recommended way to get started or to run a one-off task in a cluster. Because this technique operates directly on live objects, it provides no history of previous configurations.
Examples
Run an instance of the nginx container by creating a Deployment object:
kubectl create deployment nginx --image nginx
Trade-offs
Advantages compared to object configuration:
- Commands are expressed as a single action word.
- Commands require only a single step to make changes to the cluster.
Disadvantages compared to object configuration:
- Commands do not integrate with change review processes.
- Commands do not provide an audit trail associated with changes.
- Commands do not provide a source of records except for what is live.
- Commands do not provide a template for creating new objects.
Imperative object configuration
In imperative object configuration, the kubectl command specifies the operation (create, replace, etc.), optional flags and at least one file name. The file specified must contain a full definition of the object in YAML or JSON format.
See the API reference for more details on object definitions.
Warning: The replace command replaces the existing spec with the newly provided one, dropping all changes to the object missing from the configuration file. This approach should not be used with resource types whose specs are updated independently of the configuration file. Services of type LoadBalancer, for example, have their externalIPs field updated independently from the configuration by the cluster.
Examples
Create the objects defined in a configuration file:
kubectl create -f nginx.yaml
Delete the objects defined in two configuration files:
kubectl delete -f nginx.yaml -f redis.yaml
Update the objects defined in a configuration file by overwriting the live configuration:
kubectl replace -f nginx.yaml
Trade-offs
Advantages compared to imperative commands:
- Object configuration can be stored in a source control system such as Git.
- Object configuration can integrate with processes such as reviewing changes before push and audit trails.
- Object configuration provides a template for creating new objects.
Disadvantages compared to imperative commands:
- Object configuration requires basic understanding of the object schema.
- Object configuration requires the additional step of writing a YAML file.
Advantages compared to declarative object configuration:
- Imperative object configuration behavior is simpler and easier to understand.
- As of Kubernetes version 1.5, imperative object configuration is more mature.
Disadvantages compared to declarative object configuration:
- Imperative object configuration works best on files, not directories.
- Updates to live objects must be reflected in configuration files, or they will be lost during the next replacement.
Declarative object configuration
When using declarative object configuration, a user operates on object configuration files stored locally; however, the user does not define the operations to be taken on the files. Create, update, and delete operations are automatically detected per-object by kubectl. This enables working on directories, where different operations might be needed for different objects.
Note: kubectl achieves this by using the patch API operation to write only observed differences, instead of using the replace API operation to replace the entire object configuration.
Examples
Process all object configuration files in the configs directory, and create or patch the live objects. You can first diff to see what changes are going to be made, and then apply:
kubectl diff -f configs/
kubectl apply -f configs/
Recursively process directories:
kubectl diff -R -f configs/
kubectl apply -R -f configs/
Trade-offs
Advantages compared to imperative object configuration:
- Changes made directly to live objects are retained, even if they are not merged back into the configuration files.
- Declarative object configuration has better support for operating on directories and automatically detecting operation types (create, patch, delete) per-object.
Disadvantages compared to imperative object configuration:
- Declarative object configuration is harder to debug and understand results when they are unexpected.
- Partial updates using diffs create complex merge and patch operations.
What's next
- Managing Kubernetes Objects Using Imperative Commands
- Managing Kubernetes Objects Using Object Configuration (Imperative)
- Managing Kubernetes Objects Using Object Configuration (Declarative)
- Managing Kubernetes Objects Using Kustomize (Declarative)
- Kubectl Command Reference
- Kubectl Book
- Kubernetes API Reference
3.1.4.3 - Object Names and IDs
Each object in your cluster has a Name that is unique for that type of resource. Every Kubernetes object also has a UID that is unique across your whole cluster.
For example, you can only have one Pod named myapp-1234 within the same namespace, but you can have one Pod and one Deployment that are each named myapp-1234.
For non-unique user-provided attributes, Kubernetes provides labels and annotations.
Names
A client-provided string that refers to an object in a resource URL, such as /api/v1/pods/some-name.
Only one object of a given kind can have a given name at a time. However, if you delete the object, you can make a new object with the same name.
Below are four types of commonly used name constraints for resources.
DNS Subdomain Names
Most resource types require a name that can be used as a DNS subdomain name as defined in RFC 1123. This means the name must:
- contain no more than 253 characters
- contain only lowercase alphanumeric characters, '-' or '.'
- start with an alphanumeric character
- end with an alphanumeric character
RFC 1123 Label Names
Some resource types require their names to follow the DNS label standard as defined in RFC 1123. This means the name must:
- contain at most 63 characters
- contain only lowercase alphanumeric characters or '-'
- start with an alphanumeric character
- end with an alphanumeric character
RFC 1035 Label Names
Some resource types require their names to follow the DNS label standard as defined in RFC 1035. This means the name must:
- contain at most 63 characters
- contain only lowercase alphanumeric characters or '-'
- start with an alphabetic character
- end with an alphanumeric character
Path Segment Names
Some resource types require their names to be able to be safely encoded as a path segment. In other words, the name may not be "." or ".." and the name may not contain "/" or "%".
Here's an example manifest for a Pod named nginx-demo.
apiVersion: v1
kind: Pod
metadata:
name: nginx-demo
spec:
containers:
- name: nginx
image: nginx:1.14.2
ports:
- containerPort: 80
UIDs
A Kubernetes system-generated string to uniquely identify objects.
Every object created over the whole lifetime of a Kubernetes cluster has a distinct UID. It is intended to distinguish between historical occurrences of similar entities.
Kubernetes UIDs are universally unique identifiers (also known as UUIDs). UUIDs are standardized as ISO/IEC 9834-8 and as ITU-T X.667.
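For example, you can read an object's UID with a jsonpath query (using the nginx-demo Pod from the earlier manifest):
kubectl get pod nginx-demo -o jsonpath='{.metadata.uid}'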
What's next
- Read about labels in Kubernetes.
- See the Identifiers and Names in Kubernetes design document.
3.1.4.4 - Namespaces
In Kubernetes, namespaces provide a mechanism for isolating groups of resources within a single cluster. Names of resources need to be unique within a namespace, but not across namespaces. Namespace-based scoping is applicable only for namespaced objects (e.g. Deployments, Services, etc) and not for cluster-wide objects (e.g. StorageClass, Nodes, PersistentVolumes, etc).
When to Use Multiple Namespaces
Namespaces are intended for use in environments with many users spread across multiple teams, or projects. For clusters with a few to tens of users, you should not need to create or think about namespaces at all. Start using namespaces when you need the features they provide.
Namespaces provide a scope for names. Names of resources need to be unique within a namespace, but not across namespaces. Namespaces cannot be nested inside one another and each Kubernetes resource can only be in one namespace.
Namespaces are a way to divide cluster resources between multiple users (via resource quota).
It is not necessary to use multiple namespaces to separate slightly different resources, such as different versions of the same software: use labels to distinguish resources within the same namespace.
Working with Namespaces
Creation and deletion of namespaces are described in the Admin Guide documentation for namespaces.
Note: Avoid creating namespaces with the prefix kube-, since it is reserved for Kubernetes system namespaces.
Viewing namespaces
You can list the current namespaces in a cluster using:
kubectl get namespace
NAME STATUS AGE
default Active 1d
kube-node-lease Active 1d
kube-public Active 1d
kube-system Active 1d
Kubernetes starts with four initial namespaces:
- default: The default namespace for objects with no other namespace
- kube-system: The namespace for objects created by the Kubernetes system
- kube-public: This namespace is created automatically and is readable by all users (including those not authenticated). This namespace is mostly reserved for cluster usage, in case some resources should be visible and readable publicly throughout the whole cluster. The public aspect of this namespace is only a convention, not a requirement.
- kube-node-lease: This namespace holds Lease objects associated with each node. Node leases allow the kubelet to send heartbeats so that the control plane can detect node failure.
Setting the namespace for a request
To set the namespace for a current request, use the --namespace flag.
For example:
kubectl run nginx --image=nginx --namespace=<insert-namespace-name-here>
kubectl get pods --namespace=<insert-namespace-name-here>
Setting the namespace preference
You can permanently save the namespace for all subsequent kubectl commands in that context.
kubectl config set-context --current --namespace=<insert-namespace-name-here>
# Validate it
kubectl config view --minify | grep namespace:
Namespaces and DNS
When you create a Service, it creates a corresponding DNS entry. This entry is of the form <service-name>.<namespace-name>.svc.cluster.local, which means that if a container only uses <service-name>, it will resolve to the service which is local to a namespace. This is useful for using the same configuration across multiple namespaces such as Development, Staging and Production. If you want to reach across namespaces, you need to use the fully qualified domain name (FQDN).
As a result, all namespace names must be valid RFC 1123 DNS labels.
By creating namespaces with the same name as public top-level domains, Services in these namespaces can have short DNS names that overlap with public DNS records. Workloads from any namespace performing a DNS lookup without a trailing dot will be redirected to those services, taking precedence over public DNS.
To mitigate this, limit privileges for creating namespaces to trusted users. If required, you could additionally configure third-party security controls, such as admission webhooks, to block creating any namespace with the name of public TLDs.
Not All Objects are in a Namespace
Most Kubernetes resources (e.g. pods, services, replication controllers, and others) are in some namespace. However, namespace resources are not themselves in a namespace. And low-level resources, such as nodes and persistentVolumes, are not in any namespace.
To see which Kubernetes resources are and aren't in a namespace:
# In a namespace
kubectl api-resources --namespaced=true
# Not in a namespace
kubectl api-resources --namespaced=false
Automatic labelling
Kubernetes 1.21 [beta]
The Kubernetes control plane sets an immutable label kubernetes.io/metadata.name on all namespaces, provided that the NamespaceDefaultLabelName feature gate is enabled. The value of the label is the namespace name.
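Provided the feature gate is enabled, this makes it possible to select a namespace by name using a label selector; for example:
kubectl get namespaces -l kubernetes.io/metadata.name=default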
What's next
- Learn more about creating a new namespace.
- Learn more about deleting a namespace.
3.1.4.5 - Labels and Selectors
Labels are key/value pairs that are attached to objects, such as pods. Labels are intended to be used to specify identifying attributes of objects that are meaningful and relevant to users, but do not directly imply semantics to the core system. Labels can be used to organize and to select subsets of objects. Labels can be attached to objects at creation time and subsequently added and modified at any time. Each object can have a set of key/value labels defined. Each key must be unique for a given object.
"metadata": {
"labels": {
"key1" : "value1",
"key2" : "value2"
}
}
Labels allow for efficient queries and watches and are ideal for use in UIs and CLIs. Non-identifying information should be recorded using annotations.
Motivation
Labels enable users to map their own organizational structures onto system objects in a loosely coupled fashion, without requiring clients to store these mappings.
Service deployments and batch processing pipelines are often multi-dimensional entities (e.g., multiple partitions or deployments, multiple release tracks, multiple tiers, multiple micro-services per tier). Management often requires cross-cutting operations, which breaks encapsulation of strictly hierarchical representations, especially rigid hierarchies determined by the infrastructure rather than by users.
Example labels:
- "release" : "stable", "release" : "canary"
- "environment" : "dev", "environment" : "qa", "environment" : "production"
- "tier" : "frontend", "tier" : "backend", "tier" : "cache"
- "partition" : "customerA", "partition" : "customerB"
- "track" : "daily", "track" : "weekly"
These are examples of commonly used labels; you are free to develop your own conventions. Keep in mind that label keys must be unique for a given object.
Syntax and character set
Labels are key/value pairs. Valid label keys have two segments: an optional prefix and name, separated by a slash (/). The name segment is required and must be 63 characters or less, beginning and ending with an alphanumeric character ([a-z0-9A-Z]) with dashes (-), underscores (_), dots (.), and alphanumerics between. The prefix is optional. If specified, the prefix must be a DNS subdomain: a series of DNS labels separated by dots (.), not longer than 253 characters in total, followed by a slash (/).
If the prefix is omitted, the label key is presumed to be private to the user. Automated system components (e.g. kube-scheduler, kube-controller-manager, kube-apiserver, kubectl, or other third-party automation) which add labels to end-user objects must specify a prefix.
The kubernetes.io/ and k8s.io/ prefixes are reserved for Kubernetes core components.
Valid label values:
- must be 63 characters or less (can be empty),
- unless empty, must begin and end with an alphanumeric character ([a-z0-9A-Z]),
- could contain dashes (-), underscores (_), dots (.), and alphanumerics between.
For example, here's the configuration file for a Pod that has two labels environment: production and app: nginx:
apiVersion: v1
kind: Pod
metadata:
name: label-demo
labels:
environment: production
app: nginx
spec:
containers:
- name: nginx
image: nginx:1.14.2
ports:
- containerPort: 80
Label selectors
Unlike names and UIDs, labels do not provide uniqueness. In general, we expect many objects to carry the same label(s).
Via a label selector, the client/user can identify a set of objects. The label selector is the core grouping primitive in Kubernetes.
The API currently supports two types of selectors: equality-based and set-based.
A label selector can be made of multiple requirements which are comma-separated. In the case of multiple requirements, all must be satisfied so the comma separator acts as a logical AND (&&) operator.
The semantics of empty or non-specified selectors are dependent on the context, and API types that use selectors should document the validity and meaning of them.
For both equality-based and set-based conditions there is no logical OR (||) operator. Ensure your filter statements are structured accordingly.
Equality-based requirement
Equality- or inequality-based requirements allow filtering by label keys and values. Matching objects must satisfy all of the specified label constraints, though they may have additional labels as well.
Three kinds of operators are admitted: =, ==, !=. The first two represent equality (and are synonyms), while the latter represents inequality. For example:
environment = production
tier != frontend
The former selects all resources with key equal to environment and value equal to production. The latter selects all resources with key equal to tier and value distinct from frontend, and all resources with no labels with the tier key. One could filter for resources in production excluding frontend using the comma operator: environment=production,tier!=frontend
One usage scenario for equality-based label requirements is for Pods to specify node selection criteria. For example, the sample Pod below selects nodes with the label accelerator=nvidia-tesla-p100.
apiVersion: v1
kind: Pod
metadata:
name: cuda-test
spec:
containers:
- name: cuda-test
image: "k8s.gcr.io/cuda-vector-add:v0.1"
resources:
limits:
nvidia.com/gpu: 1
nodeSelector:
accelerator: nvidia-tesla-p100
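For the Pod above to schedule, some node has to carry that label; a minimal sketch, with the node name as a placeholder:
# Label a node so the nodeSelector above can match it:
kubectl label nodes <node-name> accelerator=nvidia-tesla-p100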
Set-based requirement
Set-based label requirements allow filtering keys according to a set of values. Three kinds of operators are supported: in, notin and exists (only the key identifier). For example:
environment in (production, qa)
tier notin (frontend, backend)
partition
!partition
- The first example selects all resources with key equal to environment and value equal to production or qa.
- The second example selects all resources with key equal to tier and values other than frontend and backend, and all resources with no labels with the tier key.
- The third example selects all resources including a label with key partition; no values are checked.
- The fourth example selects all resources without a label with key partition; no values are checked.
Similarly the comma separator acts as an AND operator. So filtering resources with a partition key (no matter the value) and with environment different than qa can be achieved using partition,environment notin (qa).
The set-based label selector is a general form of equality since environment=production is equivalent to environment in (production); similarly for != and notin.
Set-based requirements can be mixed with equality-based requirements. For example: partition in (customerA, customerB),environment!=qa.
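As a sketch, the same mixed selector can be passed to kubectl (the label values are the illustrative ones above):
kubectl get pods -l 'partition in (customerA, customerB),environment!=qa'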
API
LIST and WATCH filtering
LIST and WATCH operations may specify label selectors to filter the sets of objects returned using a query parameter. Both requirements are permitted (presented here as they would appear in a URL query string; see the sketch after this list):
- equality-based requirements:
?labelSelector=environment%3Dproduction,tier%3Dfrontend
- set-based requirements:
?labelSelector=environment+in+%28production%2Cqa%29%2Ctier+in+%28frontend%29
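As a concrete sketch, either query string can be sent through a local API proxy; kubectl proxy listens on port 8001 by default, and the selector is the equality-based example above:
# Open a proxy to the API server, then filter pods by label selector:
kubectl proxy &
curl 'http://127.0.0.1:8001/api/v1/pods?labelSelector=environment%3Dproduction,tier%3Dfrontend'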
Both label selector styles can be used to list or watch resources via a REST client. For example, targeting apiserver with kubectl and using equality-based requirements one may write:
kubectl get pods -l environment=production,tier=frontend
or using set-based requirements:
kubectl get pods -l 'environment in (production),tier in (frontend)'
As already mentioned, set-based requirements are more expressive. For instance, they can implement the OR operator on values:
kubectl get pods -l 'environment in (production, qa)'
or restricting negative matching via the exists operator:
kubectl get pods -l 'environment,environment notin (frontend)'
Set references in API objects
Some Kubernetes objects, such as services and replicationcontrollers, also use label selectors to specify sets of other resources, such as pods.
Service and ReplicationController
The set of pods that a service targets is defined with a label selector. Similarly, the population of pods that a replicationcontroller should manage is also defined with a label selector.
Label selectors for both objects are defined in json or yaml files using maps, and only equality-based requirement selectors are supported:
"selector": {
"component" : "redis",
}
or
selector:
component: redis
This selector (respectively in json or yaml format) is equivalent to component=redis or component in (redis).
Resources that support set-based requirements
Newer resources, such as Job, Deployment, ReplicaSet, and DaemonSet, support set-based requirements as well.
selector:
matchLabels:
component: redis
matchExpressions:
- {key: tier, operator: In, values: [cache]}
- {key: environment, operator: NotIn, values: [dev]}
matchLabels is a map of {key,value} pairs. A single {key,value} in the matchLabels map is equivalent to an element of matchExpressions, whose key field is "key", the operator is "In", and the values array contains only "value". matchExpressions is a list of pod selector requirements. Valid operators include In, NotIn, Exists, and DoesNotExist. The values set must be non-empty in the case of In and NotIn. All of the requirements, from both matchLabels and matchExpressions, are ANDed together -- they must all be satisfied in order to match.
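To make the stated equivalence concrete, here is a minimal sketch of the same selector written both ways, reusing component: redis from the example above:
# These two selectors match the same Pods:
selector:
  matchLabels:
    component: redis
# ...is equivalent to:
selector:
  matchExpressions:
    - {key: component, operator: In, values: [redis]}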
Selecting sets of nodes
One use case for selecting over labels is to constrain the set of nodes onto which a pod can schedule. See the documentation on node selection for more information.
3.1.4.6 - Annotations
You can use Kubernetes annotations to attach arbitrary non-identifying metadata to objects. Clients such as tools and libraries can retrieve this metadata.
Attaching metadata to objects
You can use either labels or annotations to attach metadata to Kubernetes objects. Labels can be used to select objects and to find collections of objects that satisfy certain conditions. In contrast, annotations are not used to identify and select objects. The metadata in an annotation can be small or large, structured or unstructured, and can include characters not permitted by labels.
Annotations, like labels, are key/value maps:
"metadata": {
"annotations": {
"key1" : "value1",
"key2" : "value2"
}
}
Here are some examples of information that could be recorded in annotations:
- Fields managed by a declarative configuration layer. Attaching these fields as annotations distinguishes them from default values set by clients or servers, and from auto-generated fields and fields set by auto-sizing or auto-scaling systems.
- Build, release, or image information like timestamps, release IDs, git branch, PR numbers, image hashes, and registry address.
- Pointers to logging, monitoring, analytics, or audit repositories.
- Client library or tool information that can be used for debugging purposes: for example, name, version, and build information.
- User or tool/system provenance information, such as URLs of related objects from other ecosystem components.
- Lightweight rollout tool metadata: for example, config or checkpoints.
- Phone or pager numbers of persons responsible, or directory entries that specify where that information can be found, such as a team web site.
- Directives from the end-user to the implementations to modify behavior or engage non-standard features.
Instead of using annotations, you could store this type of information in an external database or directory, but that would make it much harder to produce shared client libraries and tools for deployment, management, introspection, and the like.
Syntax and character set
Annotations are key/value pairs. Valid annotation keys have two segments: an optional prefix and name, separated by a slash (/). The name segment is required and must be 63 characters or less, beginning and ending with an alphanumeric character ([a-z0-9A-Z]) with dashes (-), underscores (_), dots (.), and alphanumerics between. The prefix is optional. If specified, the prefix must be a DNS subdomain: a series of DNS labels separated by dots (.), not longer than 253 characters in total, followed by a slash (/).
If the prefix is omitted, the annotation key is presumed to be private to the user. Automated system components (e.g. kube-scheduler, kube-controller-manager, kube-apiserver, kubectl, or other third-party automation) which add annotations to end-user objects must specify a prefix.
The kubernetes.io/ and k8s.io/ prefixes are reserved for Kubernetes core components.
For example, here's the configuration file for a Pod that has the annotation imageregistry: https://hub.docker.com/:
apiVersion: v1
kind: Pod
metadata:
name: annotations-demo
annotations:
imageregistry: "https://hub.docker.com/"
spec:
containers:
- name: nginx
image: nginx:1.14.2
ports:
- containerPort: 80
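Annotations can also be added to live objects; a minimal sketch using the Pod above, where the annotation key and value are illustrative:
# Attach an annotation to the running Pod:
kubectl annotate pod annotations-demo description='illustrative value'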
What's next
Learn more about Labels and Selectors.
3.1.4.7 - Field Selectors
Field selectors let you select Kubernetes resources based on the value of one or more resource fields. Here are some examples of field selector queries:
metadata.name=my-service
metadata.namespace!=default
status.phase=Pending
This kubectl command selects all Pods for which the value of the status.phase field is Running:
kubectl get pods --field-selector status.phase=Running
Field selectors are essentially resource filters. By default, no selectors/filters are applied, meaning that all resources of the specified type are selected. This makes the kubectl queries kubectl get pods and kubectl get pods --field-selector "" equivalent.
Supported fields
Supported field selectors vary by Kubernetes resource type. All resource types support the metadata.name and metadata.namespace fields. Using unsupported field selectors produces an error. For example:
kubectl get ingress --field-selector foo.bar=baz
Error from server (BadRequest): Unable to find "ingresses" that match label selector "", field selector "foo.bar=baz": "foo.bar" is not a known field selector: only "metadata.name", "metadata.namespace"
Supported operators
You can use the =, ==, and != operators with field selectors (= and == mean the same thing). This kubectl command, for example, selects all Kubernetes Services that aren't in the default namespace:
kubectl get services --all-namespaces --field-selector metadata.namespace!=default
Chained selectors
As with label and other selectors, field selectors can be chained together as a comma-separated list. This kubectl command selects all Pods for which the status.phase does not equal Running and the spec.restartPolicy field equals Always:
kubectl get pods --field-selector=status.phase!=Running,spec.restartPolicy=Always
Multiple resource types
You can use field selectors across multiple resource types. This kubectl command selects all StatefulSets and Services that are not in the default namespace:
kubectl get statefulsets,services --all-namespaces --field-selector metadata.namespace!=default
3.1.4.8 - Finalizers
Finalizers are namespaced keys that tell Kubernetes to wait until specific conditions are met before it fully deletes resources marked for deletion. Finalizers alert controllers to clean up resources the deleted object owned.
When you tell Kubernetes to delete an object that has finalizers specified for it, the Kubernetes API marks the object for deletion by populating .metadata.deletionTimestamp, and returns a 202 status code (HTTP "Accepted"). The target object remains in a terminating state while the control plane, or other components, take the actions defined by the finalizers. After these actions are complete, the controller removes the relevant finalizers from the target object. When the metadata.finalizers field is empty, Kubernetes considers the deletion complete and deletes the object.
You can use finalizers to control garbage collection of resources by alerting controllers to perform specific cleanup tasks before deleting the target resource. For example, you can define a finalizer to clean up related resources or infrastructure before the controller deletes the target resource.
Finalizers don't usually specify the code to execute. Instead, they are typically lists of keys on a specific resource similar to annotations. Kubernetes specifies some finalizers automatically, but you can also specify your own.
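For illustration, a custom finalizer is just a key in metadata.finalizers; in this hypothetical sketch the object name and the finalizer key example.com/my-finalizer are made up:
apiVersion: v1
kind: ConfigMap
metadata:
  name: mymap                       # hypothetical object name
  finalizers:
    - example.com/my-finalizer      # hypothetical custom finalizer key
data: {}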
How finalizers work
When you create a resource using a manifest file, you can specify finalizers in the metadata.finalizers field. When you attempt to delete the resource, the API server handling the delete request notices the values in the finalizers field and does the following:
- Modifies the object to add a metadata.deletionTimestamp field with the time you started the deletion.
- Prevents the object from being removed until its metadata.finalizers field is empty.
- Returns a 202 status code (HTTP "Accepted").
The controller managing that finalizer notices the update to the object setting the metadata.deletionTimestamp, indicating deletion of the object has been requested. The controller then attempts to satisfy the requirements of the finalizers specified for that resource. Each time a finalizer condition is satisfied, the controller removes that key from the resource's finalizers field. When the finalizers field is emptied, an object with a deletionTimestamp field set is automatically deleted. You can also use finalizers to prevent deletion of unmanaged resources.
A common example of a finalizer is kubernetes.io/pv-protection, which prevents accidental deletion of PersistentVolume objects. When a PersistentVolume object is in use by a Pod, Kubernetes adds the pv-protection finalizer. If you try to delete the PersistentVolume, it enters a Terminating status, but the controller can't delete it because the finalizer exists. When the Pod stops using the PersistentVolume, Kubernetes clears the pv-protection finalizer, and the controller deletes the volume.
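You can observe this state directly; a minimal sketch, with the PersistentVolume name as a placeholder:
# Show the finalizers currently set on a PersistentVolume:
kubectl get persistentvolume <pv-name> -o jsonpath='{.metadata.finalizers}'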
Owner references, labels, and finalizers
Like labels, owner references describe the relationships between objects in Kubernetes, but are used for a different purpose. When a controller manages objects like Pods, it uses labels to track changes to groups of related objects. For example, when a Job creates one or more Pods, the Job controller applies labels to those pods and tracks changes to any Pods in the cluster with the same label.
The Job controller also adds owner references to those Pods, pointing at the Job that created the Pods. If you delete the Job while these Pods are running, Kubernetes uses the owner references (not labels) to determine which Pods in the cluster need cleanup.
Kubernetes also processes finalizers when it identifies owner references on a resource targeted for deletion.
In some situations, finalizers can block the deletion of dependent objects, which can cause the targeted owner object to remain for longer than expected without being fully deleted. In these situations, you should check finalizers and owner references on the target owner and dependent objects to troubleshoot the cause.
What's next
- Read Using Finalizers to Control Deletion on the Kubernetes blog.
3.1.4.9 - Owners and Dependents
In Kubernetes, some objects are owners of other objects. For example, a ReplicaSet is the owner of a set of Pods. These owned objects are dependents of their owner.
Ownership is different from the labels and selectors mechanism that some resources also use. For example, consider a Service that creates EndpointSlice objects. The Service uses labels to allow the control plane to determine which EndpointSlice objects are used for that Service. In addition to the labels, each EndpointSlice that is managed on behalf of a Service has an owner reference. Owner references help different parts of Kubernetes avoid interfering with objects they don't control.
Owner references in object specifications
Dependent objects have a metadata.ownerReferences field that references their owner object. A valid owner reference consists of the object name and a UID within the same namespace as the dependent object. Kubernetes sets the value of this field automatically for objects that are dependents of other objects like ReplicaSets, DaemonSets, Deployments, Jobs and CronJobs, and ReplicationControllers. You can also configure these relationships manually by changing the value of this field. However, you usually don't need to and can allow Kubernetes to automatically manage the relationships.
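A quick way to inspect these relationships; the Pod name is a placeholder:
# Print the owner references that Kubernetes set on a Pod:
kubectl get pod <pod-name> -o jsonpath='{.metadata.ownerReferences}'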
Dependent objects also have an ownerReferences.blockOwnerDeletion field that takes a boolean value and controls whether specific dependents can block garbage collection from deleting their owner object. Kubernetes automatically sets this field to true if a controller (for example, the Deployment controller) sets the value of the metadata.ownerReferences field. You can also set the value of the blockOwnerDeletion field manually to control which dependents block garbage collection.
A Kubernetes admission controller controls user access to change this field for dependent resources, based on the delete permissions of the owner. This control prevents unauthorized users from delaying owner object deletion.
Cross-namespace owner references are disallowed by design. Namespaced dependents can specify cluster-scoped or namespaced owners. A namespaced owner must exist in the same namespace as the dependent. If it does not, the owner reference is treated as absent, and the dependent is subject to deletion once all owners are verified absent.
Cluster-scoped dependents can only specify cluster-scoped owners. In v1.20+, if a cluster-scoped dependent specifies a namespaced kind as an owner, it is treated as having an unresolvable owner reference, and is not able to be garbage collected.
In v1.20+, if the garbage collector detects an invalid cross-namespace ownerReference, or a cluster-scoped dependent with an ownerReference referencing a namespaced kind, a warning Event with a reason of OwnerRefInvalidNamespace and an involvedObject of the invalid dependent is reported. You can check for that kind of Event by running:
kubectl get events -A --field-selector=reason=OwnerRefInvalidNamespace
Ownership and finalizers
When you tell Kubernetes to delete a resource, the API server allows the managing controller to process any finalizer rules for the resource. Finalizers prevent accidental deletion of resources your cluster may still need to function correctly. For example, if you try to delete a PersistentVolume that is still in use by a Pod, the deletion does not happen immediately because the PersistentVolume has the kubernetes.io/pv-protection finalizer on it. Instead, the volume remains in the Terminating status until Kubernetes clears the finalizer, which only happens after the PersistentVolume is no longer bound to a Pod.
Kubernetes also adds finalizers to an owner resource when you use either foreground or orphan cascading deletion. In foreground deletion, it adds the foreground finalizer so that the controller must delete dependent resources that also have ownerReferences.blockOwnerDeletion=true before it deletes the owner. If you specify an orphan deletion policy, Kubernetes adds the orphan finalizer so that the controller ignores dependent resources after it deletes the owner object.
What's next
- Learn more about Kubernetes finalizers.
- Learn about garbage collection.
- Read the API reference for object metadata.
3.1.4.10 - Recommended Labels
You can visualize and manage Kubernetes objects with more tools than kubectl and the dashboard. A common set of labels allows tools to work interoperably, describing objects in a common manner that all tools can understand.
In addition to supporting tooling, the recommended labels describe applications in a way that can be queried.
The metadata is organized around the concept of an application. Kubernetes is not a platform as a service (PaaS) and doesn't have or enforce a formal notion of an application. Instead, applications are informal and described with metadata. The definition of what an application contains is loose.
Shared labels and annotations share a common prefix: app.kubernetes.io. Labels without a prefix are private to users. The shared prefix ensures that shared labels do not interfere with custom user labels.
Labels
In order to take full advantage of using these labels, they should be applied on every resource object.
Key | Description | Example | Type |
---|---|---|---|
app.kubernetes.io/name | The name of the application | mysql | string |
app.kubernetes.io/instance | A unique name identifying the instance of an application | mysql-abcxzy | string |
app.kubernetes.io/version | The current version of the application (e.g., a semantic version, revision hash, etc.) | 5.7.21 | string |
app.kubernetes.io/component | The component within the architecture | database | string |
app.kubernetes.io/part-of | The name of a higher level application this one is part of | wordpress | string |
app.kubernetes.io/managed-by | The tool being used to manage the operation of an application | helm | string |
app.kubernetes.io/created-by | The controller/user who created this resource | controller-manager | string |
To illustrate these labels in action, consider the following StatefulSet object:
# This is an excerpt
apiVersion: apps/v1
kind: StatefulSet
metadata:
labels:
app.kubernetes.io/name: mysql
app.kubernetes.io/instance: mysql-abcxzy
app.kubernetes.io/version: "5.7.21"
app.kubernetes.io/component: database
app.kubernetes.io/part-of: wordpress
app.kubernetes.io/managed-by: helm
app.kubernetes.io/created-by: controller-manager
Applications And Instances Of Applications
An application can be installed one or more times into a Kubernetes cluster and, in some cases, the same namespace. For example, WordPress can be installed more than once where different websites are different installations of WordPress.
The name of an application and the instance name are recorded separately. For example, WordPress has an app.kubernetes.io/name of wordpress while it has an instance name, represented as app.kubernetes.io/instance with a value of wordpress-abcxzy. This enables the application and instance of the application to be identifiable. Every instance of an application must have a unique name.
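These labels make instances queryable; a minimal sketch using the example values above:
# Select every Deployment and Service belonging to this WordPress instance:
kubectl get deployments,services -l 'app.kubernetes.io/name=wordpress,app.kubernetes.io/instance=wordpress-abcxzy'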
Examples
To illustrate different ways to use these labels the following examples have varying complexity.
A Simple Stateless Service
Consider the case for a simple stateless service deployed using Deployment and Service objects. The following two snippets represent how the labels could be used in their simplest form.
The Deployment is used to oversee the pods running the application itself.
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
app.kubernetes.io/name: myservice
app.kubernetes.io/instance: myservice-abcxzy
...
The Service is used to expose the application.
apiVersion: v1
kind: Service
metadata:
labels:
app.kubernetes.io/name: myservice
app.kubernetes.io/instance: myservice-abcxzy
...
Web Application With A Database
Consider a slightly more complicated application: a web application (WordPress) using a database (MySQL), installed using Helm. The following snippets illustrate the start of objects used to deploy this application.
The start to the following Deployment is used for WordPress:
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
app.kubernetes.io/name: wordpress
app.kubernetes.io/instance: wordpress-abcxzy
app.kubernetes.io/version: "4.9.4"
app.kubernetes.io/managed-by: helm
app.kubernetes.io/component: server
app.kubernetes.io/part-of: wordpress
...
The Service is used to expose WordPress:
apiVersion: v1
kind: Service
metadata:
labels:
app.kubernetes.io/name: wordpress
app.kubernetes.io/instance: wordpress-abcxzy
app.kubernetes.io/version: "4.9.4"
app.kubernetes.io/managed-by: helm
app.kubernetes.io/component: server
app.kubernetes.io/part-of: wordpress
...
MySQL is exposed as a StatefulSet with metadata for both it and the larger application it belongs to:
apiVersion: apps/v1
kind: StatefulSet
metadata:
labels:
app.kubernetes.io/name: mysql
app.kubernetes.io/instance: mysql-abcxzy
app.kubernetes.io/version: "5.7.21"
app.kubernetes.io/managed-by: helm
app.kubernetes.io/component: database
app.kubernetes.io/part-of: wordpress
...
The Service is used to expose MySQL as part of WordPress:
apiVersion: v1
kind: Service
metadata:
labels:
app.kubernetes.io/name: mysql
app.kubernetes.io/instance: mysql-abcxzy
app.kubernetes.io/version: "5.7.21"
app.kubernetes.io/managed-by: helm
app.kubernetes.io/component: database
app.kubernetes.io/part-of: wordpress
...
With the MySQL StatefulSet and Service you'll notice that information about both MySQL and WordPress, the broader application, is included.
3.2 - Cluster Architecture
3.2.1 - Nodes
Kubernetes runs your workload by placing containers into Pods to run on Nodes. A node may be a virtual or physical machine, depending on the cluster. Each node is managed by the control plane and contains the services necessary to run Pods.
Typically you have several nodes in a cluster; in a learning or resource-limited environment, you might have only one node.
The components on a node include the kubelet, a container runtime, and the kube-proxy.
Management
There are two main ways to have Nodes added to the API server:
- The kubelet on a node self-registers to the control plane
- You (or another human user) manually add a Node object
After you create a Node object, or the kubelet on a node self-registers, the control plane checks whether the new Node object is valid. For example, if you try to create a Node from the following JSON manifest:
{
"kind": "Node",
"apiVersion": "v1",
"metadata": {
"name": "10.240.79.157",
"labels": {
"name": "my-first-k8s-node"
}
}
}
Kubernetes creates a Node object internally (the representation). Kubernetes checks
that a kubelet has registered to the API server that matches the metadata.name
field of the Node. If the node is healthy (i.e. all necessary services are running),
then it is eligible to run a Pod. Otherwise, that node is ignored for any cluster activity
until it becomes healthy.
Kubernetes keeps the object for the invalid Node and continues checking to see whether it becomes healthy.
You, or a controller, must explicitly delete the Node object to stop that health checking.
The name of a Node object must be a valid DNS subdomain name.
Node name uniqueness
The name identifies a Node. Two Nodes cannot have the same name at the same time. Kubernetes also assumes that a resource with the same name is the same object. In case of a Node, it is implicitly assumed that an instance using the same name will have the same state (e.g. network settings, root disk contents) and attributes like node labels. This may lead to inconsistencies if an instance was modified without changing its name. If the Node needs to be replaced or updated significantly, the existing Node object needs to be removed from API server first and re-added after the update.
Self-registration of Nodes
When the kubelet flag --register-node is true (the default), the kubelet will attempt to register itself with the API server. This is the preferred pattern, used by most distros.
For self-registration, the kubelet is started with the following options (see the sketch after this list):
- --kubeconfig - Path to credentials to authenticate itself to the API server.
- --cloud-provider - How to talk to a cloud provider to read metadata about itself.
- --register-node - Automatically register with the API server.
- --register-with-taints - Register the node with the given list of taints (comma separated <key>=<value>:<effect>). No-op if register-node is false.
- --node-ip - IP address of the node.
- --node-labels - Labels to add when registering the node in the cluster (see label restrictions enforced by the NodeRestriction admission plugin).
- --node-status-update-frequency - Specifies how often kubelet posts its node status to the API server.
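As an illustrative sketch only, a self-registering kubelet invocation might combine these flags like so; the kubeconfig path and the label and taint values are placeholders:
kubelet --kubeconfig=/var/lib/kubelet/kubeconfig \
  --register-node=true \
  --node-labels=environment=production \
  --register-with-taints=dedicated=gpu:NoSchedule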
When the Node authorization mode and NodeRestriction admission plugin are enabled, kubelets are only authorized to create/modify their own Node resource.
As mentioned in the Node name uniqueness section, when Node configuration needs to be updated, it is a good practice to re-register the node with the API server. For example, if the kubelet is restarted with a new set of --node-labels but the same Node name is used, the change will not take effect, as labels are only set at Node registration.
Pods already scheduled on the Node may misbehave or cause issues if the Node configuration is changed on kubelet restart. For example, an already running Pod may be tainted against the new labels assigned to the Node, while other Pods that are incompatible with that Pod will be scheduled based on the new labels. Node re-registration ensures all Pods will be drained and properly re-scheduled.
Manual Node administration
You can create and modify Node objects using kubectl.
When you want to create Node objects manually, set the kubelet flag --register-node=false.
You can modify Node objects regardless of the setting of --register-node.
For example, you can set labels on an existing Node or mark it unschedulable.
You can use labels on Nodes in conjunction with node selectors on Pods to control scheduling. For example, you can constrain a Pod to only be eligible to run on a subset of the available nodes.
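A minimal sketch of that pattern; the node name and the label are placeholders:
# Label an existing node so Pods with a matching nodeSelector land on it:
kubectl label nodes <node-name> disktype=ssd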
Marking a node as unschedulable prevents the scheduler from placing new pods onto that Node but does not affect existing Pods on the Node. This is useful as a preparatory step before a node reboot or other maintenance.
To mark a Node unschedulable, run:
kubectl cordon $NODENAME
See Safely Drain a Node for more details.
Node status
A Node's status contains the following information:
You can use kubectl
to view a Node's status and other details:
kubectl describe node <insert-node-name-here>
Each section of the output is described below.
Addresses
The usage of these fields varies depending on your cloud provider or bare metal configuration.
- HostName: The hostname as reported by the node's kernel. Can be overridden via the kubelet --hostname-override parameter.
- ExternalIP: Typically the IP address of the node that is externally routable (available from outside the cluster).
- InternalIP: Typically the IP address of the node that is routable only within the cluster.
Conditions
The conditions field describes the status of all Running nodes. Examples of conditions include:
Node Condition | Description |
---|---|
Ready | True if the node is healthy and ready to accept pods, False if the node is not healthy and is not accepting pods, and Unknown if the node controller has not heard from the node in the last node-monitor-grace-period (default is 40 seconds) |
DiskPressure | True if pressure exists on the disk size (that is, if the disk capacity is low); otherwise False |
MemoryPressure | True if pressure exists on the node memory (that is, if the node memory is low); otherwise False |
PIDPressure | True if pressure exists on the processes (that is, if there are too many processes on the node); otherwise False |
NetworkUnavailable | True if the network for the node is not correctly configured, otherwise False |
If you use command-line tools to print details of a cordoned Node, the Condition includes SchedulingDisabled. SchedulingDisabled is not a Condition in the Kubernetes API; instead, cordoned nodes are marked Unschedulable in their spec.
In the Kubernetes API, a node's condition is represented as part of the .status of the Node resource. For example, the following JSON structure describes a healthy node:
"conditions": [
{
"type": "Ready",
"status": "True",
"reason": "KubeletReady",
"message": "kubelet is posting ready status",
"lastHeartbeatTime": "2019-06-05T18:38:35Z",
"lastTransitionTime": "2019-06-05T11:41:27Z"
}
]
If the status of the Ready condition remains Unknown or False for longer than the pod-eviction-timeout (an argument passed to the kube-controller-manager), then the node controller triggers API-initiated eviction for all Pods assigned to that node. The default eviction timeout duration is five minutes.
In some cases when the node is unreachable, the API server is unable to communicate with the kubelet on the node. The decision to delete the pods cannot be communicated to the kubelet until communication with the API server is re-established. In the meantime, the pods that are scheduled for deletion may continue to run on the partitioned node. The node controller does not force delete pods until it is confirmed that they have stopped running in the cluster. You can see the pods that might be running on an unreachable node as being in the Terminating or Unknown state. In cases where Kubernetes cannot deduce from the underlying infrastructure if a node has permanently left a cluster, the cluster administrator may need to delete the node object by hand. Deleting the node object from Kubernetes causes all the Pod objects running on the node to be deleted from the API server and frees up their names.
When problems occur on nodes, the Kubernetes control plane automatically creates taints that match the conditions affecting the node. The scheduler takes the Node's taints into consideration when assigning a Pod to a Node. Pods can also have tolerations that let them run on a Node even though it has a specific taint.
See Taint Nodes by Condition for more details.
Capacity and Allocatable
Describes the resources available on the node: CPU, memory, and the maximum number of pods that can be scheduled onto the node.
The fields in the capacity block indicate the total amount of resources that a Node has. The allocatable block indicates the amount of resources on a Node that is available to be consumed by normal Pods.
You may read more about capacity and allocatable resources while learning how to reserve compute resources on a Node.
Info
Describes general information about the node, such as kernel version, Kubernetes version (kubelet and kube-proxy version), container runtime details, and which operating system the node uses. The kubelet gathers this information from the node and publishes it into the Kubernetes API.
Heartbeats
Heartbeats, sent by Kubernetes nodes, help your cluster determine the availability of each node, and to take action when failures are detected.
For nodes there are two forms of heartbeats:
- updates to the .status of a Node
- Lease objects within the kube-node-lease namespace. Each Node has an associated Lease object.
Compared to updates to .status of a Node, a Lease is a lightweight resource. Using Leases for heartbeats reduces the performance impact of these updates for large clusters.
The kubelet is responsible for creating and updating the .status of Nodes, and for updating their related Leases.
- The kubelet updates the node's .status either when there is change in status or if there has been no update for a configured interval. The default interval for .status updates to Nodes is 5 minutes, which is much longer than the 40 second default timeout for unreachable nodes.
- The kubelet creates and then updates its Lease object every 10 seconds (the default update interval). Lease updates occur independently from updates to the Node's .status. If the Lease update fails, the kubelet retries, using exponential backoff that starts at 200 milliseconds and is capped at 7 seconds. (See the example after this list.)
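You can observe these heartbeat Leases directly; a minimal sketch:
# Each Node has a matching Lease object in the kube-node-lease namespace:
kubectl get leases --namespace kube-node-lease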
Node controller
The node controller is a Kubernetes control plane component that manages various aspects of nodes.
The node controller has multiple roles in a node's life. The first is assigning a CIDR block to the node when it is registered (if CIDR assignment is turned on).
The second is keeping the node controller's internal list of nodes up to date with the cloud provider's list of available machines. When running in a cloud environment and whenever a node is unhealthy, the node controller asks the cloud provider if the VM for that node is still available. If not, the node controller deletes the node from its list of nodes.
The third is monitoring the nodes' health. The node controller is responsible for:
- In the case that a node becomes unreachable, updating the NodeReady condition within the Node's .status. In this case the node controller sets the NodeReady condition to ConditionUnknown.
- If a node remains unreachable: triggering API-initiated eviction for all of the Pods on the unreachable node. By default, the node controller waits 5 minutes between marking the node as ConditionUnknown and submitting the first eviction request.
The node controller checks the state of each node every --node-monitor-period seconds.
Rate limits on eviction
In most cases, the node controller limits the eviction rate to --node-eviction-rate (default 0.1) per second, meaning it won't evict pods from more than 1 node per 10 seconds.
The node eviction behavior changes when a node in a given availability zone becomes unhealthy. The node controller checks what percentage of nodes in the zone are unhealthy (NodeReady condition is ConditionUnknown or ConditionFalse) at the same time:
- If the fraction of unhealthy nodes is at least --unhealthy-zone-threshold (default 0.55), then the eviction rate is reduced.
- If the cluster is small (i.e. has less than or equal to --large-cluster-size-threshold nodes - default 50), then evictions are stopped.
- Otherwise, the eviction rate is reduced to --secondary-node-eviction-rate (default 0.01) per second.
The reason these policies are implemented per availability zone is because one availability zone might become partitioned from the control plane while the others remain connected. If your cluster does not span multiple cloud provider availability zones, then the eviction mechanism does not take per-zone unavailability into account.
A key reason for spreading your nodes across availability zones is so that the workload can be shifted to healthy zones when one entire zone goes down. Therefore, if all nodes in a zone are unhealthy, then the node controller evicts at the normal rate of --node-eviction-rate. The corner case is when all zones are completely unhealthy (none of the nodes in the cluster are healthy). In such a case, the node controller assumes that there is some problem with connectivity between the control plane and the nodes, and doesn't perform any evictions. (If there has been an outage and some nodes reappear, the node controller does evict pods from the remaining nodes that are unhealthy or unreachable.)
The node controller is also responsible for evicting pods running on nodes with NoExecute taints, unless those pods tolerate that taint. The node controller also adds taints corresponding to node problems like node unreachable or not ready. This means that the scheduler won't place Pods onto unhealthy nodes.
Resource capacity tracking
Node objects track information about the Node's resource capacity: for example, the amount of memory available and the number of CPUs. Nodes that self-register report their capacity during registration. If you manually add a Node, then you need to set the node's capacity information when you add it.
The Kubernetes scheduler ensures that there are enough resources for all the Pods on a Node. The scheduler checks that the sum of the requests of containers on the node is no greater than the node's capacity. That sum of requests includes all containers managed by the kubelet, but excludes any containers started directly by the container runtime, and also excludes any processes running outside of the kubelet's control.
Node topology
Kubernetes v1.16 [alpha]
If you have enabled the TopologyManager feature gate, then the kubelet can use topology hints when making resource assignment decisions. See Control Topology Management Policies on a Node for more information.
Graceful node shutdown
Kubernetes v1.21 [beta]
The kubelet attempts to detect node system shutdown and terminates pods running on the node.
Kubelet ensures that pods follow the normal pod termination process during the node shutdown.
The Graceful node shutdown feature depends on systemd since it takes advantage of systemd inhibitor locks to delay the node shutdown with a given duration.
Graceful node shutdown is controlled with the GracefulNodeShutdown feature gate which is enabled by default in 1.21.
Note that by default, both configuration options described below, shutdownGracePeriod and shutdownGracePeriodCriticalPods, are set to zero, thus not activating the graceful node shutdown functionality. To activate the feature, the two kubelet config settings should be configured appropriately and set to non-zero values.
During a graceful shutdown, kubelet terminates pods in two phases:
- Terminate regular pods running on the node.
- Terminate critical pods running on the node.
The graceful node shutdown feature is configured with two KubeletConfiguration options:
- shutdownGracePeriod: Specifies the total duration that the node should delay the shutdown by. This is the total grace period for pod termination for both regular and critical pods.
- shutdownGracePeriodCriticalPods: Specifies the duration used to terminate critical pods during a node shutdown. This value should be less than shutdownGracePeriod.
For example, if shutdownGracePeriod=30s, and shutdownGracePeriodCriticalPods=10s, kubelet will delay the node shutdown by 30 seconds. During the shutdown, the first 20 (30-10) seconds would be reserved for gracefully terminating normal pods, and the last 10 seconds would be reserved for terminating critical pods.
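A minimal KubeletConfiguration sketch for these example values (assuming the kubelet already loads its configuration from a file):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
shutdownGracePeriod: "30s"
shutdownGracePeriodCriticalPods: "10s"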
When pods are evicted during the graceful node shutdown, they are marked as shutdown. Running kubectl get pods shows the status of the evicted pods as Terminated. And kubectl describe pod indicates that the pod was evicted because of node shutdown:
Reason: Terminated
Message: Pod was terminated in response to imminent node shutdown.
Pod Priority based graceful node shutdown
Kubernetes v1.23 [alpha]
To provide more flexibility during graceful node shutdown around the ordering of pods during shutdown, graceful node shutdown honors the PriorityClass for Pods, provided that you enabled this feature in your cluster. The feature allows cluster administrators to explicitly define the ordering of pods during graceful node shutdown based on priority classes.
The Graceful Node Shutdown feature, as described above, shuts down pods in two phases, non-critical pods, followed by critical pods. If additional flexibility is needed to explicitly define the ordering of pods during shutdown in a more granular way, pod priority based graceful shutdown can be used.
When graceful node shutdown honors pod priorities, this makes it possible to do graceful node shutdown in multiple phases, each phase shutting down a particular priority class of pods. The kubelet can be configured with the exact phases and shutdown time per phase.
Assuming the following custom pod priority classes in a cluster,
Pod priority class name | Pod priority class value |
---|---|
custom-class-a | 100000 |
custom-class-b | 10000 |
custom-class-c | 1000 |
regular/unset | 0 |
Within the kubelet configuration the settings for shutdownGracePeriodByPodPriority could look like:
Pod priority class value | Shutdown period |
---|---|
100000 | 10 seconds |
10000 | 180 seconds |
1000 | 120 seconds |
0 | 60 seconds |
The corresponding kubelet config YAML configuration would be:
shutdownGracePeriodByPodPriority:
- priority: 100000
shutdownGracePeriodSeconds: 10
- priority: 10000
shutdownGracePeriodSeconds: 180
- priority: 1000
shutdownGracePeriodSeconds: 120
- priority: 0
shutdownGracePeriodSeconds: 60
The above table implies that any pod with priority value >= 100000 will get just 10 seconds to stop, any pod with value >= 10000 and < 100000 will get 180 seconds to stop, any pod with value >= 1000 and < 10000 will get 120 seconds to stop. Finally, all other pods will get 60 seconds to stop.
One doesn't have to specify values corresponding to all of the classes. For example, you could instead use these settings:
Pod priority class value | Shutdown period |
---|---|
100000 | 300 seconds |
1000 | 120 seconds |
0 | 60 seconds |
In the above case, the pods with custom-class-b will go into the same bucket as custom-class-c for shutdown.
If there are no pods in a particular range, then the kubelet does not wait for pods in that priority range. Instead, the kubelet immediately skips to the next priority class value range.
If this feature is enabled and no configuration is provided, then no ordering action will be taken.
Using this feature requires enabling the GracefulNodeShutdownBasedOnPodPriority feature gate, and setting the kubelet config's shutdownGracePeriodByPodPriority to the desired configuration containing the pod priority class values and their respective shutdown periods.
Swap memory management
Kubernetes v1.22 [alpha]
Prior to Kubernetes 1.22, nodes did not support the use of swap memory, and a kubelet would by default fail to start if swap was detected on a node. In 1.22 onwards, swap memory support can be enabled on a per-node basis.
To enable swap on a node, the NodeSwap feature gate must be enabled on the kubelet, and the --fail-swap-on command line flag or failSwapOn configuration setting must be set to false.
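A minimal kubelet configuration sketch for enabling swap support (the feature-gate entry is shown here for completeness; it could equally be set on the command line):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  NodeSwap: true      # enable the alpha swap support
failSwapOn: false     # allow the kubelet to start on a node with swap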
A user can also optionally configure memorySwap.swapBehavior in order to specify how a node will use swap memory. For example,
memorySwap:
swapBehavior: LimitedSwap
The available configuration options for swapBehavior are:
- LimitedSwap: Kubernetes workloads are limited in how much swap they can use. Workloads on the node not managed by Kubernetes can still swap.
- UnlimitedSwap: Kubernetes workloads can use as much swap memory as they request, up to the system limit.
If configuration for memorySwap is not specified and the feature gate is enabled, by default the kubelet will apply the same behaviour as the LimitedSwap setting.
The behaviour of the LimitedSwap setting depends on whether the node is running with v1 or v2 of control groups (also known as "cgroups"):
- cgroupsv1: Kubernetes workloads can use any combination of memory and swap, up to the pod's memory limit, if set.
- cgroupsv2: Kubernetes workloads cannot use swap memory.
For more information, and to assist with testing and provide feedback, please see KEP-2400 and its design proposal.
What's next
- Learn about the components that make up a node.
- Read the API definition for Node.
- Read the Node section of the architecture design document.
- Read about taints and tolerations.
3.2.2 - Control Plane-Node Communication
This document catalogs the communication paths between the control plane (apiserver) and the Kubernetes cluster. The intent is to allow users to customize their installation to harden the network configuration such that the cluster can be run on an untrusted network (or on fully public IPs on a cloud provider).
Node to Control Plane
Kubernetes has a "hub-and-spoke" API pattern. All API usage from nodes (or the pods they run) terminates at the apiserver. None of the other control plane components are designed to expose remote services. The apiserver is configured to listen for remote connections on a secure HTTPS port (typically 443) with one or more forms of client authentication enabled. One or more forms of authorization should be enabled, especially if anonymous requests or service account tokens are allowed.
Nodes should be provisioned with the public root certificate for the cluster such that they can connect securely to the apiserver along with valid client credentials. A good approach is that the client credentials provided to the kubelet are in the form of a client certificate. See kubelet TLS bootstrapping for automated provisioning of kubelet client certificates.
Pods that wish to connect to the apiserver can do so securely by leveraging a service account so that Kubernetes will automatically inject the public root certificate and a valid bearer token into the pod when it is instantiated.
The kubernetes service (in the default namespace) is configured with a virtual IP address that is redirected (via kube-proxy) to the HTTPS endpoint on the apiserver.
The control plane components also communicate with the cluster apiserver over the secure port.
As a result, the default operating mode for connections from the nodes and pods running on the nodes to the control plane is secured by default and can run over untrusted and/or public networks.
Control Plane to node
There are two primary communication paths from the control plane (apiserver) to the nodes. The first is from the apiserver to the kubelet process which runs on each node in the cluster. The second is from the apiserver to any node, pod, or service through the apiserver's proxy functionality.
apiserver to kubelet
The connections from the apiserver to the kubelet are used for:
- Fetching logs for pods.
- Attaching (through kubectl) to running pods.
- Providing the kubelet's port-forwarding functionality.
These connections terminate at the kubelet's HTTPS endpoint. By default, the apiserver does not verify the kubelet's serving certificate, which makes the connection subject to man-in-the-middle attacks and unsafe to run over untrusted and/or public networks.
To verify this connection, use the --kubelet-certificate-authority flag to provide the apiserver with a root certificate bundle to use to verify the kubelet's serving certificate.
If that is not possible, use SSH tunneling between the apiserver and kubelet if required to avoid connecting over an untrusted or public network.
Finally, Kubelet authentication and/or authorization should be enabled to secure the kubelet API.
apiserver to nodes, pods, and services
The connections from the apiserver to a node, pod, or service default to plain HTTP connections and are therefore neither authenticated nor encrypted. They can be run over a secure HTTPS connection by prefixing https: to the node, pod, or service name in the API URL, but they will not validate the certificate provided by the HTTPS endpoint nor provide client credentials. So while the connection will be encrypted, it will not provide any guarantees of integrity. These connections are not currently safe to run over untrusted or public networks.
SSH tunnels
Kubernetes supports SSH tunnels to protect the control plane to nodes communication paths. In this configuration, the apiserver initiates an SSH tunnel to each node in the cluster (connecting to the ssh server listening on port 22) and passes all traffic destined for a kubelet, node, pod, or service through the tunnel. This tunnel ensures that the traffic is not exposed outside of the network in which the nodes are running.
SSH tunnels are currently deprecated, so you shouldn't opt to use them unless you know what you are doing. The Konnectivity service is a replacement for this communication channel.
Konnectivity service
Kubernetes v1.18 [beta]
As a replacement to the SSH tunnels, the Konnectivity service provides TCP level proxy for the control plane to cluster communication. The Konnectivity service consists of two parts: the Konnectivity server in the control plane network and the Konnectivity agents in the nodes network. The Konnectivity agents initiate connections to the Konnectivity server and maintain the network connections. After enabling the Konnectivity service, all control plane to nodes traffic goes through these connections.
Follow the Konnectivity service task to set up the Konnectivity service in your cluster.
3.2.3 - Controllers
In robotics and automation, a control loop is a non-terminating loop that regulates the state of a system.
Here is one example of a control loop: a thermostat in a room.
When you set the temperature, that's telling the thermostat about your desired state. The actual room temperature is the current state. The thermostat acts to bring the current state closer to the desired state, by turning equipment on or off.
In Kubernetes, controllers are control loops that watch the state of your cluster, then make or request changes where needed. Each controller tries to move the current cluster state closer to the desired state.
Controller pattern
A controller tracks at least one Kubernetes resource type. These objects have a spec field that represents the desired state. The controller(s) for that resource are responsible for making the current state come closer to that desired state.
The controller might carry the action out itself; more commonly, in Kubernetes, a controller will send messages to the API server that have useful side effects. You'll see examples of this below.
Control via API server
The Job controller is an example of a Kubernetes built-in controller. Built-in controllers manage state by interacting with the cluster API server.
Job is a Kubernetes resource that runs a Pod, or perhaps several Pods, to carry out a task and then stop.
(Once scheduled, Pod objects become part of the desired state for a kubelet).
When the Job controller sees a new task it makes sure that, somewhere in your cluster, the kubelets on a set of Nodes are running the right number of Pods to get the work done. The Job controller does not run any Pods or containers itself. Instead, the Job controller tells the API server to create or remove Pods. Other components in the control plane act on the new information (there are new Pods to schedule and run), and eventually the work is done.
After you create a new Job, the desired state is for that Job to be completed. The Job controller makes the current state for that Job be nearer to your desired state: creating Pods that do the work you wanted for that Job, so that the Job is closer to completion.
Controllers also update the objects that configure them. For example: once the work is done for a Job, the Job controller updates that Job object to mark it Finished.
(This is a bit like how some thermostats turn a light off to indicate that your room is now at the temperature you set).
Direct control
By contrast with Job, some controllers need to make changes to things outside of your cluster.
For example, if you use a control loop to make sure there are enough Nodes in your cluster, then that controller needs something outside the current cluster to set up new Nodes when needed.
Controllers that interact with external state find their desired state from the API server, then communicate directly with an external system to bring the current state closer in line.
(There actually is a controller that horizontally scales the nodes in your cluster.)
The important point here is that the controller makes some change to bring about your desired state, and then reports current state back to your cluster's API server. Other control loops can observe that reported data and take their own actions.
In the thermostat example, if the room is very cold then a different controller might also turn on a frost protection heater. With Kubernetes clusters, the control plane indirectly works with IP address management tools, storage services, cloud provider APIs, and other services by extending Kubernetes to implement that.
Desired versus current state
Kubernetes takes a cloud-native view of systems, and is able to handle constant change.
Your cluster could be changing at any point as work happens and control loops automatically fix failures. This means that, potentially, your cluster never reaches a stable state.
As long as the controllers for your cluster are running and able to make useful changes, it doesn't matter if the overall state is stable or not.
Design
As a tenet of its design, Kubernetes uses lots of controllers that each manage a particular aspect of cluster state. Most commonly, a particular control loop (controller) uses one kind of resource as its desired state, and has a different kind of resource that it manages to make that desired state happen. For example, a controller for Jobs tracks Job objects (to discover new work) and Pod objects (to run the Jobs, and then to see when the work is finished). In this case something else creates the Jobs, whereas the Job controller creates Pods.
It's useful to have simple controllers rather than one, monolithic set of control loops that are interlinked. Controllers can fail, so Kubernetes is designed to allow for that.
There can be several controllers that create or update the same kind of object. Behind the scenes, Kubernetes controllers make sure that they only pay attention to the resources linked to their controlling resource.
For example, you can have Deployments and Jobs; these both create Pods. The Job controller does not delete the Pods that your Deployment created, because there is information (labels) the controllers can use to tell those Pods apart.
Ways of running controllers
Kubernetes comes with a set of built-in controllers that run inside the kube-controller-manager. These built-in controllers provide important core behaviors.
The Deployment controller and Job controller are examples of controllers that come as part of Kubernetes itself ("built-in" controllers). Kubernetes lets you run a resilient control plane, so that if any of the built-in controllers were to fail, another part of the control plane will take over the work.
You can find controllers that run outside the control plane, to extend Kubernetes. Or, if you want, you can write a new controller yourself. You can run your own controller as a set of Pods, or externally to Kubernetes. What fits best will depend on what that particular controller does.
What's next
- Read about the Kubernetes control plane
- Discover some of the basic Kubernetes objects
- Learn more about the Kubernetes API
- If you want to write your own controller, see Extension Patterns in Extending Kubernetes.
3.2.4 - Cloud Controller Manager
Kubernetes v1.11 [beta]
Cloud infrastructure technologies let you run Kubernetes on public, private, and hybrid clouds. Kubernetes believes in automated, API-driven infrastructure without tight coupling between components.
The cloud-controller-manager is a Kubernetes control plane component that embeds cloud-specific control logic. The cloud controller manager lets you link your cluster into your cloud provider's API, and separates out the components that interact with that cloud platform from components that only interact with your cluster.
By decoupling the interoperability logic between Kubernetes and the underlying cloud infrastructure, the cloud-controller-manager component enables cloud providers to release features at a different pace compared to the main Kubernetes project.
The cloud-controller-manager is structured using a plugin mechanism that allows different cloud providers to integrate their platforms with Kubernetes.
Design
The cloud controller manager runs in the control plane as a replicated set of processes (usually, these are containers in Pods). Each cloud-controller-manager implements multiple controllers in a single process.
Cloud controller manager functions
The controllers inside the cloud controller manager include:
Node controller
The node controller is responsible for updating Node objects when new servers are created in your cloud infrastructure. The node controller obtains information about the hosts running inside your tenancy with the cloud provider. The node controller performs the following functions:
- Updating the Node object with the corresponding server's unique identifier obtained from the cloud provider API.
- Annotating and labelling the Node object with cloud-specific information, such as the region the node is deployed into and the resources (CPU, memory, etc.) that it has available.
- Obtaining the node's hostname and network addresses.
- Verifying the node's health. In case a node becomes unresponsive, this controller checks with your cloud provider's API to see if the server has been deactivated / deleted / terminated. If the node has been deleted from the cloud, the controller deletes the Node object from your Kubernetes cluster.
Some cloud provider implementations split this into a node controller and a separate node lifecycle controller.
Route controller
The route controller is responsible for configuring routes in the cloud appropriately so that containers on different nodes in your Kubernetes cluster can communicate with each other.
Depending on the cloud provider, the route controller might also allocate blocks of IP addresses for the Pod network.
Service controller
Services integrate with cloud infrastructure components such as managed load balancers, IP addresses, network packet filtering, and target health checking. The service controller interacts with your cloud provider's APIs to set up load balancers and other infrastructure components when you declare a Service resource that requires them.
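For example, declaring a Service of type LoadBalancer, as sketched below, prompts the service controller to provision a load balancer with your cloud provider (the names and ports are illustrative):
apiVersion: v1
kind: Service
metadata:
  name: my-service        # illustrative name
spec:
  type: LoadBalancer      # handled by the cloud controller manager's service controller
  selector:
    app: my-app           # illustrative label
  ports:
  - port: 80
    targetPort: 8080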
Authorization
This section breaks down the access that the cloud controller manager requires on various API objects, in order to perform its operations.
Node controller
The Node controller only works with Node objects. It requires full access to read and modify Node objects.
v1/Node:
- Get
- List
- Create
- Update
- Patch
- Watch
- Delete
Route controller
The route controller listens to Node object creation and configures routes appropriately. It requires Get access to Node objects.
v1/Node:
- Get
Service controller
The service controller listens to Service object Create, Update and Delete events and then configures Endpoints for those Services appropriately.
To access Services, it requires List and Watch access. To update Services, it requires Patch and Update access.
To set up Endpoints resources for the Services, it requires access to Create, List, Get, Watch, and Update.
v1/Service:
- List
- Get
- Watch
- Patch
- Update
Others
The implementation of the core of the cloud controller manager requires access to create Event objects, and to ensure secure operation, it requires access to create ServiceAccounts.
v1/Event:
- Create
- Patch
- Update
v1/ServiceAccount:
- Create
The RBAC ClusterRole for the cloud controller manager looks like:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cloud-controller-manager
rules:
- apiGroups:
  - ""
  resources:
  - events
  verbs:
  - create
  - patch
  - update
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - '*'
- apiGroups:
  - ""
  resources:
  - nodes/status
  verbs:
  - patch
- apiGroups:
  - ""
  resources:
  - services
  verbs:
  - list
  - patch
  - update
  - watch
- apiGroups:
  - ""
  resources:
  - serviceaccounts
  verbs:
  - create
- apiGroups:
  - ""
  resources:
  - persistentvolumes
  verbs:
  - get
  - list
  - update
  - watch
- apiGroups:
  - ""
  resources:
  - endpoints
  verbs:
  - create
  - get
  - list
  - watch
  - update
What's next
Cloud Controller Manager Administration has instructions on running and managing the cloud controller manager.
To upgrade an HA control plane to use the cloud controller manager, see Migrate Replicated Control Plane To Use Cloud Controller Manager.
Want to know how to implement your own cloud controller manager, or extend an existing project?
The cloud controller manager uses Go interfaces to allow implementations from any cloud to be plugged in. Specifically, it uses the CloudProvider interface defined in cloud.go from kubernetes/cloud-provider.
The implementation of the shared controllers highlighted in this document (Node, Route, and Service), and some scaffolding along with the shared cloudprovider interface, is part of the Kubernetes core. Implementations specific to cloud providers are outside the core of Kubernetes and implement the CloudProvider interface.
For more information about developing plugins, see Developing Cloud Controller Manager.
3.2.5 - Container Runtime Interface (CRI)
The CRI is a plugin interface which enables the kubelet to use a wide variety of container runtimes, without needing to recompile the cluster components.
You need a working container runtime on each Node in your cluster, so that the kubelet can launch Pods and their containers.
The Container Runtime Interface (CRI) is the main protocol for communication between the kubelet and the container runtime. It defines the main gRPC protocol for communication between these two cluster components.
The API
Kubernetes v1.23 [stable]
The kubelet acts as a client when connecting to the container runtime via gRPC. The runtime and image service endpoints have to be available in the container runtime, and they can be configured separately within the kubelet by using the --image-service-endpoint and --container-runtime-endpoint command line flags.
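For example, a kubelet that talks to containerd might be started with a flag along these lines; the socket path shown is a common default rather than a guaranteed location:
# Point the kubelet at containerd's CRI socket (path varies by installation)
kubelet --container-runtime-endpoint=unix:///run/containerd/containerd.sock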
For Kubernetes v1.23, the kubelet prefers to use CRI v1. If a container runtime does not support v1 of the CRI, then the kubelet tries to negotiate any older supported version. The v1.23 kubelet can also negotiate CRI v1alpha2, but this version is considered as deprecated. If the kubelet cannot negotiate a supported CRI version, the kubelet gives up and doesn't register as a node.
Upgrading
When upgrading Kubernetes, the kubelet tries to automatically select the latest CRI version on restart of the component. If that fails, the fallback takes place as mentioned above. If a gRPC re-dial was required because the container runtime has been upgraded, then the container runtime must also support the initially selected version, or the re-dial is expected to fail; this requires a restart of the kubelet.
What's next
- Learn more about the CRI protocol definition
3.2.6 - Garbage Collection
Garbage collection is a collective term for the various mechanisms Kubernetes uses to clean up cluster resources. This allows the clean-up of resources like the following:
- Failed pods
- Completed Jobs
- Objects without owner references
- Unused containers and container images
- Dynamically provisioned PersistentVolumes with a StorageClass reclaim policy of Delete
- Stale or expired CertificateSigningRequests (CSRs)
- Nodes deleted in the following scenarios:
- On a cloud when the cluster uses a cloud controller manager
- On-premises when the cluster uses an addon similar to a cloud controller manager
- Node Lease objects
Owners and dependents
Many objects in Kubernetes link to each other through owner references. Owner references tell the control plane which objects are dependent on others. Kubernetes uses owner references to give the control plane, and other API clients, the opportunity to clean up related resources before deleting an object. In most cases, Kubernetes manages owner references automatically.
Ownership is different from the labels and selectors mechanism that some resources also use. For example, consider a Service that creates EndpointSlice objects. The Service uses labels to allow the control plane to determine which EndpointSlice objects are used for that Service. In addition to the labels, each EndpointSlice that is managed on behalf of a Service has an owner reference. Owner references help different parts of Kubernetes avoid interfering with objects they don't control.
Cross-namespace owner references are disallowed by design. Namespaced dependents can specify cluster-scoped or namespaced owners. A namespaced owner must exist in the same namespace as the dependent. If it does not, the owner reference is treated as absent, and the dependent is subject to deletion once all owners are verified absent.
Cluster-scoped dependents can only specify cluster-scoped owners. In v1.20+, if a cluster-scoped dependent specifies a namespaced kind as an owner, it is treated as having an unresolvable owner reference, and is not able to be garbage collected.
In v1.20+, if the garbage collector detects an invalid cross-namespace ownerReference, or a cluster-scoped dependent with an ownerReference referencing a namespaced kind, a warning Event with a reason of OwnerRefInvalidNamespace and an involvedObject of the invalid dependent is reported. You can check for that kind of Event by running:
kubectl get events -A --field-selector=reason=OwnerRefInvalidNamespace
Cascading deletion
Kubernetes checks for and deletes objects that no longer have owner references, like the pods left behind when you delete a ReplicaSet. When you delete an object, you can control whether Kubernetes deletes the object's dependents automatically, in a process called cascading deletion. There are two types of cascading deletion, as follows:
- Foreground cascading deletion
- Background cascading deletion
You can also control how and when garbage collection deletes resources that have owner references using Kubernetes finalizers.
Foreground cascading deletion
In foreground cascading deletion, the owner object you're deleting first enters a deletion in progress state. In this state, the following happens to the owner object:
- The Kubernetes API server sets the object's metadata.deletionTimestamp field to the time the object was marked for deletion.
- The Kubernetes API server also sets the metadata.finalizers field to foregroundDeletion.
- The object remains visible through the Kubernetes API until the deletion process is complete.
After the owner object enters the deletion in progress state, the controller deletes the dependents. After deleting all the dependent objects, the controller deletes the owner object. At this point, the object is no longer visible in the Kubernetes API.
During foreground cascading deletion, the only dependents that block owner deletion are those that have the ownerReference.blockOwnerDeletion=true field. See Use foreground cascading deletion to learn more.
Background cascading deletion
In background cascading deletion, the Kubernetes API server deletes the owner object immediately and the controller cleans up the dependent objects in the background. By default, Kubernetes uses background cascading deletion unless you manually use foreground deletion or choose to orphan the dependent objects.
See Use background cascading deletion to learn more.
Orphaned dependents
When Kubernetes deletes an owner object, the dependents left behind are called orphan objects. By default, Kubernetes deletes dependent objects. To learn how to override this behaviour, see Delete owner objects and orphan dependents.
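As an illustration, kubectl exposes each of these behaviors through its --cascade flag; the ReplicaSet name my-repset is hypothetical:
# Foreground: dependents are deleted first, then the owner
kubectl delete replicaset my-repset --cascade=foreground

# Background (the default): the owner is deleted immediately, dependents afterwards
kubectl delete replicaset my-repset --cascade=background

# Orphan: the owner is deleted and its dependents are left behind
kubectl delete replicaset my-repset --cascade=orphan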
Garbage collection of unused containers and images
The kubelet performs garbage collection on unused images every five minutes and on unused containers every minute. You should avoid using external garbage collection tools, as these can break the kubelet behavior and remove containers that should exist.
To configure options for unused container and image garbage collection, tune the kubelet using a configuration file and change the parameters related to garbage collection using the KubeletConfiguration resource type.
Container image lifecycle
Kubernetes manages the lifecycle of all images through its image manager, which is part of the kubelet, with the cooperation of cadvisor. The kubelet considers the following disk usage limits when making garbage collection decisions:
- HighThresholdPercent
- LowThresholdPercent
Disk usage above the configured HighThresholdPercent value triggers garbage collection, which deletes images in order based on the last time they were used, starting with the oldest first. The kubelet deletes images until disk usage reaches the LowThresholdPercent value.
Container garbage collection
The kubelet garbage collects unused containers based on the following variables, which you can define:
- MinAge: the minimum age at which the kubelet can garbage collect a container. Disable by setting to 0.
- MaxPerPodContainer: the maximum number of dead containers each Pod (that is, each (UID, container name) pair) can have. Disable by setting to less than 0.
- MaxContainers: the maximum number of dead containers the cluster can have. Disable by setting to less than 0.
In addition to these variables, the kubelet garbage collects unidentified and deleted containers, typically starting with the oldest first.
MaxPerPodContainer and MaxContainers may potentially conflict with each other in situations where retaining the maximum number of containers per Pod (MaxPerPodContainer) would go outside the allowable total of global dead containers (MaxContainers). In this situation, the kubelet adjusts MaxPerPodContainer to address the conflict. A worst-case scenario would be to downgrade MaxPerPodContainer to 1 and evict the oldest containers. Additionally, containers owned by pods that have been deleted are removed once they are older than MinAge.
Configuring garbage collection
You can tune garbage collection of resources by configuring options specific to the controllers managing those resources. The following pages show you how to configure garbage collection:
What's next
- Learn more about ownership of Kubernetes objects.
- Learn more about Kubernetes finalizers.
- Learn about the TTL controller (beta) that cleans up finished Jobs.
3.3 - Containers
Each container that you run is repeatable; the standardization from having dependencies included means that you get the same behavior wherever you run it.
Containers decouple applications from underlying host infrastructure. This makes deployment easier in different cloud or OS environments.
Container images
A container image is a ready-to-run software package, containing everything needed to run an application: the code and any runtime it requires, application and system libraries, and default values for any essential settings.
By design, a container is immutable: you cannot change the code of a container that is already running. If you have a containerized application and want to make changes, you need to build a new image that includes the change, then recreate the container to start from the updated image.
Container runtimes
The container runtime is the software that is responsible for running containers.
Kubernetes supports container runtimes such as containerd, CRI-O, and any other implementation of the Kubernetes CRI (Container Runtime Interface).
What's next
- Read about container images
- Read about Pods
3.3.1 - Images
A container image represents binary data that encapsulates an application and all its software dependencies. Container images are executable software bundles that can run standalone and that make very well defined assumptions about their runtime environment.
You typically create a container image of your application and push it to a registry before referring to it in a Pod.
This page provides an outline of the container image concept.
Image names
Container images are usually given a name such as pause, example/mycontainer, or kube-apiserver. Images can also include a registry hostname; for example: fictional.registry.example/imagename, and possibly a port number as well; for example: fictional.registry.example:10443/imagename.
If you don't specify a registry hostname, Kubernetes assumes that you mean the Docker public registry.
After the image name part you can add a tag (in the same way you would when using commands like docker or podman). Tags let you identify different versions of the same series of images.
Image tags consist of lowercase and uppercase letters, digits, underscores (_), periods (.), and dashes (-). There are additional rules about where you can place the separator characters (_, -, and .) inside an image tag.
If you don't specify a tag, Kubernetes assumes you mean the tag latest.
Updating images
When you first create a Deployment, StatefulSet, Pod, or other object that includes a Pod template, then by default the pull policy of all containers in that pod will be set to IfNotPresent if it is not explicitly specified. This policy causes the kubelet to skip pulling an image if it already exists.
Image pull policy
The imagePullPolicy for a container and the tag of the image affect when the kubelet attempts to pull (download) the specified image. Here's a list of the values you can set for imagePullPolicy and the effects these values have:
- IfNotPresent: the image is pulled only if it is not already present locally.
- Always: every time the kubelet launches a container, the kubelet queries the container image registry to resolve the name to an image digest. If the kubelet has a container image with that exact digest cached locally, the kubelet uses its cached image; otherwise, the kubelet pulls the image with the resolved digest, and uses that image to launch the container.
- Never: the kubelet does not try fetching the image. If the image is somehow already present locally, the kubelet attempts to start the container; otherwise, startup fails. See pre-pulled images for more details.
The caching semantics of the underlying image provider make even imagePullPolicy: Always efficient, as long as the registry is reliably accessible. Your container runtime can notice that the image layers already exist on the node so that they don't need to be downloaded again.
You should avoid using the :latest tag when deploying containers in production, as it is harder to track which version of the image is running and more difficult to roll back properly. Instead, specify a meaningful tag such as v1.42.0.
To make sure the Pod always uses the same version of a container image, you can specify the image's digest; replace <image-name>:<tag> with <image-name>@<digest> (for example, image@sha256:45b23dee08af5e43a7fea6c4cf9c25ccf269ee113168c19722f87876677c5cb2).
When using image tags, if the image registry were to change the code that the tag on that image represents, you might end up with a mix of Pods running the old and new code. An image digest uniquely identifies a specific version of the image, so Kubernetes runs the same code every time it starts a container with that image name and digest specified. Specifying an image by digest fixes the code that you run so that a change at the registry cannot lead to that mix of versions.
There are third-party admission controllers that mutate Pods (and pod templates) when they are created, so that the running workload is defined based on an image digest rather than a tag. That might be useful if you want to make sure that all your workload is running the same code no matter what tag changes happen at the registry.
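As a sketch, pinning a container by digest looks like this in a Pod manifest, reusing the example digest above; the Pod and container names are illustrative:
apiVersion: v1
kind: Pod
metadata:
  name: digest-pinned     # illustrative name
spec:
  containers:
  - name: app             # illustrative name
    # A registry-side tag change cannot alter what this container runs
    image: image@sha256:45b23dee08af5e43a7fea6c4cf9c25ccf269ee113168c19722f87876677c5cb2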
Default image pull policy
When you (or a controller) submit a new Pod to the API server, your cluster sets the imagePullPolicy field when specific conditions are met:
- if you omit the imagePullPolicy field, and the tag for the container image is :latest, imagePullPolicy is automatically set to Always;
- if you omit the imagePullPolicy field, and you don't specify the tag for the container image, imagePullPolicy is automatically set to Always;
- if you omit the imagePullPolicy field, and you specify a tag for the container image that isn't :latest, the imagePullPolicy is automatically set to IfNotPresent.
The value of imagePullPolicy of the container is always set when the object is first created, and is not updated if the image's tag later changes. For example, if you create a Deployment with an image whose tag is not :latest, and later update that Deployment's image to a :latest tag, the imagePullPolicy field will not change to Always. You must manually change the pull policy of any object after its initial creation.
Required image pull
If you would like to always force a pull, you can do one of the following; a minimal example follows this list:
- Set the imagePullPolicy of the container to Always.
- Omit the imagePullPolicy and use :latest as the tag for the image to use; Kubernetes will set the policy to Always when you submit the Pod.
- Omit the imagePullPolicy and the tag for the image to use; Kubernetes will set the policy to Always when you submit the Pod.
- Enable the AlwaysPullImages admission controller.
ImagePullBackOff
When a kubelet starts creating containers for a Pod using a container runtime, it might be possible the container is in a Waiting state because of ImagePullBackOff.
The status ImagePullBackOff means that a container could not start because Kubernetes could not pull a container image (for reasons such as invalid image name, or pulling from a private registry without imagePullSecret). The BackOff part indicates that Kubernetes will keep trying to pull the image, with an increasing back-off delay. Kubernetes raises the delay between each attempt until it reaches a compiled-in limit, which is 300 seconds (5 minutes).
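You can spot this state with ordinary kubectl commands; the Pod name and the sample output row below are only illustrative:
kubectl get pods
# NAME    READY   STATUS             RESTARTS   AGE
# mypod   0/1     ImagePullBackOff   0          2m

# The Events section usually explains why the pull failed
kubectl describe pod mypod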
Multi-architecture images with image indexes
As well as providing binary images, a container registry can also serve a container image index. An image index can point to multiple image manifests for architecture-specific versions of a container. The idea is that you can have a name for an image (for example: pause, example/mycontainer, kube-apiserver) and allow different systems to fetch the right binary image for the machine architecture they are using.
Kubernetes itself typically names container images with a suffix -$(ARCH). For backward compatibility, please generate the older images with suffixes. The idea is to generate, say, a pause image which has the manifest for all the arch(es), and, say, pause-amd64 which is backwards compatible for older configurations or YAML files which may have hard-coded the images with suffixes.
Using a private registry
Private registries may require keys to read images from them.
Credentials can be provided in several ways:
- Configuring Nodes to Authenticate to a Private Registry
  - all pods can read any configured private registries
  - requires node configuration by cluster administrator
- Pre-pulled Images
  - all pods can use any images cached on a node
  - requires root access to all nodes to set up
- Specifying ImagePullSecrets on a Pod
  - only pods which provide their own keys can access the private registry
- Vendor-specific or local extensions
  - if you're using a custom node configuration, you (or your cloud provider) can implement your mechanism for authenticating the node to the container registry.
These options are explained in more detail below.
Configuring nodes to authenticate to a private registry
Specific instructions for setting credentials depend on the container runtime and registry you chose to use. You should refer to your solution's documentation for the most accurate information.
For an example of configuring a private container image registry, see the Pull an Image from a Private Registry task. That example uses a private registry in Docker Hub.
Interpretation of config.json
The interpretation of config.json varies between the original Docker implementation and the Kubernetes interpretation. In Docker, the auths keys can only specify root URLs, whereas Kubernetes allows glob URLs as well as prefix-matched paths. This means that a config.json like this is valid:
{
  "auths": {
    "*my-registry.io/images": {
      "auth": "…"
    }
  }
}
The root URL (*my-registry.io) is matched by using the following syntax:
pattern:
    { term }

term:
    '*'         matches any sequence of non-Separator characters
    '?'         matches any single non-Separator character
    '[' [ '^' ] { character-range } ']'
                character class (must be non-empty)
    c           matches character c (c != '*', '?', '\\', '[')
    '\\' c      matches character c

character-range:
    c           matches character c (c != '\\', '-', ']')
    '\\' c      matches character c
    lo '-' hi   matches character c for lo <= c <= hi
Image pull operations would now pass the credentials to the CRI container runtime for every valid pattern. For example the following container image names would match successfully:
my-registry.io/images
my-registry.io/images/my-image
my-registry.io/images/another-image
sub.my-registry.io/images/my-image
a.sub.my-registry.io/images/my-image
The kubelet performs image pulls sequentially for every found credential. This means that multiple entries in config.json are possible, too:
{
  "auths": {
    "my-registry.io/images": {
      "auth": "…"
    },
    "my-registry.io/images/subpath": {
      "auth": "…"
    }
  }
}
If a container now specifies an image my-registry.io/images/subpath/my-image to be pulled, then the kubelet will try to download it using both authentication sources, falling back to the other if one of them fails.
Pre-pulled images
By default, the kubelet tries to pull each image from the specified registry. However, if the imagePullPolicy property of the container is set to IfNotPresent or Never, then a local image is used (preferentially or exclusively, respectively).
If you want to rely on pre-pulled images as a substitute for registry authentication, you must ensure all nodes in the cluster have the same pre-pulled images.
This can be used to preload certain images for speed or as an alternative to authenticating to a private registry.
All pods will have read access to any pre-pulled images.
Specifying imagePullSecrets on a Pod
Kubernetes supports specifying container image registry keys on a Pod.
Creating a Secret with a Docker config
You need to know the username, registry password and client email address for authenticating to the registry, as well as its hostname. Run the following command, substituting the appropriate uppercase values:
kubectl create secret docker-registry <name> --docker-server=DOCKER_REGISTRY_SERVER --docker-username=DOCKER_USER --docker-password=DOCKER_PASSWORD --docker-email=DOCKER_EMAIL
If you already have a Docker credentials file then, rather than using the above command, you can import the credentials file as a Kubernetes Secret. Create a Secret based on existing Docker credentials explains how to set this up.
This is particularly useful if you are using multiple private container registries, as kubectl create secret docker-registry creates a Secret that only works with a single private registry.
Referring to an imagePullSecrets on a Pod
Now, you can create pods which reference that secret by adding an imagePullSecrets section to a Pod definition. For example:
cat <<EOF > pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: foo
  namespace: awesomeapps
spec:
  containers:
  - name: foo
    image: janedoe/awesomeapp:v1
  imagePullSecrets:
  - name: myregistrykey
EOF

cat <<EOF >> ./kustomization.yaml
resources:
- pod.yaml
EOF
This needs to be done for each pod that is using a private registry.
However, setting of this field can be automated by setting the imagePullSecrets in a ServiceAccount resource.
Check Add ImagePullSecrets to a Service Account for detailed instructions.
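As a sketch of that automation, you can patch the default ServiceAccount in a namespace so that new Pods using it pick up the secret created above:
kubectl patch serviceaccount default \
  -p '{"imagePullSecrets": [{"name": "myregistrykey"}]}'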
You can use this in conjunction with a per-node .docker/config.json. The credentials will be merged.
Use cases
There are a number of solutions for configuring private registries. Here are some common use cases and suggested solutions.
- Cluster running only non-proprietary (e.g. open-source) images. No need to hide images.
  - Use public images from a public registry.
    - No configuration required.
    - Some cloud providers automatically cache or mirror public images, which improves availability and reduces the time to pull images.
- Cluster running some proprietary images which should be hidden to those outside the company, but visible to all cluster users.
  - Use a hosted private registry.
    - Manual configuration may be required on the nodes that need to access the private registry.
  - Or, run an internal private registry behind your firewall with open read access.
    - No Kubernetes configuration is required.
  - Or, use a hosted container image registry service that controls image access.
    - It will work better with cluster autoscaling than manual node configuration.
  - Or, on a cluster where changing the node configuration is inconvenient, use imagePullSecrets.
- Cluster with proprietary images, a few of which require stricter access control.
  - Ensure the AlwaysPullImages admission controller is active. Otherwise, all Pods potentially have access to all images.
  - Move sensitive data into a "Secret" resource, instead of packaging it in an image.
- A multi-tenant cluster where each tenant needs its own private registry.
  - Ensure the AlwaysPullImages admission controller is active. Otherwise, all Pods of all tenants potentially have access to all images.
  - Run a private registry with authorization required.
  - Generate registry credentials for each tenant, put them into a Secret, and populate the Secret into each tenant namespace.
  - The tenant adds that Secret to the imagePullSecrets of each namespace.
If you need access to multiple registries, you can create one secret for each registry.
What's next
- Read the OCI Image Manifest Specification.
- Learn about container image garbage collection.
- Learn more about pulling an Image from a Private Registry.
3.3.2 - Container Environment
This page describes the resources available to Containers in the Container environment.
Container environment
The Kubernetes Container environment provides several important resources to Containers:
- A filesystem, which is a combination of an image and one or more volumes.
- Information about the Container itself.
- Information about other objects in the cluster.
Container information
The hostname of a Container is the name of the Pod in which the Container is running. It is available through the hostname command or the gethostname function call in libc.
The Pod name and namespace are available as environment variables through the downward API.
User defined environment variables from the Pod definition are also available to the Container, as are any environment variables specified statically in the container image.
Cluster information
A list of all services that were running when a Container was created is available to that Container as environment variables. This list is limited to services within the same namespace as the new Container's Pod and Kubernetes control plane services.
For a service named foo that maps to a Container named bar, the following variables are defined:
FOO_SERVICE_HOST=<the host the service is running on>
FOO_SERVICE_PORT=<the port the service is running on>
Services have dedicated IP addresses and are available to the Container via DNS, if the DNS add-on is enabled.
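A quick way to inspect these variables from inside a running container; the Pod name mypod is hypothetical:
# List the service discovery variables injected into the container's environment
kubectl exec mypod -- printenv | grep SERVICE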
What's next
- Learn more about Container lifecycle hooks.
- Get hands-on experience attaching handlers to Container lifecycle events.
3.3.3 - Runtime Class
Kubernetes v1.20 [stable]
This page describes the RuntimeClass resource and runtime selection mechanism.
RuntimeClass is a feature for selecting the container runtime configuration. The container runtime configuration is used to run a Pod's containers.
Motivation
You can set a different RuntimeClass between different Pods to provide a balance of performance versus security. For example, if part of your workload deserves a high level of information security assurance, you might choose to schedule those Pods so that they run in a container runtime that uses hardware virtualization. You'd then benefit from the extra isolation of the alternative runtime, at the expense of some additional overhead.
You can also use RuntimeClass to run different Pods with the same container runtime but with different settings.
Setup
- Configure the CRI implementation on nodes (runtime dependent)
- Create the corresponding RuntimeClass resources
1. Configure the CRI implementation on nodes
The configurations available through RuntimeClass are Container Runtime Interface (CRI) implementation dependent. See the corresponding documentation (below) for your CRI implementation for how to configure.
The configurations have a corresponding handler name, referenced by the RuntimeClass. The handler must be a valid DNS label name.
2. Create the corresponding RuntimeClass resources
The configurations set up in step 1 should each have an associated handler name, which identifies the configuration. For each handler, create a corresponding RuntimeClass object.
The RuntimeClass resource currently only has 2 significant fields: the RuntimeClass name (metadata.name) and the handler (handler). The object definition looks like this:
apiVersion: node.k8s.io/v1  # RuntimeClass is defined in the node.k8s.io API group
kind: RuntimeClass
metadata:
  name: myclass  # The name the RuntimeClass will be referenced by
  # RuntimeClass is a non-namespaced resource
handler: myconfiguration  # The name of the corresponding CRI configuration
The name of a RuntimeClass object must be a valid DNS subdomain name.
Usage
Once RuntimeClasses are configured for the cluster, using them is very simple. Specify a runtimeClassName in the Pod spec. For example:
apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  runtimeClassName: myclass
  # ...
This will instruct the kubelet to use the named RuntimeClass to run this pod. If the named RuntimeClass does not exist, or the CRI cannot run the corresponding handler, the pod will enter the Failed terminal phase. Look for a corresponding event for an error message.
If no runtimeClassName is specified, the default RuntimeHandler will be used, which is equivalent to the behavior when the RuntimeClass feature is disabled.
CRI Configuration
For more details on setting up CRI runtimes, see CRI installation.
dockershim
Kubernetes v1.20 [deprecated]
Dockershim is deprecated as of Kubernetes v1.20, and will be removed in v1.24. For more information on the deprecation, see dockershim deprecation.
RuntimeClasses with dockershim must set the runtime handler to docker. Dockershim does not support custom configurable runtime handlers.
containerd
Runtime handlers are configured through containerd's configuration at /etc/containerd/config.toml. Valid handlers are configured under the runtimes section:
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.${HANDLER_NAME}]
See containerd's config documentation for more details: https://github.com/containerd/cri/blob/master/docs/config.md
CRI-O
Runtime handlers are configured through CRI-O's configuration at /etc/crio/crio.conf. Valid handlers are configured under the crio.runtime table:
[crio.runtime.runtimes.${HANDLER_NAME}]
runtime_path = "${PATH_TO_BINARY}"
See CRI-O's config documentation for more details.
Scheduling
Kubernetes v1.16 [beta]
By specifying the scheduling field for a RuntimeClass, you can set constraints to ensure that Pods running with this RuntimeClass are scheduled to nodes that support it. If scheduling is not set, this RuntimeClass is assumed to be supported by all nodes.
To ensure pods land on nodes supporting a specific RuntimeClass, that set of nodes should have a common label which is then selected by the runtimeclass.scheduling.nodeSelector field. The RuntimeClass's nodeSelector is merged with the pod's nodeSelector in admission, effectively taking the intersection of the set of nodes selected by each. If there is a conflict, the pod will be rejected.
If the supported nodes are tainted to prevent other RuntimeClass pods from running on the node, you can add tolerations to the RuntimeClass. As with the nodeSelector, the tolerations are merged with the pod's tolerations in admission, effectively taking the union of the set of nodes tolerated by each.
To learn more about configuring the node selector and tolerations, see Assigning Pods to Nodes.
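Putting those pieces together, a sketch of a RuntimeClass with scheduling constraints, assuming a hypothetical label and taint applied to the supporting nodes:
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: myclass
handler: myconfiguration
scheduling:
  nodeSelector:
    runtime: myclass          # hypothetical label on nodes that support this handler
  tolerations:
  - key: runtime              # hypothetical taint that keeps other pods off those nodes
    operator: Equal
    value: myclass
    effect: NoSchedule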
Pod Overhead
Kubernetes v1.18 [beta]
You can specify overhead resources that are associated with running a Pod. Declaring overhead allows the cluster (including the scheduler) to account for it when making decisions about Pods and resources. To use Pod overhead, you must have the PodOverhead feature gate enabled (it is on by default).
Pod overhead is defined in RuntimeClass through the overhead fields. Through the use of these fields, you can specify the overhead of running pods utilizing this RuntimeClass and ensure these overheads are accounted for in Kubernetes.
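For illustration, a RuntimeClass that declares per-Pod overhead might look like this; the resource amounts are examples only:
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: myclass
handler: myconfiguration
overhead:
  podFixed:
    memory: "120Mi"   # example extra memory consumed per Pod by this runtime
    cpu: "250m"       # example extra CPU consumed per Pod by this runtime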
What's next
- RuntimeClass Design
- RuntimeClass Scheduling Design
- Read about the Pod Overhead concept
- PodOverhead Feature Design
3.3.4 - Container Lifecycle Hooks
This page describes how kubelet managed Containers can use the Container lifecycle hook framework to run code triggered by events during their management lifecycle.
Overview
Analogous to many programming language frameworks that have component lifecycle hooks, such as Angular, Kubernetes provides Containers with lifecycle hooks. The hooks enable Containers to be aware of events in their management lifecycle and run code implemented in a handler when the corresponding lifecycle hook is executed.
Container hooks
There are two hooks that are exposed to Containers:
PostStart
This hook is executed immediately after a container is created. However, there is no guarantee that the hook will execute before the container ENTRYPOINT. No parameters are passed to the handler.
PreStop
This hook is called immediately before a container is terminated due to an API request or management event such as a liveness/startup probe failure, preemption, resource contention and others. A call to the PreStop hook fails if the container is already in a terminated or completed state, and the hook must complete before the TERM signal to stop the container can be sent. The Pod's termination grace period countdown begins before the PreStop hook is executed, so regardless of the outcome of the handler, the container will eventually terminate within the Pod's termination grace period. No parameters are passed to the handler.
A more detailed description of the termination behavior can be found in Termination of Pods.
Hook handler implementations
Containers can access a hook by implementing and registering a handler for that hook. There are two types of hook handlers that can be implemented for Containers:
- Exec - Executes a specific command, such as pre-stop.sh, inside the cgroups and namespaces of the Container. Resources consumed by the command are counted against the Container.
- HTTP - Executes an HTTP request against a specific endpoint on the Container.
Hook handler execution
When a Container lifecycle management hook is called, the Kubernetes management system executes the handler according to the hook action: httpGet and tcpSocket are executed by the kubelet process, and exec is executed in the container.
Hook handler calls are synchronous within the context of the Pod containing the Container. This means that for a PostStart hook, the Container ENTRYPOINT and hook fire asynchronously. However, if the hook takes too long to run or hangs, the Container cannot reach a running state.
PreStop hooks are not executed asynchronously from the signal to stop the Container; the hook must complete its execution before the TERM signal can be sent. If a PreStop hook hangs during execution, the Pod's phase will be Terminating and remain there until the Pod is killed after its terminationGracePeriodSeconds expires. This grace period applies to the total time it takes for both the PreStop hook to execute and for the Container to stop normally. If, for example, terminationGracePeriodSeconds is 60, and the hook takes 55 seconds to complete, and the Container takes 10 seconds to stop normally after receiving the signal, then the Container will be killed before it can stop normally, since terminationGracePeriodSeconds is less than the total time (55+10) it takes for these two things to happen.
If either a PostStart or PreStop hook fails, it kills the Container.
Users should make their hook handlers as lightweight as possible. There are cases, however, when long running commands make sense, such as when saving state prior to stopping a Container.
Hook delivery guarantees
Hook delivery is intended to be at least once,
which means that a hook may be called multiple times for any given event,
such as for PostStart
or PreStop
.
It is up to the hook implementation to handle this correctly.
Generally, only single deliveries are made. If, for example, an HTTP hook receiver is down and is unable to take traffic, there is no attempt to resend. In some rare cases, however, double delivery may occur. For instance, if a kubelet restarts in the middle of sending a hook, the hook might be resent after the kubelet comes back up.
Debugging Hook handlers
The logs for a Hook handler are not exposed in Pod events.
If a handler fails for some reason, it broadcasts an event.
For PostStart, this is the FailedPostStartHook event, and for PreStop, this is the FailedPreStopHook event.
To generate a failed FailedPostStartHook event yourself, modify the lifecycle-events.yaml file to change the postStart command to "badcommand" and apply it.
Here is some example output of the resulting events you see from running kubectl describe pod lifecycle-demo:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 7s default-scheduler Successfully assigned default/lifecycle-demo to ip-XXX-XXX-XX-XX.us-east-2...
Normal Pulled 6s kubelet Successfully pulled image "nginx" in 229.604315ms
Normal Pulling 4s (x2 over 6s) kubelet Pulling image "nginx"
Normal Created 4s (x2 over 5s) kubelet Created container lifecycle-demo-container
Normal Started 4s (x2 over 5s) kubelet Started container lifecycle-demo-container
Warning FailedPostStartHook 4s (x2 over 5s) kubelet Exec lifecycle hook ([badcommand]) for Container "lifecycle-demo-container" in Pod "lifecycle-demo_default(30229739-9651-4e5a-9a32-a8f1688862db)" failed - error: command 'badcommand' exited with 126: , message: "OCI runtime exec failed: exec failed: container_linux.go:380: starting container process caused: exec: \"badcommand\": executable file not found in $PATH: unknown\r\n"
Normal Killing 4s (x2 over 5s) kubelet FailedPostStartHook
Normal Pulled 4s kubelet Successfully pulled image "nginx" in 215.66395ms
Warning BackOff 2s (x2 over 3s) kubelet Back-off restarting failed container
What's next
- Learn more about the Container environment.
- Get hands-on experience attaching handlers to Container lifecycle events.
3.4 - Workloads
A workload is an application running on Kubernetes.
Whether your workload is a single component or several that work together, on Kubernetes you run
it inside a set of pods.
In Kubernetes, a Pod represents a set of running containers on your cluster.
Kubernetes pods have a defined lifecycle. For example, once a pod is running in your cluster then a critical fault on the node where that pod is running means that all the pods on that node fail. Kubernetes treats that level of failure as final: you would need to create a new Pod to recover, even if the node later becomes healthy.
However, to make life considerably easier, you don't need to manage each Pod directly. Instead, you can use workload resources that manage a set of pods on your behalf. These resources configure controllers that make sure the right number of the right kind of pod are running, to match the state you specified.
Kubernetes provides several built-in workload resources:
- Deployment and ReplicaSet (replacing the legacy resource ReplicationController). Deployment is a good fit for managing a stateless application workload on your cluster, where any Pod in the Deployment is interchangeable and can be replaced if needed.
- StatefulSet lets you run one or more related Pods that do track state somehow. For example, if your workload records data persistently, you can run a StatefulSet that matches each Pod with a PersistentVolume. Your code, running in the Pods for that StatefulSet, can replicate data to other Pods in the same StatefulSet to improve overall resilience.
- DaemonSet defines Pods that provide node-local facilities. These might be fundamental to the operation of your cluster, such as a networking helper tool, or be part of an add-on. Every time you add a node to your cluster that matches the specification in a DaemonSet, the control plane schedules a Pod for that DaemonSet onto the new node.
- Job and CronJob define tasks that run to completion and then stop. Jobs represent one-off tasks, whereas CronJobs recur according to a schedule.
In the wider Kubernetes ecosystem, you can find third-party workload resources that provide additional behaviors. Using a custom resource definition, you can add in a third-party workload resource if you want a specific behavior that's not part of Kubernetes' core. For example, if you wanted to run a group of Pods for your application but stop work unless all the Pods are available (perhaps for some high-throughput distributed task), then you can implement or install an extension that does provide that feature.
What's next
As well as reading about each resource, you can learn about specific tasks that relate to them:
- Run a stateless application using a Deployment
- Run a stateful application either as a single instance or as a replicated set
- Run automated tasks with a CronJob
To learn about Kubernetes' mechanisms for separating code from configuration, visit Configuration.
There are two supporting concepts that provide background about how Kubernetes manages pods for applications:
- Garbage collection tidies up objects from your cluster after their owning resource has been removed.
- The time-to-live after finished controller removes Jobs once a defined time has passed since they completed.
Once your application is running, you might want to make it available on the internet as a Service or, for web applications only, using an Ingress.
3.4.1 - Pods
Pods are the smallest deployable units of computing that you can create and manage in Kubernetes.
A Pod (as in a pod of whales or pea pod) is a group of one or more containers, with shared storage and network resources, and a specification for how to run the containers. A Pod's contents are always co-located and co-scheduled, and run in a shared context. A Pod models an application-specific "logical host": it contains one or more application containers which are relatively tightly coupled. In non-cloud contexts, applications executed on the same physical or virtual machine are analogous to cloud applications executed on the same logical host.
As well as application containers, a Pod can contain init containers that run during Pod startup. You can also inject ephemeral containers for debugging if your cluster offers this.
What is a Pod?
The shared context of a Pod is a set of Linux namespaces, cgroups, and potentially other facets of isolation - the same things that isolate a Docker container. Within a Pod's context, the individual applications may have further sub-isolations applied.
In terms of Docker concepts, a Pod is similar to a group of Docker containers with shared namespaces and shared filesystem volumes.
Using Pods
The following is an example of a Pod which consists of a container running the image nginx:1.14.2.
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx:1.14.2
    ports:
    - containerPort: 80
To create the Pod shown above, run the following command:
kubectl apply -f https://k8s.io/examples/pods/simple-pod.yaml
Pods are generally not created directly; instead, they are created using workload resources. See Working with Pods for more information on how Pods are used with workload resources.
Workload resources for managing pods
Usually you don't need to create Pods directly, even singleton Pods. Instead, create them using workload resources such as Deployment or Job. If your Pods need to track state, consider the StatefulSet resource.
Pods in a Kubernetes cluster are used in two main ways:
- Pods that run a single container. The "one-container-per-Pod" model is the most common Kubernetes use case; in this case, you can think of a Pod as a wrapper around a single container; Kubernetes manages Pods rather than managing the containers directly.
- Pods that run multiple containers that need to work together. A Pod can encapsulate an application composed of multiple co-located containers that are tightly coupled and need to share resources. These co-located containers form a single cohesive unit of service; for example, one container serving data stored in a shared volume to the public, while a separate sidecar container refreshes or updates those files. The Pod wraps these containers, storage resources, and an ephemeral network identity together as a single unit.
Note: Grouping multiple co-located and co-managed containers in a single Pod is a relatively advanced use case. You should use this pattern only in specific instances in which your containers are tightly coupled.
Each Pod is meant to run a single instance of a given application. If you want to scale your application horizontally (to provide more overall resources by running more instances), you should use multiple Pods, one for each instance. In Kubernetes, this is typically referred to as replication. Replicated Pods are usually created and managed as a group by a workload resource and its controller.
See Pods and controllers for more information on how Kubernetes uses workload resources, and their controllers, to implement application scaling and auto-healing.
How Pods manage multiple containers
Pods are designed to support multiple cooperating processes (as containers) that form a cohesive unit of service. The containers in a Pod are automatically co-located and co-scheduled on the same physical or virtual machine in the cluster. The containers can share resources and dependencies, communicate with one another, and coordinate when and how they are terminated.
For example, you might have a container that acts as a web server for files in a shared volume, and a separate "sidecar" container that updates those files from a remote source, as in the following diagram:
Some Pods have init containers as well as app containers. Init containers run and complete before the app containers are started.
Pods natively provide two kinds of shared resources for their constituent containers: networking and storage.
Working with Pods
You'll rarely create individual Pods directly in Kubernetes—even singleton Pods. This is because Pods are designed as relatively ephemeral, disposable entities. When a Pod gets created (directly by you, or indirectly by a controller), the new Pod is scheduled to run on a Node in your cluster. The Pod remains on that node until the Pod finishes execution, the Pod object is deleted, the Pod is evicted for lack of resources, or the node fails.
When you create the manifest for a Pod object, make sure the name specified is a valid DNS subdomain name.
Pods and controllers
You can use workload resources to create and manage multiple Pods for you. A controller for the resource handles replication and rollout and automatic healing in case of Pod failure. For example, if a Node fails, a controller notices that Pods on that Node have stopped working and creates a replacement Pod. The scheduler places the replacement Pod onto a healthy Node.
Here are some examples of workload resources that manage one or more Pods:
Pod templates
Controllers for workload resources create Pods from a pod template and manage those Pods on your behalf.
PodTemplates are specifications for creating Pods, and are included in workload resources such as Deployments, Jobs, and DaemonSets.
Each controller for a workload resource uses the PodTemplate inside the workload object to make actual Pods. The PodTemplate is part of the desired state of whatever workload resource you used to run your app.
The sample below is a manifest for a simple Job with a template that starts one container. The container in that Pod prints a message then pauses.
apiVersion: batch/v1
kind: Job
metadata:
  name: hello
spec:
  template:
    # This is the pod template
    spec:
      containers:
      - name: hello
        image: busybox:1.28
        command: ['sh', '-c', 'echo "Hello, Kubernetes!" && sleep 3600']
      restartPolicy: OnFailure
    # The pod template ends here
Modifying the pod template or switching to a new pod template has no direct effect on the Pods that already exist. If you change the pod template for a workload resource, that resource needs to create replacement Pods that use the updated template.
For example, the StatefulSet controller ensures that the running Pods match the current pod template for each StatefulSet object. If you edit the StatefulSet to change its pod template, the StatefulSet starts to create new Pods based on the updated template. Eventually, all of the old Pods are replaced with new Pods, and the update is complete.
Each workload resource implements its own rules for handling changes to the Pod template. If you want to read more about StatefulSet specifically, read Update strategy in the StatefulSet Basics tutorial.
On Nodes, the kubelet does not directly observe or manage any of the details around pod templates and updates; those details are abstracted away. That abstraction and separation of concerns simplifies system semantics, and makes it feasible to extend the cluster's behavior without changing existing code.
Pod update and replacement
As mentioned in the previous section, when the Pod template for a workload resource is changed, the controller creates new Pods based on the updated template instead of updating or patching the existing Pods.
Kubernetes doesn't prevent you from managing Pods directly. It is possible to update some fields of a running Pod, in place. However, Pod update operations like patch and replace have some limitations:
- Most of the metadata about a Pod is immutable. For example, you cannot change the namespace, name, uid, or creationTimestamp fields; the generation field is unique. It only accepts updates that increment the field's current value.
- If the metadata.deletionTimestamp is set, no new entry can be added to the metadata.finalizers list.
- Pod updates may not change fields other than spec.containers[*].image, spec.initContainers[*].image, spec.activeDeadlineSeconds or spec.tolerations. For spec.tolerations, you can only add new entries.
- When updating the spec.activeDeadlineSeconds field, two types of updates are allowed:
  - setting the unassigned field to a positive number;
  - updating the field from a positive number to a smaller, non-negative number.
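As an illustrative sketch of an allowed in-place update, the hypothetical Pod below (the name, image, and toleration key are assumptions, not taken from this page) touches only fields that may legally change on a running Pod:
apiVersion: v1
kind: Pod
metadata:
  name: inplace-demo            # name, namespace, uid, creationTimestamp are immutable
spec:
  activeDeadlineSeconds: 120    # may be set when unassigned, or lowered to a smaller non-negative value
  tolerations:
  - key: "example-key"          # new tolerations may be appended; existing entries cannot be changed
    operator: "Exists"
    effect: "NoSchedule"
  containers:
  - name: app
    image: nginx:1.21           # the container image may be updated in place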
Resource sharing and communication
Pods enable data sharing and communication among their constituent containers.
Storage in Pods
A Pod can specify a set of shared storage volumes. All containers in the Pod can access the shared volumes, allowing those containers to share data. Volumes also allow persistent data in a Pod to survive in case one of the containers within needs to be restarted. See Storage for more information on how Kubernetes implements shared storage and makes it available to Pods.
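To make this concrete, here is a minimal sketch (the Pod name, container names, and the choice of an emptyDir volume are assumptions for illustration) of two containers sharing data through one volume:
apiVersion: v1
kind: Pod
metadata:
  name: shared-volume-demo
spec:
  volumes:
  - name: shared-data
    emptyDir: {}                # scratch space that lives as long as this Pod
  containers:
  - name: writer
    image: busybox:1.28
    command: ['sh', '-c', 'echo hello > /data/msg && sleep 3600']
    volumeMounts:
    - name: shared-data
      mountPath: /data
  - name: reader
    image: busybox:1.28
    command: ['sh', '-c', 'until cat /data/msg; do sleep 1; done; sleep 3600']
    volumeMounts:
    - name: shared-data
      mountPath: /data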
Pod networking
Each Pod is assigned a unique IP address for each address family. Every container in a Pod shares the network namespace, including the IP address and network ports. Inside a Pod (and only then), the containers that belong to the Pod can communicate with one another using localhost. When containers in a Pod communicate with entities outside the Pod, they must coordinate how they use the shared network resources (such as ports).
Within a Pod, containers share an IP address and port space, and can find each other via localhost. The containers in a Pod can also communicate with each other using standard inter-process communications like SystemV semaphores or POSIX shared memory. Containers in different Pods have distinct IP addresses and cannot communicate by IPC without special configuration. Containers that want to interact with a container running in a different Pod can use IP networking to communicate.
Containers within the Pod see the system hostname as being the same as the configured name for the Pod. There's more about this in the networking section.
Privileged mode for containers
In Linux, any container in a Pod can enable privileged mode using the privileged (Linux) flag on the security context of the container spec. This is useful for containers that want to use operating system administrative capabilities such as manipulating the network stack or accessing hardware devices.
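A minimal sketch of where the flag sits in a manifest (the Pod name and image are assumptions):
apiVersion: v1
kind: Pod
metadata:
  name: privileged-demo
spec:
  containers:
  - name: admin-tool
    image: busybox:1.28
    command: ['sh', '-c', 'sleep 3600']
    securityContext:
      privileged: true          # grants broad host-level capabilities; use with care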
If your cluster has the WindowsHostProcessContainers feature enabled, you can create a Windows HostProcess pod by setting the windowsOptions.hostProcess flag on the security context of the pod spec. All containers in these pods must run as Windows HostProcess containers. HostProcess pods run directly on the host and can also be used to perform administrative tasks as is done with Linux privileged containers.
Static Pods
Static Pods are managed directly by the kubelet daemon on a specific node, without the API server observing them. Whereas most Pods are managed by the control plane (for example, a Deployment), for static Pods, the kubelet directly supervises each static Pod (and restarts it if it fails).
Static Pods are always bound to one Kubelet on a specific node. The main use for static Pods is to run a self-hosted control plane: in other words, using the kubelet to supervise the individual control plane components.
The kubelet automatically tries to create a mirror Pod on the Kubernetes API server for each static Pod. This means that the Pods running on a node are visible on the API server, but cannot be controlled from there.
Note: The spec of a static Pod cannot refer to other API objects (e.g., ServiceAccount, ConfigMap, Secret, etc).
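As a sketch, assuming the kubelet on a node is configured with a staticPodPath of /etc/kubernetes/manifests (a common kubeadm default), saving a manifest like the following into that directory makes the kubelet run it as a static Pod (the name and image are assumptions):
# Saved on the node, for example as /etc/kubernetes/manifests/static-web.yaml
apiVersion: v1
kind: Pod
metadata:
  name: static-web
spec:
  containers:
  - name: web
    image: nginx:1.21
    ports:
    - containerPort: 80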
Container probes
A probe is a diagnostic performed periodically by the kubelet on a container. To perform a diagnostic, the kubelet can invoke different actions:
- ExecAction (performed with the help of the container runtime)
- TCPSocketAction (checked directly by the kubelet)
- HTTPGetAction (checked directly by the kubelet)
You can read more about probes in the Pod Lifecycle documentation.
What's next
- Learn about the lifecycle of a Pod.
- Learn about RuntimeClass and how you can use it to configure different Pods with different container runtime configurations.
- Read about Pod topology spread constraints.
- Read about PodDisruptionBudget and how you can use it to manage application availability during disruptions.
- Pod is a top-level resource in the Kubernetes REST API. The Pod object definition describes the object in detail.
- The Distributed System Toolkit: Patterns for Composite Containers explains common layouts for Pods with more than one container.
To understand the context for why Kubernetes wraps a common Pod API in other resources (such as StatefulSets or Deployments), you can read about the prior art, including Aurora, Borg, Marathon, Omega, and Tupperware.
3.4.1.1 - Pod Lifecycle
This page describes the lifecycle of a Pod. Pods follow a defined lifecycle, starting in the Pending phase, moving through Running if at least one of its primary containers starts OK, and then through either the Succeeded or Failed phases depending on whether any container in the Pod terminated in failure.
Whilst a Pod is running, the kubelet is able to restart containers to handle some kinds of faults. Within a Pod, Kubernetes tracks different container states and determines what action to take to make the Pod healthy again.
In the Kubernetes API, Pods have both a specification and an actual status. The status for a Pod object consists of a set of Pod conditions. You can also inject custom readiness information into the condition data for a Pod, if that is useful to your application.
Pods are only scheduled once in their lifetime. Once a Pod is scheduled (assigned) to a Node, the Pod runs on that Node until it stops or is terminated.
Pod lifetime
Like individual application containers, Pods are considered to be relatively ephemeral (rather than durable) entities. Pods are created, assigned a unique ID (UID), and scheduled to nodes where they remain until termination (according to restart policy) or deletion. If a Node dies, the Pods scheduled to that node are scheduled for deletion after a timeout period.
Pods do not, by themselves, self-heal. If a Pod is scheduled to a node that then fails, the Pod is deleted; likewise, a Pod won't survive an eviction due to a lack of resources or Node maintenance. Kubernetes uses a higher-level abstraction, called a controller, that handles the work of managing the relatively disposable Pod instances.
A given Pod (as defined by a UID) is never "rescheduled" to a different node; instead, that Pod can be replaced by a new, near-identical Pod, with even the same name if desired, but with a different UID.
When something is said to have the same lifetime as a Pod, such as a volume, that means that the thing exists as long as that specific Pod (with that exact UID) exists. If that Pod is deleted for any reason, and even if an identical replacement is created, the related thing (a volume, in this example) is also destroyed and created anew.
Pod phase
A Pod's status field is a PodStatus object, which has a phase field.
The phase of a Pod is a simple, high-level summary of where the Pod is in its lifecycle. The phase is not intended to be a comprehensive rollup of observations of container or Pod state, nor is it intended to be a comprehensive state machine.
The number and meanings of Pod phase values are tightly guarded. Other than what is documented here, nothing should be assumed about Pods that have a given phase value.
Here are the possible values for phase:
Value | Description
---|---
Pending | The Pod has been accepted by the Kubernetes cluster, but one or more of the containers has not been set up and made ready to run. This includes time a Pod spends waiting to be scheduled as well as the time spent downloading container images over the network.
Running | The Pod has been bound to a node, and all of the containers have been created. At least one container is still running, or is in the process of starting or restarting.
Succeeded | All containers in the Pod have terminated in success, and will not be restarted.
Failed | All containers in the Pod have terminated, and at least one container has terminated in failure. That is, the container either exited with non-zero status or was terminated by the system.
Unknown | For some reason the state of the Pod could not be obtained. This phase typically occurs due to an error in communicating with the node where the Pod should be running.
Note: When a Pod is being deleted, it is shown as Terminating by some kubectl commands. This Terminating status is not one of the Pod phases. A Pod is granted a term to terminate gracefully, which defaults to 30 seconds. You can use the flag --force to terminate a Pod by force.
If a node dies or is disconnected from the rest of the cluster, Kubernetes applies a policy for setting the phase of all Pods on the lost node to Failed.
Container states
As well as the phase of the Pod overall, Kubernetes tracks the state of each container inside a Pod. You can use container lifecycle hooks to trigger events to run at certain points in a container's lifecycle.
Once the scheduler assigns a Pod to a Node, the kubelet starts creating containers for that Pod using a container runtime. There are three possible container states: Waiting, Running, and Terminated.
To check the state of a Pod's containers, you can use kubectl describe pod <name-of-pod>. The output shows the state for each container within that Pod.
Each state has a specific meaning:
Each state has a specific meaning:
Waiting
If a container is not in either the Running or Terminated state, it is Waiting. A container in the Waiting state is still running the operations it requires in order to complete start up: for example, pulling the container image from a container image registry, or applying Secret data. When you use kubectl to query a Pod with a container that is Waiting, you also see a Reason field to summarize why the container is in that state.
Running
The Running status indicates that a container is executing without issues. If there was a postStart hook configured, it has already executed and finished. When you use kubectl to query a Pod with a container that is Running, you also see information about when the container entered the Running state.
Terminated
A container in the Terminated state began execution and then either ran to completion or failed for some reason. When you use kubectl to query a Pod with a container that is Terminated, you see a reason, an exit code, and the start and finish time for that container's period of execution.
If a container has a preStop hook configured, this hook runs before the container enters the Terminated state.
Container restart policy
The spec of a Pod has a restartPolicy field with possible values Always, OnFailure, and Never. The default value is Always.
The restartPolicy applies to all containers in the Pod. restartPolicy only refers to restarts of the containers by the kubelet on the same node. After containers in a Pod exit, the kubelet restarts them with an exponential back-off delay (10s, 20s, 40s, …), that is capped at five minutes. Once a container has executed for 10 minutes without any problems, the kubelet resets the restart backoff timer for that container.
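A minimal sketch of where the field sits in a manifest (the Pod name, image, and deliberately failing command are assumptions):
apiVersion: v1
kind: Pod
metadata:
  name: restart-demo
spec:
  restartPolicy: OnFailure      # Always (the default), OnFailure, or Never
  containers:
  - name: task
    image: busybox:1.28
    command: ['sh', '-c', 'exit 1']   # exits with failure, so the kubelet restarts it with back-off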
Pod conditions
A Pod has a PodStatus, which has an array of PodConditions through which the Pod has or has not passed:
- PodScheduled: the Pod has been scheduled to a node.
- ContainersReady: all containers in the Pod are ready.
- Initialized: all init containers have completed successfully.
- Ready: the Pod is able to serve requests and should be added to the load balancing pools of all matching Services.
Field name | Description
---|---
type | Name of this Pod condition.
status | Indicates whether that condition is applicable, with possible values "True", "False", or "Unknown".
lastProbeTime | Timestamp of when the Pod condition was last probed.
lastTransitionTime | Timestamp for when the Pod last transitioned from one status to another.
reason | Machine-readable, UpperCamelCase text indicating the reason for the condition's last transition.
message | Human-readable message indicating details about the last status transition.
Pod readiness
Kubernetes v1.14 [stable]
Your application can inject extra feedback or signals into PodStatus: Pod readiness. To use this, set readinessGates in the Pod's spec to specify a list of additional conditions that the kubelet evaluates for Pod readiness.
Readiness gates are determined by the current state of status.condition fields for the Pod. If Kubernetes cannot find such a condition in the status.conditions field of a Pod, the status of the condition is defaulted to "False".
Here is an example:
kind: Pod
...
spec:
  readinessGates:
  - conditionType: "www.example.com/feature-1"
status:
  conditions:
  - type: Ready                              # a built in PodCondition
    status: "False"
    lastProbeTime: null
    lastTransitionTime: 2018-01-01T00:00:00Z
  - type: "www.example.com/feature-1"        # an extra PodCondition
    status: "False"
    lastProbeTime: null
    lastTransitionTime: 2018-01-01T00:00:00Z
  containerStatuses:
  - containerID: docker://abcd...
    ready: true
...
The Pod conditions you add must have names that meet the Kubernetes label key format.
Status for Pod readiness
The kubectl patch command does not support patching object status. To set these status.conditions for the Pod, applications and operators should use the PATCH action. You can use a Kubernetes client library to write code that sets custom Pod conditions for Pod readiness.
For a Pod that uses custom conditions, that Pod is evaluated to be ready only when both the following statements apply:
- All containers in the Pod are ready.
- All conditions specified in readinessGates are True.
When a Pod's containers are Ready but at least one custom condition is missing or False, the kubelet sets the Pod's condition to ContainersReady.
Container probes
A probe is a diagnostic performed periodically by the kubelet on a container. To perform a diagnostic, the kubelet either executes code within the container, or makes a network request.
Check mechanisms
There are four different ways to check a container using a probe. Each probe must define exactly one of these four mechanisms:
- exec: Executes a specified command inside the container. The diagnostic is considered successful if the command exits with a status code of 0.
- grpc: Performs a remote procedure call using gRPC. The target should implement gRPC health checks. The diagnostic is considered successful if the status of the response is SERVING. gRPC probes are an alpha feature and are only available if you enable the GRPCContainerProbe feature gate.
- httpGet: Performs an HTTP GET request against the Pod's IP address on a specified port and path. The diagnostic is considered successful if the response has a status code greater than or equal to 200 and less than 400.
- tcpSocket: Performs a TCP check against the Pod's IP address on a specified port. The diagnostic is considered successful if the port is open. If the remote system (the container) closes the connection immediately after it opens, this counts as healthy.
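As an illustrative sketch (the container name, image, path, and port are assumptions), here is how three of these mechanisms appear in a container spec; the alpha grpc mechanism is omitted:
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo
spec:
  containers:
  - name: app
    image: registry.example/app:1.0   # hypothetical image
    livenessProbe:
      httpGet:                        # HTTP GET against the Pod's IP
        path: /healthz
        port: 8080
    readinessProbe:
      exec:                           # command run inside the container; exit code 0 means success
        command: ["cat", "/tmp/ready"]
    startupProbe:
      tcpSocket:                      # succeeds if the port is open
        port: 8080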
Probe outcome
Each probe has one of three results:
- Success: The container passed the diagnostic.
- Failure: The container failed the diagnostic.
- Unknown: The diagnostic failed (no action should be taken, and the kubelet will make further checks).
Types of probe
The kubelet can optionally perform and react to three kinds of probes on running containers:
- livenessProbe: Indicates whether the container is running. If the liveness probe fails, the kubelet kills the container, and the container is subjected to its restart policy. If a container does not provide a liveness probe, the default state is Success.
- readinessProbe: Indicates whether the container is ready to respond to requests. If the readiness probe fails, the endpoints controller removes the Pod's IP address from the endpoints of all Services that match the Pod. The default state of readiness before the initial delay is Failure. If a container does not provide a readiness probe, the default state is Success.
- startupProbe: Indicates whether the application within the container is started. All other probes are disabled if a startup probe is provided, until it succeeds. If the startup probe fails, the kubelet kills the container, and the container is subjected to its restart policy. If a container does not provide a startup probe, the default state is Success.
For more information about how to set up a liveness, readiness, or startup probe, see Configure Liveness, Readiness and Startup Probes.
When should you use a liveness probe?
Kubernetes v1.0 [stable]
If the process in your container is able to crash on its own whenever it encounters an issue or becomes unhealthy, you do not necessarily need a liveness probe; the kubelet will automatically perform the correct action in accordance with the Pod's restartPolicy.
If you'd like your container to be killed and restarted if a probe fails, then specify a liveness probe, and specify a restartPolicy of Always or OnFailure.
When should you use a readiness probe?
Kubernetes v1.0 [stable]
If you'd like to start sending traffic to a Pod only when a probe succeeds, specify a readiness probe. In this case, the readiness probe might be the same as the liveness probe, but the existence of the readiness probe in the spec means that the Pod will start without receiving any traffic and only start receiving traffic after the probe starts succeeding.
If you want your container to be able to take itself down for maintenance, you can specify a readiness probe that checks an endpoint specific to readiness that is different from the liveness probe.
If your app has a strict dependency on back-end services, you can implement both a liveness and a readiness probe. The liveness probe passes when the app itself is healthy, but the readiness probe additionally checks that each required back-end service is available. This helps you avoid directing traffic to Pods that can only respond with error messages.
If your container needs to work on loading large data, configuration files, or migrations during startup, you can use a startup probe. However, if you want to detect the difference between an app that has failed and an app that is still processing its startup data, you might prefer a readiness probe.
When should you use a startup probe?
Kubernetes v1.20 [stable]
Startup probes are useful for Pods that have containers that take a long time to come into service. Rather than set a long liveness interval, you can configure a separate configuration for probing the container as it starts up, allowing a time longer than the liveness interval would allow.
If your container usually starts in more than initialDelaySeconds + failureThreshold × periodSeconds, you should specify a startup probe that checks the same endpoint as the liveness probe. The default for periodSeconds is 10s. You should then set its failureThreshold high enough to allow the container to start, without changing the default values of the liveness probe. This helps to protect against deadlocks.
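For instance, a sketch of a startup probe that allows up to 300 seconds (30 attempts × 10s) before the kubelet gives up (the endpoint and port are assumptions):
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30    # up to 30 failed attempts...
  periodSeconds: 10       # ...checked every 10 seconds: 300s to finish starting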
Termination of Pods
Because Pods represent processes running on nodes in the cluster, it is important to allow those processes to gracefully terminate when they are no longer needed (rather than being abruptly stopped with a KILL signal and having no chance to clean up).
The design aim is for you to be able to request deletion and know when processes terminate, but also be able to ensure that deletes eventually complete. When you request deletion of a Pod, the cluster records and tracks the intended grace period before the Pod is allowed to be forcefully killed. With that forceful shutdown tracking in place, the kubelet attempts graceful shutdown.
Typically, the container runtime sends a TERM signal to the main process in each container. Many container runtimes respect the STOPSIGNAL value defined in the container image and send this instead of TERM. Once the grace period has expired, the KILL signal is sent to any remaining processes, and the Pod is then deleted from the API Server. If the kubelet or the container runtime's management service is restarted while waiting for processes to terminate, the cluster retries from the start including the full original grace period.
An example flow:
- You use the kubectl tool to manually delete a specific Pod, with the default grace period (30 seconds).
- The Pod in the API server is updated with the time beyond which the Pod is considered "dead" along with the grace period. If you use kubectl describe to check on the Pod you're deleting, that Pod shows up as "Terminating".
- On the node where the Pod is running: as soon as the kubelet sees that a Pod has been marked as terminating (a graceful shutdown duration has been set), the kubelet begins the local Pod shutdown process.
  - If one of the Pod's containers has defined a preStop hook, the kubelet runs that hook inside of the container. If the preStop hook is still running after the grace period expires, the kubelet requests a small, one-off grace period extension of 2 seconds. Note: If the preStop hook needs longer to complete than the default grace period allows, you must modify terminationGracePeriodSeconds to suit this.
  - The kubelet triggers the container runtime to send a TERM signal to process 1 inside each container. Note: The containers in the Pod receive the TERM signal at different times and in an arbitrary order. If the order of shutdowns matters, consider using a preStop hook to synchronize.
- At the same time as the kubelet is starting graceful shutdown, the control plane removes that shutting-down Pod from Endpoints (and, if enabled, EndpointSlice) objects where these represent a Service with a configured selector. ReplicaSets and other workload resources no longer treat the shutting-down Pod as a valid, in-service replica. Pods that shut down slowly cannot continue to serve traffic as load balancers (like the service proxy) remove the Pod from the list of endpoints as soon as the termination grace period begins.
- When the grace period expires, the kubelet triggers forcible shutdown. The container runtime sends SIGKILL to any processes still running in any container in the Pod. The kubelet also cleans up a hidden pause container if that container runtime uses one.
- The kubelet triggers forcible removal of the Pod object from the API server, by setting grace period to 0 (immediate deletion).
- The API server deletes the Pod's API object, which is then no longer visible from any client.
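A sketch of the two knobs involved in this flow (the 60-second value and the sleep-based hook are assumptions for illustration):
apiVersion: v1
kind: Pod
metadata:
  name: graceful-demo
spec:
  terminationGracePeriodSeconds: 60   # overrides the 30-second default
  containers:
  - name: app
    image: nginx:1.21
    lifecycle:
      preStop:
        exec:
          command: ["sh", "-c", "sleep 5"]   # runs before TERM is sent to this container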
Forced Pod termination
By default, all deletes are graceful within 30 seconds. The kubectl delete command supports the --grace-period=<seconds> option which allows you to override the default and specify your own value.
Setting the grace period to 0 forcibly and immediately deletes the Pod from the API server. If the pod was still running on a node, that forcible deletion triggers the kubelet to begin immediate cleanup.
Note: You must specify an additional flag --force along with --grace-period=0 in order to perform force deletions.
When a force deletion is performed, the API server does not wait for confirmation from the kubelet that the Pod has been terminated on the node it was running on. It removes the Pod in the API immediately so a new Pod can be created with the same name. On the node, Pods that are set to terminate immediately will still be given a small grace period before being force killed.
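For example (substitute your own Pod name):
kubectl delete pod <pod-name> --grace-period=0 --force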
If you need to force-delete Pods that are part of a StatefulSet, refer to the task documentation for deleting Pods from a StatefulSet.
Garbage collection of failed Pods
For failed Pods, the API objects remain in the cluster's API until a human or controller process explicitly removes them.
The control plane cleans up terminated Pods (with a phase of Succeeded or Failed), when the number of Pods exceeds the configured threshold (determined by terminated-pod-gc-threshold in the kube-controller-manager). This avoids a resource leak as Pods are created and terminated over time.
What's next
- Get hands-on experience attaching handlers to container lifecycle events.
- Get hands-on experience configuring Liveness, Readiness and Startup Probes.
- Learn more about container lifecycle hooks.
- For detailed information about Pod and container status in the API, see the API reference documentation covering .status for Pod.
3.4.1.2 - Init Containers
This page provides an overview of init containers: specialized containers that run before app containers in a Pod. Init containers can contain utilities or setup scripts not present in an app image.
You can specify init containers in the Pod specification alongside the containers array (which describes app containers).
Understanding init containers
A Pod can have multiple containers running apps within it, but it can also have one or more init containers, which are run before the app containers are started.
Init containers are exactly like regular containers, except:
- Init containers always run to completion.
- Each init container must complete successfully before the next one starts.
If a Pod's init container fails, the kubelet repeatedly restarts that init container until it succeeds.
However, if the Pod has a restartPolicy of Never, and an init container fails during startup of that Pod, Kubernetes treats the overall Pod as failed.
To specify an init container for a Pod, add the initContainers field into the Pod specification, as an array of container items (similar to the app containers field and its contents). See Container in the API reference for more details.
The status of the init containers is returned in the .status.initContainerStatuses field as an array of the container statuses (similar to the .status.containerStatuses field).
Differences from regular containers
Init containers support all the fields and features of app containers, including resource limits, volumes, and security settings. However, the resource requests and limits for an init container are handled differently, as documented in Resources.
Also, init containers do not support lifecycle, livenessProbe, readinessProbe, or startupProbe because they must run to completion before the Pod can be ready.
If you specify multiple init containers for a Pod, kubelet runs each init container sequentially. Each init container must succeed before the next can run. When all of the init containers have run to completion, kubelet initializes the application containers for the Pod and runs them as usual.
Using init containers
Because init containers have separate images from app containers, they have some advantages for start-up related code:
- Init containers can contain utilities or custom code for setup that are not present in an app image. For example, there is no need to make an image FROM another image just to use a tool like sed, awk, python, or dig during setup.
- The application image builder and deployer roles can work independently without the need to jointly build a single app image.
- Init containers can run with a different view of the filesystem than app containers in the same Pod. Consequently, they can be given access to Secrets that app containers cannot access.
- Because init containers run to completion before any app containers start, init containers offer a mechanism to block or delay app container startup until a set of preconditions are met. Once preconditions are met, all of the app containers in a Pod can start in parallel.
- Init containers can securely run utilities or custom code that would otherwise make an app container image less secure. By keeping unnecessary tools separate you can limit the attack surface of your app container image.
Examples
Here are some ideas for how to use init containers:
- Wait for a Service to be created, using a shell one-line command like:
  for i in {1..100}; do sleep 1; if dig myservice; then exit 0; fi; done; exit 1
- Register this Pod with a remote server from the downward API with a command like:
  curl -X POST http://$MANAGEMENT_SERVICE_HOST:$MANAGEMENT_SERVICE_PORT/register -d 'instance=$(<POD_NAME>)&ip=$(<POD_IP>)'
- Wait for some time before starting the app container with a command like sleep 60.
- Clone a Git repository into a Volume.
- Place values into a configuration file and run a template tool to dynamically generate a configuration file for the main app container. For example, place the POD_IP value in a configuration and generate the main app configuration file using Jinja.
Init containers in use
This example defines a simple Pod that has two init containers.
The first waits for myservice, and the second waits for mydb. Once both init containers complete, the Pod runs the app container from its spec section.
apiVersion: v1
kind: Pod
metadata:
  name: myapp-pod
  labels:
    app: myapp
spec:
  containers:
  - name: myapp-container
    image: busybox:1.28
    command: ['sh', '-c', 'echo The app is running! && sleep 3600']
  initContainers:
  - name: init-myservice
    image: busybox:1.28
    command: ['sh', '-c', "until nslookup myservice.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for myservice; sleep 2; done"]
  - name: init-mydb
    image: busybox:1.28
    command: ['sh', '-c', "until nslookup mydb.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for mydb; sleep 2; done"]
You can start this Pod by running:
kubectl apply -f myapp.yaml
The output is similar to this:
pod/myapp-pod created
And check on its status with:
kubectl get -f myapp.yaml
The output is similar to this:
NAME READY STATUS RESTARTS AGE
myapp-pod 0/1 Init:0/2 0 6m
or for more details:
kubectl describe -f myapp.yaml
The output is similar to this:
Name: myapp-pod
Namespace: default
[...]
Labels: app=myapp
Status: Pending
[...]
Init Containers:
init-myservice:
[...]
State: Running
[...]
init-mydb:
[...]
State: Waiting
Reason: PodInitializing
Ready: False
[...]
Containers:
myapp-container:
[...]
State: Waiting
Reason: PodInitializing
Ready: False
[...]
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
16s 16s 1 {default-scheduler } Normal Scheduled Successfully assigned myapp-pod to 172.17.4.201
16s 16s 1 {kubelet 172.17.4.201} spec.initContainers{init-myservice} Normal Pulling pulling image "busybox"
13s 13s 1 {kubelet 172.17.4.201} spec.initContainers{init-myservice} Normal Pulled Successfully pulled image "busybox"
13s 13s 1 {kubelet 172.17.4.201} spec.initContainers{init-myservice} Normal Created Created container with docker id 5ced34a04634; Security:[seccomp=unconfined]
13s 13s 1 {kubelet 172.17.4.201} spec.initContainers{init-myservice} Normal Started Started container with docker id 5ced34a04634
To see logs for the init containers in this Pod, run:
kubectl logs myapp-pod -c init-myservice # Inspect the first init container
kubectl logs myapp-pod -c init-mydb # Inspect the second init container
At this point, those init containers will be waiting to discover Services named mydb and myservice.
Here's a configuration you can use to make those Services appear:
---
apiVersion: v1
kind: Service
metadata:
  name: myservice
spec:
  ports:
  - protocol: TCP
    port: 80
    targetPort: 9376
---
apiVersion: v1
kind: Service
metadata:
  name: mydb
spec:
  ports:
  - protocol: TCP
    port: 80
    targetPort: 9377
To create the mydb and myservice services:
kubectl apply -f services.yaml
The output is similar to this:
service/myservice created
service/mydb created
You'll then see that those init containers complete, and that the myapp-pod Pod moves into the Running state:
kubectl get -f myapp.yaml
The output is similar to this:
NAME READY STATUS RESTARTS AGE
myapp-pod 1/1 Running 0 9m
This simple example should provide some inspiration for you to create your own init containers. What's next contains a link to a more detailed example.
Detailed behavior
During Pod startup, the kubelet delays running init containers until the networking and storage are ready. Then the kubelet runs the Pod's init containers in the order they appear in the Pod's spec.
Each init container must exit successfully before the next container starts. If a container fails to start due to the runtime or exits with failure, it is retried according to the Pod restartPolicy. However, if the Pod restartPolicy is set to Always, the init containers use restartPolicy OnFailure.
A Pod cannot be Ready until all init containers have succeeded. The ports on an init container are not aggregated under a Service. A Pod that is initializing is in the Pending state but should have a condition Initialized set to false.
If the Pod restarts, or is restarted, all init containers must execute again.
Changes to the init container spec are limited to the container image field. Altering an init container image field is equivalent to restarting the Pod.
Because init containers can be restarted, retried, or re-executed, init container code should be idempotent. In particular, code that writes to files on EmptyDirs should be prepared for the possibility that an output file already exists.
Init containers have all of the fields of an app container. However, Kubernetes prohibits readinessProbe from being used because init containers cannot define readiness distinct from completion. This is enforced during validation.
Use activeDeadlineSeconds on the Pod to prevent init containers from failing forever. The active deadline includes init containers. However, it is recommended to use activeDeadlineSeconds only if teams deploy their application as a Job, because activeDeadlineSeconds has an effect even after the init containers have finished: a Pod that is already running correctly would be killed once the activeDeadlineSeconds you set elapses.
The name of each app and init container in a Pod must be unique; a validation error is thrown for any container sharing a name with another.
Resources
Given the ordering and execution for init containers, the following rules for resource usage apply:
- The highest of any particular resource request or limit defined on all init containers is the effective init request/limit. If any resource has no limit specified, this is considered the highest limit.
- The Pod's effective request/limit for a resource is the higher of:
- the sum of all app containers request/limit for a resource
- the effective init request/limit for a resource
- Scheduling is done based on effective requests/limits, which means init containers can reserve resources for initialization that are not used during the life of the Pod.
- The Pod's effective QoS (quality of service) tier is the QoS tier for init containers and app containers alike.
Quota and limits are applied based on the effective Pod request and limit.
Pod level control groups (cgroups) are based on the effective Pod request and limit, the same as the scheduler.
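As a worked sketch of these rules (the names and values are assumptions): with one init container requesting 500m of CPU, and two app containers requesting 200m and 100m, the Pod's effective CPU request is max(500m, 200m + 100m) = 500m.
apiVersion: v1
kind: Pod
metadata:
  name: resources-demo
spec:
  initContainers:
  - name: init
    image: busybox:1.28
    command: ['sh', '-c', 'echo init done']
    resources:
      requests:
        cpu: "500m"        # effective init request
  containers:
  - name: app-1
    image: busybox:1.28
    command: ['sh', '-c', 'sleep 3600']
    resources:
      requests:
        cpu: "200m"
  - name: app-2
    image: busybox:1.28
    command: ['sh', '-c', 'sleep 3600']
    resources:
      requests:
        cpu: "100m"
# Effective Pod CPU request = max(500m, 200m + 100m) = 500m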
Pod restart reasons
A Pod can restart, causing re-execution of init containers, for the following reasons:
- The Pod infrastructure container is restarted. This is uncommon and would have to be done by someone with root access to nodes.
- All containers in a Pod are terminated while restartPolicy is set to Always, forcing a restart, and the init container completion record has been lost due to garbage collection.
The Pod will not be restarted when the init container image is changed, or the init container completion record has been lost due to garbage collection. This applies for Kubernetes v1.20 and later. If you are using an earlier version of Kubernetes, consult the documentation for the version you are using.
What's next
- Read about creating a Pod that has an init container
- Learn how to debug init containers
3.4.1.3 - Pod Topology Spread Constraints
Kubernetes v1.19 [stable]
You can use topology spread constraints to control how Pods are spread across your cluster among failure-domains such as regions, zones, nodes, and other user-defined topology domains. This can help to achieve high availability as well as efficient resource utilization.
Prerequisites
Node Labels
Topology spread constraints rely on node labels to identify the topology domain(s) that each Node is in. For example, a Node might have labels: node=node1,zone=us-east-1a,region=us-east-1
Suppose you have a 4-node cluster with the following labels:
NAME STATUS ROLES AGE VERSION LABELS
node1 Ready <none> 4m26s v1.16.0 node=node1,zone=zoneA
node2 Ready <none> 3m58s v1.16.0 node=node2,zone=zoneA
node3 Ready <none> 3m17s v1.16.0 node=node3,zone=zoneB
node4 Ready <none> 2m43s v1.16.0 node=node4,zone=zoneB
Then the cluster is logically viewed as two zones: zoneA holding node1 and node2, and zoneB holding node3 and node4.
Instead of manually applying labels, you can also reuse the well-known labels that are created and populated automatically on most clusters.
Spread Constraints for Pods
API
The API field pod.spec.topologySpreadConstraints is defined as below:
apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  topologySpreadConstraints:
  - maxSkew: <integer>
    topologyKey: <string>
    whenUnsatisfiable: <string>
    labelSelector: <object>
You can define one or multiple topologySpreadConstraint entries to instruct the kube-scheduler how to place each incoming Pod in relation to the existing Pods across your cluster. The fields are:
- maxSkew describes the degree to which Pods may be unevenly distributed. It must be greater than zero. Its semantics differ according to the value of whenUnsatisfiable:
  - when whenUnsatisfiable equals "DoNotSchedule", maxSkew is the maximum permitted difference between the number of matching pods in the target topology and the global minimum (the minimum number of pods that match the label selector in a topology domain; for example, if you have 3 zones with 0, 2 and 3 matching pods respectively, the global minimum is 0).
  - when whenUnsatisfiable equals "ScheduleAnyway", the scheduler gives higher precedence to topologies that would help reduce the skew.
- topologyKey is the key of node labels. If two Nodes are labelled with this key and have identical values for that label, the scheduler treats both Nodes as being in the same topology. The scheduler tries to place a balanced number of Pods into each topology domain.
- whenUnsatisfiable indicates how to deal with a Pod if it doesn't satisfy the spread constraint:
  - DoNotSchedule (default) tells the scheduler not to schedule it.
  - ScheduleAnyway tells the scheduler to still schedule it while prioritizing nodes that minimize the skew.
- labelSelector is used to find matching Pods. Pods that match this label selector are counted to determine the number of Pods in their corresponding topology domain. See Label Selectors for more details.
When a Pod defines more than one topologySpreadConstraint, those constraints are ANDed: the kube-scheduler looks for a node for the incoming Pod that satisfies all the constraints.
You can read more about this field by running kubectl explain Pod.spec.topologySpreadConstraints.
Example: One TopologySpreadConstraint
Suppose you have a 4-node cluster where 3 Pods labeled foo:bar are located in node1, node2 and node3 respectively:
If we want an incoming Pod to be evenly spread with existing Pods across zones, the spec can be given as:
kind: Pod
apiVersion: v1
metadata:
  name: mypod
  labels:
    foo: bar
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        foo: bar
  containers:
  - name: pause
    image: k8s.gcr.io/pause:3.1
topologyKey: zone implies the even distribution will only be applied to the nodes which have label pair "zone:<any value>" present. whenUnsatisfiable: DoNotSchedule tells the scheduler to let it stay pending if the incoming Pod can't satisfy the constraint.
If the scheduler placed this incoming Pod into "zoneA", the Pods distribution would become [3, 1], hence the actual skew is 2 (3 - 1), which violates maxSkew: 1. In this example, the incoming Pod can only be placed onto "zoneB".
You can tweak the Pod spec to meet various kinds of requirements:
- Change maxSkew to a bigger value like "2" so that the incoming Pod can be placed onto "zoneA" as well.
- Change topologyKey to "node" so as to distribute the Pods evenly across nodes instead of zones. In the above example, if maxSkew remains "1", the incoming Pod can only be placed onto "node4".
- Change whenUnsatisfiable: DoNotSchedule to whenUnsatisfiable: ScheduleAnyway to ensure the incoming Pod is always schedulable (assuming other scheduling APIs are satisfied). However, it's preferred that it be placed onto the topology domain which has fewer matching Pods. (Be aware that this preference is jointly normalized with other internal scheduling priorities like resource usage ratio.)
Example: Multiple TopologySpreadConstraints
This builds upon the previous example. Suppose you have a 4-node cluster where 3 Pods labeled foo:bar are located in node1, node2 and node3 respectively:
You can use 2 TopologySpreadConstraints to control the Pods spreading on both zone and node:
kind: Pod
apiVersion: v1
metadata:
  name: mypod
  labels:
    foo: bar
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        foo: bar
  - maxSkew: 1
    topologyKey: node
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        foo: bar
  containers:
  - name: pause
    image: k8s.gcr.io/pause:3.1
In this case, to match the first constraint, the incoming Pod can only be placed onto "zoneB"; while in terms of the second constraint, the incoming Pod can only be placed onto "node4". Then the results of 2 constraints are ANDed, so the only viable option is to place on "node4".
Multiple constraints can lead to conflicts. Suppose you have a 3-node cluster across 2 zones:
If you apply "two-constraints.yaml" to this cluster, you will notice that "mypod" stays in the Pending state. This is because: to satisfy the first constraint, "mypod" can only be put onto "zoneB"; while in terms of the second constraint, "mypod" can only be put onto "node2". The joint result of "zoneB" and "node2" returns nothing.
To overcome this situation, you can either increase the maxSkew or modify one of the constraints to use whenUnsatisfiable: ScheduleAnyway.
Interaction With Node Affinity and Node Selectors
The scheduler will skip the non-matching nodes from the skew calculations if the incoming Pod has spec.nodeSelector or spec.affinity.nodeAffinity defined.
Example: TopologySpreadConstraints with NodeAffinity
Suppose you have a 5-node cluster ranging from zoneA to zoneC:
and you know that "zoneC" must be excluded. In this case, you can compose the YAML as below, so that "mypod" will be placed onto "zoneB" instead of "zoneC". Similarly, spec.nodeSelector is also respected.
kind: Pod
apiVersion: v1
metadata:
  name: mypod
  labels:
    foo: bar
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        foo: bar
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: zone
            operator: NotIn
            values:
            - zoneC
  containers:
  - name: pause
    image: k8s.gcr.io/pause:3.1
The scheduler doesn't have prior knowledge of all the zones or other topology domains that a cluster has. They are determined from the existing nodes in the cluster. This could lead to a problem in autoscaled clusters, when a node pool (or node group) is scaled to zero nodes and the user is expecting them to scale up, because, in this case, those topology domains won't be considered until there is at least one node in them.
Other Noticeable Semantics
There are some implicit conventions worth noting here:
- Only the Pods holding the same namespace as the incoming Pod can be matching candidates.
- The scheduler will bypass the nodes without topologySpreadConstraints[*].topologyKey present. This implies that:
  - the Pods located on those nodes do not impact maxSkew calculation - in the above example, suppose "node1" does not have label "zone", then the 2 Pods will be disregarded, hence the incoming Pod will be scheduled into "zoneA".
  - the incoming Pod has no chance to be scheduled onto such nodes - in the above example, suppose a "node5" carrying the label {zone-typo: zoneC} joins the cluster; it will be bypassed due to the absence of label key "zone".
- Be aware of what will happen if the incoming Pod's topologySpreadConstraints[*].labelSelector doesn't match its own labels. In the above example, if we remove the incoming Pod's labels, it can still be placed onto "zoneB" since the constraints are still satisfied. However, after the placement, the degree of imbalance of the cluster remains unchanged - it's still zoneA having 2 Pods which hold label {foo:bar}, and zoneB having 1 Pod which holds label {foo:bar}. So if this is not what you expect, we recommend the workload's topologySpreadConstraints[*].labelSelector to match its own labels.
Cluster-level default constraints
It is possible to set default topology spread constraints for a cluster. Default topology spread constraints are applied to a Pod if, and only if:
- It doesn't define any constraints in its .spec.topologySpreadConstraints.
- It belongs to a service, replication controller, replica set or stateful set.
Default constraints can be set as part of the PodTopologySpread plugin args in a scheduling profile. The constraints are specified with the same API above, except that labelSelector must be empty. The selectors are calculated from the services, replication controllers, replica sets or stateful sets that the Pod belongs to.
An example configuration might look like follows:
apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration
profiles:
- pluginConfig:
  - name: PodTopologySpread
    args:
      defaultConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
      defaultingType: List
Note: The score produced by default spreading constraints might conflict with the score produced by the SelectorSpread plugin. It is recommended that you disable this plugin in the scheduling profile when using default constraints for PodTopologySpread.
Internal default constraints
Kubernetes v1.20 [beta]
With the DefaultPodTopologySpread feature gate, enabled by default, the legacy SelectorSpread plugin is disabled. kube-scheduler uses the following default topology constraints for the PodTopologySpread plugin configuration:
defaultConstraints:
- maxSkew: 3
  topologyKey: "kubernetes.io/hostname"
  whenUnsatisfiable: ScheduleAnyway
- maxSkew: 5
  topologyKey: "topology.kubernetes.io/zone"
  whenUnsatisfiable: ScheduleAnyway
Also, the legacy SelectorSpread plugin, which provides an equivalent behavior, is disabled.
The PodTopologySpread plugin does not score the nodes that don't have the topology keys specified in the spreading constraints. This might result in a different default behavior compared to the legacy SelectorSpread plugin when using the default topology constraints.
If your nodes are not expected to have both kubernetes.io/hostname and topology.kubernetes.io/zone labels set, define your own constraints instead of using the Kubernetes defaults.
If you don't want to use the default Pod spreading constraints for your cluster, you can disable those defaults by setting defaultingType to List and leaving empty defaultConstraints in the PodTopologySpread plugin configuration:
apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration
profiles:
- pluginConfig:
  - name: PodTopologySpread
    args:
      defaultConstraints: []
      defaultingType: List
Comparison with PodAffinity/PodAntiAffinity
In Kubernetes, directives related to "Affinity" control how Pods are scheduled - more packed or more scattered.
- For PodAffinity, you can try to pack any number of Pods into qualifying topology domain(s).
- For PodAntiAffinity, only one Pod can be scheduled into a single topology domain.
For finer control, you can specify topology spread constraints to distribute Pods across different topology domains - to achieve either high availability or cost-saving. This can also help on rolling update workloads and scaling out replicas smoothly. See Motivation for more details.
Known Limitations
- There's no guarantee that the constraints remain satisfied when Pods are removed. For example, scaling down a Deployment may result in imbalanced Pods distribution. You can use Descheduler to rebalance the Pods distribution.
- Pods matched on tainted nodes are respected. See Issue 80921.
What's next
- Blog: Introducing PodTopologySpread explains maxSkew in detail, and brings up some advanced usage examples.
3.4.1.4 - Disruptions
This guide is for application owners who want to build highly available applications, and thus need to understand what types of disruptions can happen to Pods.
It is also for cluster administrators who want to perform automated cluster actions, like upgrading and autoscaling clusters.
Voluntary and involuntary disruptions
Pods do not disappear until someone (a person or a controller) destroys them, or there is an unavoidable hardware or system software error.
We call these unavoidable cases involuntary disruptions to an application. Examples are:
- a hardware failure of the physical machine backing the node
- cluster administrator deletes VM (instance) by mistake
- cloud provider or hypervisor failure makes VM disappear
- a kernel panic
- the node disappears from the cluster due to cluster network partition
- eviction of a pod due to the node being out-of-resources.
Except for the out-of-resources condition, all these conditions should be familiar to most users; they are not specific to Kubernetes.
We call other cases voluntary disruptions. These include both actions initiated by the application owner and those initiated by a Cluster Administrator. Typical application owner actions include:
- deleting the deployment or other controller that manages the pod
- updating a deployment's pod template causing a restart
- directly deleting a pod (e.g. by accident)
Cluster administrator actions include:
- Draining a node for repair or upgrade.
- Draining a node from a cluster to scale the cluster down (learn about Cluster Autoscaling).
- Removing a pod from a node to permit something else to fit on that node.
These actions might be taken directly by the cluster administrator, or by automation run by the cluster administrator, or by your cluster hosting provider.
Ask your cluster administrator or consult your cloud provider or distribution documentation to determine if any sources of voluntary disruptions are enabled for your cluster. If none are enabled, you can skip creating Pod Disruption Budgets.
Dealing with disruptions
Here are some ways to mitigate involuntary disruptions:
- Ensure your pod requests the resources it needs.
- Replicate your application if you need higher availability. (Learn about running replicated stateless and stateful applications.)
- For even higher availability when running replicated applications, spread applications across racks (using anti-affinity) or across zones (if using a multi-zone cluster.)
The frequency of voluntary disruptions varies. On a basic Kubernetes cluster, there are no automated voluntary disruptions (only user-triggered ones). However, your cluster administrator or hosting provider may run some additional services which cause voluntary disruptions. For example, rolling out node software updates can cause voluntary disruptions. Also, some implementations of cluster (node) autoscaling may cause voluntary disruptions to defragment and compact nodes. Your cluster administrator or hosting provider should have documented what level of voluntary disruptions, if any, to expect. Certain configuration options, such as using PriorityClasses in your pod spec can also cause voluntary (and involuntary) disruptions.
Pod disruption budgets
Kubernetes v1.21 [stable]
Kubernetes offers features to help you run highly available applications even when you introduce frequent voluntary disruptions.
As an application owner, you can create a PodDisruptionBudget (PDB) for each application. A PDB limits the number of Pods of a replicated application that are down simultaneously from voluntary disruptions. For example, a quorum-based application would like to ensure that the number of replicas running is never brought below the number needed for a quorum. A web front end might want to ensure that the number of replicas serving load never falls below a certain percentage of the total.
Cluster managers and hosting providers should use tools which respect PodDisruptionBudgets by calling the Eviction API instead of directly deleting pods or deployments.
For example, the kubectl drain subcommand lets you mark a node as going out of service. When you run kubectl drain, the tool tries to evict all of the Pods on the Node you're taking out of service. The eviction request that kubectl submits on your behalf may be temporarily rejected, so the tool periodically retries all failed requests until all Pods on the target node are terminated, or until a configurable timeout is reached.
A PDB specifies the number of replicas that an application can tolerate having, relative to how many it is intended to have. For example, a Deployment which has a .spec.replicas: 5 is supposed to have 5 pods at any given time. If its PDB allows for there to be 4 at a time, then the Eviction API will allow voluntary disruption of one (but not two) pods at a time.
The group of pods that comprise the application is specified using a label selector, the same as the one used by the application's controller (deployment, stateful-set, etc).
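A sketch of a PDB matching that description (the name and app label are assumptions):
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb
spec:
  maxUnavailable: 1        # for a 5-replica Deployment, equivalent to minAvailable: 4
  selector:
    matchLabels:
      app: myapp           # the same label selector the workload's controller uses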
The "intended" number of pods is computed from the .spec.replicas
of the workload resource
that is managing those pods. The control plane discovers the owning workload resource by
examining the .metadata.ownerReferences
of the Pod.
Involuntary disruptions cannot be prevented by PDBs; however they do count against the budget.
Pods which are deleted or unavailable due to a rolling upgrade to an application do count against the disruption budget, but workload resources (such as Deployment and StatefulSet) are not limited by PDBs when doing rolling upgrades. Instead, the handling of failures during application updates is configured in the spec for the specific workload resource.
When a pod is evicted using the eviction API, it is gracefully terminated, honoring the terminationGracePeriodSeconds setting in its PodSpec.
PodDisruptionBudget example
Consider a cluster with 3 nodes, node-1 through node-3. The cluster is running several applications. One of them has 3 replicas, initially called pod-a, pod-b, and pod-c. Another, unrelated pod without a PDB, called pod-x, is also shown.
Initially, the pods are laid out as follows:
node-1 | node-2 | node-3
---|---|---
pod-a available | pod-b available | pod-c available
pod-x available | |
All 3 pods are part of a deployment, and they collectively have a PDB which requires there be at least 2 of the 3 pods to be available at all times.
For example, assume the cluster administrator wants to reboot into a new kernel version to fix a bug in the kernel.
The cluster administrator first tries to drain node-1 using the kubectl drain command. That tool tries to evict pod-a and pod-x. This succeeds immediately. Both pods go into the terminating state at the same time.
This puts the cluster in this state:
node-1 draining | node-2 | node-3
---|---|---
pod-a terminating | pod-b available | pod-c available
pod-x terminating | |
The deployment notices that one of the pods is terminating, so it creates a replacement called pod-d. Since node-1 is cordoned, it lands on another node. Something has also created pod-y as a replacement for pod-x.
(Note: for a StatefulSet, pod-a, which would be called something like pod-0, would need to terminate completely before its replacement, which is also called pod-0 but has a different UID, could be created. Otherwise, the example applies to a StatefulSet as well.)
Now the cluster is in this state:
| node-1 draining | node-2 | node-3 |
|---|---|---|
| pod-a terminating | pod-b available | pod-c available |
| pod-x terminating | pod-d starting | pod-y |
At some point, the pods terminate, and the cluster looks like this:
| node-1 drained | node-2 | node-3 |
|---|---|---|
| | pod-b available | pod-c available |
| | pod-d starting | pod-y |
At this point, if an impatient cluster administrator tries to drain node-2 or node-3, the drain command will block, because there are only 2 available pods for the deployment, and its PDB requires at least 2. After some time passes, pod-d becomes available.
The cluster state now looks like this:
| node-1 drained | node-2 | node-3 |
|---|---|---|
| | pod-b available | pod-c available |
| | pod-d available | pod-y |
Now, the cluster administrator tries to drain node-2. The drain command will try to evict the two pods in some order, say pod-b first and then pod-d. It will succeed at evicting pod-b. But, when it tries to evict pod-d, it will be refused because that would leave only one pod available for the deployment.
The deployment creates a replacement for pod-b called pod-e. Because there are not enough resources in the cluster to schedule pod-e, the drain will again block. The cluster may end up in this state:
| node-1 drained | node-2 | node-3 | no node |
|---|---|---|---|
| | pod-b terminating | pod-c available | pod-e pending |
| | pod-d available | pod-y | |
At this point, the cluster administrator needs to add a node back to the cluster to proceed with the upgrade.
You can see how Kubernetes varies the rate at which disruptions can happen, according to:
- how many replicas an application needs
- how long it takes to gracefully shut down an instance
- how long it takes a new instance to start up
- the type of controller
- the cluster's resource capacity
Separating Cluster Owner and Application Owner Roles
Often, it is useful to think of the Cluster Manager and Application Owner as separate roles with limited knowledge of each other. This separation of responsibilities may make sense in these scenarios:
- when there are many application teams sharing a Kubernetes cluster, and there is natural specialization of roles
- when third-party tools or services are used to automate cluster management
Pod Disruption Budgets support this separation of roles by providing an interface between the roles.
If you do not have such a separation of responsibilities in your organization, you may not need to use Pod Disruption Budgets.
How to perform Disruptive Actions on your Cluster
If you are a Cluster Administrator, and you need to perform a disruptive action on all the nodes in your cluster, such as a node or system software upgrade, here are some options:
- Accept downtime during the upgrade.
- Failover to another complete replica cluster.
  - No downtime, but may be costly both for the duplicated nodes and for the human effort to orchestrate the switchover.
- Write disruption-tolerant applications and use PDBs.
  - No downtime.
  - Minimal resource duplication.
  - Allows more automation of cluster administration.
  - Writing disruption-tolerant applications is tricky, but the work to tolerate voluntary disruptions largely overlaps with work to support autoscaling and tolerating involuntary disruptions.
What's next
- Follow steps to protect your application by configuring a Pod Disruption Budget.
- Learn more about draining nodes.
- Learn about updating a deployment including steps to maintain its availability during the rollout.
3.4.1.5 - Ephemeral Containers
FEATURE STATE: Kubernetes v1.23 [beta]
This page provides an overview of ephemeral containers: a special type of container that runs temporarily in an existing Pod to accomplish user-initiated actions such as troubleshooting. You use ephemeral containers to inspect services rather than to build applications.
Understanding ephemeral containers
Pods are the fundamental building block of Kubernetes applications. Since Pods are intended to be disposable and replaceable, you cannot add a container to a Pod once it has been created. Instead, you usually delete and replace Pods in a controlled fashion using deployments.
However, sometimes it's necessary to inspect the state of an existing Pod, for example to troubleshoot a hard-to-reproduce bug. In these cases you can run an ephemeral container in an existing Pod to inspect its state and run arbitrary commands.
What is an ephemeral container?
Ephemeral containers differ from other containers in that they lack guarantees for resources or execution, and they will never be automatically restarted, so they are not appropriate for building applications. Ephemeral containers are described using the same ContainerSpec as regular containers, but many fields are incompatible and disallowed for ephemeral containers.
- Ephemeral containers may not have ports, so fields such as ports, livenessProbe, and readinessProbe are disallowed.
- Pod resource allocations are immutable, so setting resources is disallowed.
- For a complete list of allowed fields, see the EphemeralContainer reference documentation.
Ephemeral containers are created using a special ephemeralcontainers handler in the API rather than by adding them directly to pod.spec, so it's not possible to add an ephemeral container using kubectl edit.
Like regular containers, you may not change or remove an ephemeral container after you have added it to a Pod.
Uses for ephemeral containers
Ephemeral containers are useful for interactive troubleshooting when kubectl exec is insufficient because a container has crashed or a container image doesn't include debugging utilities.
In particular, distroless images enable you to deploy minimal container images that reduce attack surface and exposure to bugs and vulnerabilities. Since distroless images do not include a shell or any debugging utilities, it's difficult to troubleshoot distroless images using kubectl exec alone.
When using ephemeral containers, it's helpful to enable process namespace sharing so you can view processes in other containers.
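As a quick illustration (the Pod name my-pod and target container name my-app are hypothetical), kubectl debug can add an interactive ephemeral container to a running Pod:

# Attach an ephemeral busybox container to the hypothetical Pod "my-pod",
# sharing the process namespace of its existing container "my-app":
kubectl debug -it my-pod --image=busybox:1.28 --target=my-app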
What's next
- Learn how to debug pods using ephemeral containers.
3.4.2 - Workload Resources
3.4.2.1 - Deployments
A Deployment provides declarative updates for Pods and ReplicaSets.
You describe a desired state in a Deployment, and the Deployment Controller changes the actual state to the desired state at a controlled rate. You can define Deployments to create new ReplicaSets, or to remove existing Deployments and adopt all their resources with new Deployments.
Use Case
The following are typical use cases for Deployments:
- Create a Deployment to roll out a ReplicaSet. The ReplicaSet creates Pods in the background. Check the status of the rollout to see if it succeeds or not.
- Declare the new state of the Pods by updating the PodTemplateSpec of the Deployment. A new ReplicaSet is created and the Deployment manages moving the Pods from the old ReplicaSet to the new one at a controlled rate. Each new ReplicaSet updates the revision of the Deployment.
- Roll back to an earlier Deployment revision if the current state of the Deployment is not stable. Each rollback updates the revision of the Deployment.
- Scale up the Deployment to facilitate more load.
- Pause the rollout of a Deployment to apply multiple fixes to its PodTemplateSpec and then resume it to start a new rollout.
- Use the status of the Deployment as an indicator that a rollout has become stuck.
- Clean up older ReplicaSets that you don't need anymore.
Creating a Deployment
The following is an example of a Deployment. It creates a ReplicaSet to bring up three nginx Pods:
apiVersion: apps/v1
kind: Deployment
metadata:
name: nginx-deployment
labels:
app: nginx
spec:
replicas: 3
selector:
matchLabels:
app: nginx
template:
metadata:
labels:
app: nginx
spec:
containers:
- name: nginx
image: nginx:1.14.2
ports:
- containerPort: 80
In this example:

- A Deployment named nginx-deployment is created, indicated by the .metadata.name field.
- The Deployment creates three replicated Pods, indicated by the .spec.replicas field.
- The .spec.selector field defines how the Deployment finds which Pods to manage. In this case, you select a label that is defined in the Pod template (app: nginx). However, more sophisticated selection rules are possible, as long as the Pod template itself satisfies the rule.
  Note: The .spec.selector.matchLabels field is a map of {key,value} pairs. A single {key,value} in the matchLabels map is equivalent to an element of matchExpressions, whose key field is "key", the operator is "In", and the values array contains only "value". All of the requirements, from both matchLabels and matchExpressions, must be satisfied in order to match. (A sketch of this equivalence follows the list.)
- The template field contains the following sub-fields:
  - The Pods are labeled app: nginx using the .metadata.labels field.
  - The Pod template's specification, or .template.spec field, indicates that the Pods run one container, nginx, which runs the nginx Docker Hub image at version 1.14.2.
  - Create one container and name it nginx using the .spec.template.spec.containers[0].name field.
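As a sketch of the equivalence described in the note above, using the app: nginx label from the example, the following two selectors match the same Pods:

# matchLabels form:
selector:
  matchLabels:
    app: nginx

# equivalent matchExpressions form:
selector:
  matchExpressions:
  - key: app
    operator: In
    values: ["nginx"]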
Before you begin, make sure your Kubernetes cluster is up and running. Follow the steps given below to create the above Deployment:

1. Create the Deployment by running the following command:

   kubectl apply -f https://k8s.io/examples/controllers/nginx-deployment.yaml

2. Run kubectl get deployments to check if the Deployment was created.

   If the Deployment is still being created, the output is similar to the following:

   NAME               READY   UP-TO-DATE   AVAILABLE   AGE
   nginx-deployment   0/3     0            0           1s

   When you inspect the Deployments in your cluster, the following fields are displayed:

   - NAME lists the names of the Deployments in the namespace.
   - READY displays how many replicas of the application are available to your users. It follows the pattern ready/desired.
   - UP-TO-DATE displays the number of replicas that have been updated to achieve the desired state.
   - AVAILABLE displays how many replicas of the application are available to your users.
   - AGE displays the amount of time that the application has been running.

   Notice how the number of desired replicas is 3, according to the .spec.replicas field.

3. To see the Deployment rollout status, run kubectl rollout status deployment/nginx-deployment.

   The output is similar to:

   Waiting for rollout to finish: 2 out of 3 new replicas have been updated...
   deployment "nginx-deployment" successfully rolled out

4. Run kubectl get deployments again a few seconds later. The output is similar to this:

   NAME               READY   UP-TO-DATE   AVAILABLE   AGE
   nginx-deployment   3/3     3            3           18s

   Notice that the Deployment has created all three replicas, and all replicas are up-to-date (they contain the latest Pod template) and available.

5. To see the ReplicaSet (rs) created by the Deployment, run kubectl get rs. The output is similar to this:

   NAME                          DESIRED   CURRENT   READY   AGE
   nginx-deployment-75675f5897   3         3         3       18s

   ReplicaSet output shows the following fields:

   - NAME lists the names of the ReplicaSets in the namespace.
   - DESIRED displays the desired number of replicas of the application, which you define when you create the Deployment. This is the desired state.
   - CURRENT displays how many replicas are currently running.
   - READY displays how many replicas of the application are available to your users.
   - AGE displays the amount of time that the application has been running.

   Notice that the name of the ReplicaSet is always formatted as [DEPLOYMENT-NAME]-[RANDOM-STRING]. The random string is randomly generated and uses the pod-template-hash as a seed.

6. To see the labels automatically generated for each Pod, run kubectl get pods --show-labels. The output is similar to:

   NAME                                READY   STATUS    RESTARTS   AGE   LABELS
   nginx-deployment-75675f5897-7ci7o   1/1     Running   0          18s   app=nginx,pod-template-hash=3123191453
   nginx-deployment-75675f5897-kzszj   1/1     Running   0          18s   app=nginx,pod-template-hash=3123191453
   nginx-deployment-75675f5897-qqcnn   1/1     Running   0          18s   app=nginx,pod-template-hash=3123191453

   The created ReplicaSet ensures that there are three nginx Pods.
You must specify an appropriate selector and Pod template labels in a Deployment (in this case, app: nginx).
Do not overlap labels or selectors with other controllers (including other Deployments and StatefulSets). Kubernetes doesn't stop you from overlapping, and if multiple controllers have overlapping selectors those controllers might conflict and behave unexpectedly.
Pod-template-hash label
The pod-template-hash label is added by the Deployment controller to every ReplicaSet that a Deployment creates or adopts.
This label ensures that child ReplicaSets of a Deployment do not overlap. It is generated by hashing the PodTemplate of the ReplicaSet and using the resulting hash as the label value that is added to the ReplicaSet selector, Pod template labels, and in any existing Pods that the ReplicaSet might have.
Updating a Deployment
Note: A Deployment's rollout is triggered if and only if the Deployment's Pod template (that is, .spec.template) is changed, for example if the labels or container images of the template are updated. Other updates, such as scaling the Deployment, do not trigger a rollout.
Follow the steps given below to update your Deployment:
1. Let's update the nginx Pods to use the nginx:1.16.1 image instead of the nginx:1.14.2 image.

   kubectl set image deployment.v1.apps/nginx-deployment nginx=nginx:1.16.1

   or use the following command:

   kubectl set image deployment/nginx-deployment nginx=nginx:1.16.1

   The output is similar to:

   deployment.apps/nginx-deployment image updated

   Alternatively, you can edit the Deployment and change .spec.template.spec.containers[0].image from nginx:1.14.2 to nginx:1.16.1:

   kubectl edit deployment/nginx-deployment

   The output is similar to:

   deployment.apps/nginx-deployment edited

2. To see the rollout status, run:

   kubectl rollout status deployment/nginx-deployment

   The output is similar to this:

   Waiting for rollout to finish: 2 out of 3 new replicas have been updated...

   or

   deployment "nginx-deployment" successfully rolled out
Get more details on your updated Deployment:
1. After the rollout succeeds, you can view the Deployment by running kubectl get deployments. The output is similar to this:

   NAME               READY   UP-TO-DATE   AVAILABLE   AGE
   nginx-deployment   3/3     3            3           36s

2. Run kubectl get rs to see that the Deployment updated the Pods by creating a new ReplicaSet and scaling it up to 3 replicas, as well as scaling down the old ReplicaSet to 0 replicas.

   kubectl get rs

   The output is similar to this:

   NAME                          DESIRED   CURRENT   READY   AGE
   nginx-deployment-1564180365   3         3         3       6s
   nginx-deployment-2035384211   0         0         0       36s

3. Running get pods should now show only the new Pods:

   kubectl get pods

   The output is similar to this:

   NAME                                READY   STATUS    RESTARTS   AGE
   nginx-deployment-1564180365-khku8   1/1     Running   0          14s
   nginx-deployment-1564180365-nacti   1/1     Running   0          14s
   nginx-deployment-1564180365-z9gth   1/1     Running   0          14s
Next time you want to update these Pods, you only need to update the Deployment's Pod template again.
Deployment ensures that only a certain number of Pods are down while they are being updated. By default, it ensures that at least 75% of the desired number of Pods are up (25% max unavailable).
Deployment also ensures that only a certain number of Pods are created above the desired number of Pods. By default, it ensures that at most 125% of the desired number of Pods are up (25% max surge).
For example, if you look at the above Deployment closely, you will see that it first creates a new Pod, then deletes an old Pod, and creates another new one. It does not kill old Pods until a sufficient number of new Pods have come up, and does not create new Pods until a sufficient number of old Pods have been killed. It makes sure that at least 3 Pods are available and that at most 4 Pods in total are available (25% of the desired 3 replicas rounds down to 0 unavailable, and 25% rounds up to a surge of 1). In the case of a Deployment with 4 replicas, the number of Pods would be between 3 and 5.
Get details of your Deployment:

kubectl describe deployments

The output is similar to this:

Name:                   nginx-deployment
Namespace:              default
CreationTimestamp:      Thu, 30 Nov 2017 10:56:25 +0000
Labels:                 app=nginx
Annotations:            deployment.kubernetes.io/revision=2
Selector:               app=nginx
Replicas:               3 desired | 3 updated | 3 total | 3 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:  app=nginx
  Containers:
   nginx:
    Image:        nginx:1.16.1
    Port:         80/TCP
    Environment:  <none>
    Mounts:       <none>
  Volumes:        <none>
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      True    MinimumReplicasAvailable
  Progressing    True    NewReplicaSetAvailable
OldReplicaSets:  <none>
NewReplicaSet:   nginx-deployment-1564180365 (3/3 replicas created)
Events:
  Type    Reason             Age   From                   Message
  ----    ------             ----  ----                   -------
  Normal  ScalingReplicaSet  2m    deployment-controller  Scaled up replica set nginx-deployment-2035384211 to 3
  Normal  ScalingReplicaSet  24s   deployment-controller  Scaled up replica set nginx-deployment-1564180365 to 1
  Normal  ScalingReplicaSet  22s   deployment-controller  Scaled down replica set nginx-deployment-2035384211 to 2
  Normal  ScalingReplicaSet  22s   deployment-controller  Scaled up replica set nginx-deployment-1564180365 to 2
  Normal  ScalingReplicaSet  19s   deployment-controller  Scaled down replica set nginx-deployment-2035384211 to 1
  Normal  ScalingReplicaSet  19s   deployment-controller  Scaled up replica set nginx-deployment-1564180365 to 3
  Normal  ScalingReplicaSet  14s   deployment-controller  Scaled down replica set nginx-deployment-2035384211 to 0
Here you see that when you first created the Deployment, it created a ReplicaSet (nginx-deployment-2035384211) and scaled it up to 3 replicas directly. When you updated the Deployment, it created a new ReplicaSet (nginx-deployment-1564180365) and scaled it up to 1 and waited for it to come up. Then it scaled down the old ReplicaSet to 2 and scaled up the new ReplicaSet to 2 so that at least 3 Pods were available and at most 4 Pods were created at all times. It then continued scaling up and down the new and the old ReplicaSet, with the same rolling update strategy. Finally, you'll have 3 available replicas in the new ReplicaSet, and the old ReplicaSet is scaled down to 0.
Note: Kubernetes doesn't count terminating Pods when calculating the number of availableReplicas, which must be between replicas - maxUnavailable and replicas + maxSurge. As a result, you might notice that there are more Pods than expected during a rollout, and that the total resources consumed by the Deployment are more than replicas + maxSurge until the terminationGracePeriodSeconds of the terminating Pods expires.
Rollover (aka multiple updates in-flight)
Each time a new Deployment is observed by the Deployment controller, a ReplicaSet is created to bring up the desired Pods. If the Deployment is updated, the existing ReplicaSet that controls Pods whose labels match .spec.selector but whose template does not match .spec.template is scaled down. Eventually, the new ReplicaSet is scaled to .spec.replicas and all old ReplicaSets are scaled to 0.
If you update a Deployment while an existing rollout is in progress, the Deployment creates a new ReplicaSet as per the update and starts scaling that up, and rolls over the ReplicaSet that it was scaling up previously -- it will add it to its list of old ReplicaSets and start scaling it down.
For example, suppose you create a Deployment to create 5 replicas of nginx:1.14.2, but then update the Deployment to create 5 replicas of nginx:1.16.1, when only 3 replicas of nginx:1.14.2 had been created. In that case, the Deployment immediately starts killing the 3 nginx:1.14.2 Pods that it had created, and starts creating nginx:1.16.1 Pods. It does not wait for the 5 replicas of nginx:1.14.2 to be created before changing course.
Label selector updates
It is generally discouraged to make label selector updates and it is suggested to plan your selectors up front. In any case, if you need to perform a label selector update, exercise great caution and make sure you have grasped all of the implications.
Note: In API version apps/v1, a Deployment's label selector is immutable after it gets created.
- Selector additions require the Pod template labels in the Deployment spec to be updated with the new label too, otherwise a validation error is returned. This change is a non-overlapping one, meaning that the new selector does not select ReplicaSets and Pods created with the old selector, resulting in orphaning all old ReplicaSets and creating a new ReplicaSet.
- Selector updates change the existing value in a selector key -- they result in the same behavior as additions.
- Selector removals remove an existing key from the Deployment selector -- they do not require any changes in the Pod template labels. Existing ReplicaSets are not orphaned, and a new ReplicaSet is not created, but note that the removed label still exists in any existing Pods and ReplicaSets.
Rolling Back a Deployment
Sometimes, you may want to roll back a Deployment; for example, when the Deployment is not stable, such as crash looping. By default, all of the Deployment's rollout history is kept in the system so that you can roll back anytime you want (you can change that by modifying the revision history limit).
Note: A Deployment revision is created only when the Deployment's Pod template (.spec.template) is changed, for example if you update the labels or container images of the template. Other updates, such as scaling the Deployment, do not create a Deployment revision, so that you can facilitate simultaneous manual- or auto-scaling. This means that when you roll back to an earlier revision, only the Deployment's Pod template part is rolled back.
1. Suppose that you made a typo while updating the Deployment, by putting the image name as nginx:1.161 instead of nginx:1.16.1:

   kubectl set image deployment/nginx-deployment nginx=nginx:1.161

   The output is similar to this:

   deployment.apps/nginx-deployment image updated

2. The rollout gets stuck. You can verify it by checking the rollout status:

   kubectl rollout status deployment/nginx-deployment

   The output is similar to this:

   Waiting for rollout to finish: 1 out of 3 new replicas have been updated...

3. Press Ctrl-C to stop the above rollout status watch. For more information on stuck rollouts, read more here.

4. You see that the number of old replicas (from nginx-deployment-1564180365 and nginx-deployment-2035384211) adds up to 3, and the number of new replicas (from nginx-deployment-3066724191) is 1.

   kubectl get rs

   The output is similar to this:

   NAME                          DESIRED   CURRENT   READY   AGE
   nginx-deployment-1564180365   3         3         3       25s
   nginx-deployment-2035384211   0         0         0       36s
   nginx-deployment-3066724191   1         1         0       6s

5. Looking at the Pods created, you see that 1 Pod created by the new ReplicaSet is stuck in an image pull loop.

   kubectl get pods

   The output is similar to this:

   NAME                                READY   STATUS             RESTARTS   AGE
   nginx-deployment-1564180365-70iae   1/1     Running            0          25s
   nginx-deployment-1564180365-jbqqo   1/1     Running            0          25s
   nginx-deployment-1564180365-hysrc   1/1     Running            0          25s
   nginx-deployment-3066724191-08mng   0/1     ImagePullBackOff   0          6s

   Note: The Deployment controller stops the bad rollout automatically, and stops scaling up the new ReplicaSet. This depends on the rollingUpdate parameters (maxUnavailable specifically) that you have specified. Kubernetes by default sets the value to 25%.
6. Get the description of the Deployment:

   kubectl describe deployment

   The output is similar to this:

   Name:                   nginx-deployment
   Namespace:              default
   CreationTimestamp:      Tue, 15 Mar 2016 14:48:04 -0700
   Labels:                 app=nginx
   Selector:               app=nginx
   Replicas:               3 desired | 1 updated | 4 total | 3 available | 1 unavailable
   StrategyType:           RollingUpdate
   MinReadySeconds:        0
   RollingUpdateStrategy:  25% max unavailable, 25% max surge
   Pod Template:
     Labels:  app=nginx
     Containers:
      nginx:
       Image:        nginx:1.161
       Port:         80/TCP
       Host Port:    0/TCP
       Environment:  <none>
       Mounts:       <none>
     Volumes:        <none>
   Conditions:
     Type           Status  Reason
     ----           ------  ------
     Available      True    MinimumReplicasAvailable
     Progressing    True    ReplicaSetUpdated
   OldReplicaSets:  nginx-deployment-1564180365 (3/3 replicas created)
   NewReplicaSet:   nginx-deployment-3066724191 (1/1 replicas created)
   Events:
     FirstSeen  LastSeen  Count  From                     SubObjectPath  Type    Reason             Message
     ---------  --------  -----  ----                     -------------  ------  ------             -------
     1m         1m        1      {deployment-controller}                 Normal  ScalingReplicaSet  Scaled up replica set nginx-deployment-2035384211 to 3
     22s        22s       1      {deployment-controller}                 Normal  ScalingReplicaSet  Scaled up replica set nginx-deployment-1564180365 to 1
     22s        22s       1      {deployment-controller}                 Normal  ScalingReplicaSet  Scaled down replica set nginx-deployment-2035384211 to 2
     22s        22s       1      {deployment-controller}                 Normal  ScalingReplicaSet  Scaled up replica set nginx-deployment-1564180365 to 2
     21s        21s       1      {deployment-controller}                 Normal  ScalingReplicaSet  Scaled down replica set nginx-deployment-2035384211 to 1
     21s        21s       1      {deployment-controller}                 Normal  ScalingReplicaSet  Scaled up replica set nginx-deployment-1564180365 to 3
     13s        13s       1      {deployment-controller}                 Normal  ScalingReplicaSet  Scaled down replica set nginx-deployment-2035384211 to 0
     13s        13s       1      {deployment-controller}                 Normal  ScalingReplicaSet  Scaled up replica set nginx-deployment-3066724191 to 1
To fix this, you need to roll back to a previous revision of the Deployment that is stable.
Checking Rollout History of a Deployment
Follow the steps given below to check the rollout history:
1. First, check the revisions of this Deployment:

   kubectl rollout history deployment/nginx-deployment

   The output is similar to this:

   deployments "nginx-deployment"
   REVISION    CHANGE-CAUSE
   1           kubectl apply --filename=https://k8s.io/examples/controllers/nginx-deployment.yaml
   2           kubectl set image deployment/nginx-deployment nginx=nginx:1.16.1
   3           kubectl set image deployment/nginx-deployment nginx=nginx:1.161

   CHANGE-CAUSE is copied from the Deployment annotation kubernetes.io/change-cause to its revisions upon creation. You can specify the CHANGE-CAUSE message by:

   - Annotating the Deployment with kubectl annotate deployment/nginx-deployment kubernetes.io/change-cause="image updated to 1.16.1"
   - Manually editing the manifest of the resource.

2. To see the details of each revision, run:

   kubectl rollout history deployment/nginx-deployment --revision=2

   The output is similar to this:

   deployments "nginx-deployment" revision 2
     Labels:       app=nginx
                   pod-template-hash=1159050644
     Annotations:  kubernetes.io/change-cause=kubectl set image deployment/nginx-deployment nginx=nginx:1.16.1
     Containers:
      nginx:
       Image:      nginx:1.16.1
       Port:       80/TCP
       QoS Tier:
           cpu:      BestEffort
           memory:   BestEffort
       Environment Variables:      <none>
     No volumes.
Rolling Back to a Previous Revision
Follow the steps given below to roll back the Deployment from the current version to the previous version, which is version 2.
1. Now you've decided to undo the current rollout and roll back to the previous revision:

   kubectl rollout undo deployment/nginx-deployment

   The output is similar to this:

   deployment.apps/nginx-deployment rolled back

   Alternatively, you can roll back to a specific revision by specifying it with --to-revision:

   kubectl rollout undo deployment/nginx-deployment --to-revision=2

   The output is similar to this:

   deployment.apps/nginx-deployment rolled back

   For more details about rollout related commands, read kubectl rollout.

   The Deployment is now rolled back to a previous stable revision. As you can see, a DeploymentRollback event for rolling back to revision 2 is generated from the Deployment controller.

2. To check if the rollback was successful and the Deployment is running as expected, run:

   kubectl get deployment nginx-deployment

   The output is similar to this:

   NAME               READY   UP-TO-DATE   AVAILABLE   AGE
   nginx-deployment   3/3     3            3           30m

3. Get the description of the Deployment:

   kubectl describe deployment nginx-deployment

   The output is similar to this:

   Name:                   nginx-deployment
   Namespace:              default
   CreationTimestamp:      Sun, 02 Sep 2018 18:17:55 -0500
   Labels:                 app=nginx
   Annotations:            deployment.kubernetes.io/revision=4
                           kubernetes.io/change-cause=kubectl set image deployment/nginx-deployment nginx=nginx:1.16.1
   Selector:               app=nginx
   Replicas:               3 desired | 3 updated | 3 total | 3 available | 0 unavailable
   StrategyType:           RollingUpdate
   MinReadySeconds:        0
   RollingUpdateStrategy:  25% max unavailable, 25% max surge
   Pod Template:
     Labels:  app=nginx
     Containers:
      nginx:
       Image:        nginx:1.16.1
       Port:         80/TCP
       Host Port:    0/TCP
       Environment:  <none>
       Mounts:       <none>
     Volumes:        <none>
   Conditions:
     Type           Status  Reason
     ----           ------  ------
     Available      True    MinimumReplicasAvailable
     Progressing    True    NewReplicaSetAvailable
   OldReplicaSets:  <none>
   NewReplicaSet:   nginx-deployment-c4747d96c (3/3 replicas created)
   Events:
     Type    Reason              Age   From                   Message
     ----    ------              ----  ----                   -------
     Normal  ScalingReplicaSet   12m   deployment-controller  Scaled up replica set nginx-deployment-75675f5897 to 3
     Normal  ScalingReplicaSet   11m   deployment-controller  Scaled up replica set nginx-deployment-c4747d96c to 1
     Normal  ScalingReplicaSet   11m   deployment-controller  Scaled down replica set nginx-deployment-75675f5897 to 2
     Normal  ScalingReplicaSet   11m   deployment-controller  Scaled up replica set nginx-deployment-c4747d96c to 2
     Normal  ScalingReplicaSet   11m   deployment-controller  Scaled down replica set nginx-deployment-75675f5897 to 1
     Normal  ScalingReplicaSet   11m   deployment-controller  Scaled up replica set nginx-deployment-c4747d96c to 3
     Normal  ScalingReplicaSet   11m   deployment-controller  Scaled down replica set nginx-deployment-75675f5897 to 0
     Normal  ScalingReplicaSet   11m   deployment-controller  Scaled up replica set nginx-deployment-595696685f to 1
     Normal  DeploymentRollback  15s   deployment-controller  Rolled back deployment "nginx-deployment" to revision 2
     Normal  ScalingReplicaSet   15s   deployment-controller  Scaled down replica set nginx-deployment-595696685f to 0
Scaling a Deployment
You can scale a Deployment by using the following command:
kubectl scale deployment/nginx-deployment --replicas=10
The output is similar to this:
deployment.apps/nginx-deployment scaled
Assuming horizontal Pod autoscaling is enabled in your cluster, you can set up an autoscaler for your Deployment and choose the minimum and maximum number of Pods you want to run based on the CPU utilization of your existing Pods.
kubectl autoscale deployment/nginx-deployment --min=10 --max=15 --cpu-percent=80
The output is similar to this:
deployment.apps/nginx-deployment scaled
Proportional scaling
RollingUpdate Deployments support running multiple versions of an application at the same time. When you or an autoscaler scales a RollingUpdate Deployment that is in the middle of a rollout (either in progress or paused), the Deployment controller balances the additional replicas in the existing active ReplicaSets (ReplicaSets with Pods) in order to mitigate risk. This is called proportional scaling.
For example, you are running a Deployment with 10 replicas, maxSurge=3, and maxUnavailable=2.
1. Ensure that the 10 replicas in your Deployment are running.

   kubectl get deploy

   The output is similar to this:

   NAME               DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
   nginx-deployment   10        10        10           10          50s

2. You update to a new image which happens to be unresolvable from inside the cluster.

   kubectl set image deployment/nginx-deployment nginx=nginx:sometag

   The output is similar to this:

   deployment.apps/nginx-deployment image updated

3. The image update starts a new rollout with ReplicaSet nginx-deployment-1989198191, but it's blocked due to the maxUnavailable requirement that you mentioned above. Check out the rollout status:

   kubectl get rs

   The output is similar to this:

   NAME                          DESIRED   CURRENT   READY   AGE
   nginx-deployment-1989198191   5         5         0       9s
   nginx-deployment-618515232    8         8         8       1m

4. Then a new scaling request for the Deployment comes along. The autoscaler increments the Deployment replicas to 15. The Deployment controller needs to decide where to add these new 5 replicas. If you weren't using proportional scaling, all 5 of them would be added to the new ReplicaSet. With proportional scaling, you spread the additional replicas across all ReplicaSets. Bigger proportions go to the ReplicaSets with the most replicas and lower proportions go to ReplicaSets with fewer replicas. Any leftovers are added to the ReplicaSet with the most replicas. ReplicaSets with zero replicas are not scaled up.
In our example above, 3 replicas are added to the old ReplicaSet and 2 replicas are added to the new ReplicaSet: the old ReplicaSet holds 8 of the 13 existing replicas, so it receives 8/13 of the 5 additional replicas, rounded, which is 3, and the new ReplicaSet receives the remaining 2. The rollout process should eventually move all replicas to the new ReplicaSet, assuming the new replicas become healthy. To confirm this, run:
kubectl get deploy
The output is similar to this:
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
nginx-deployment 15 18 7 8 7m
The rollout status confirms how the replicas were added to each ReplicaSet.
kubectl get rs
The output is similar to this:
NAME DESIRED CURRENT READY AGE
nginx-deployment-1989198191 7 7 0 7m
nginx-deployment-618515232 11 11 11 7m
Pausing and Resuming a rollout of a Deployment
When you update a Deployment, or plan to, you can pause rollouts for that Deployment before you trigger one or more updates. When you're ready to apply those changes, you resume rollouts for the Deployment. This approach allows you to apply multiple fixes in between pausing and resuming without triggering unnecessary rollouts.
1. For example, with a Deployment that was created, get the Deployment details:

   kubectl get deploy

   The output is similar to this:

   NAME      DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
   nginx     3         3         3            3           1m

   Get the rollout status:

   kubectl get rs

   The output is similar to this:

   NAME               DESIRED   CURRENT   READY     AGE
   nginx-2142116321   3         3         3         1m
2. Pause by running the following command:

   kubectl rollout pause deployment/nginx-deployment

   The output is similar to this:

   deployment.apps/nginx-deployment paused

3. Then update the image of the Deployment:

   kubectl set image deployment/nginx-deployment nginx=nginx:1.16.1

   The output is similar to this:

   deployment.apps/nginx-deployment image updated

4. Notice that no new rollout started:

   kubectl rollout history deployment/nginx-deployment

   The output is similar to this:

   deployments "nginx"
   REVISION  CHANGE-CAUSE
   1         <none>

5. Get the rollout status to verify that the existing ReplicaSet has not changed:

   kubectl get rs

   The output is similar to this:

   NAME               DESIRED   CURRENT   READY     AGE
   nginx-2142116321   3         3         3         2m

6. You can make as many updates as you wish, for example, update the resources that will be used:

   kubectl set resources deployment/nginx-deployment -c=nginx --limits=cpu=200m,memory=512Mi

   The output is similar to this:

   deployment.apps/nginx-deployment resource requirements updated

   The initial state of the Deployment prior to pausing its rollout will continue its function, but new updates to the Deployment will not have any effect as long as the Deployment rollout is paused.
7. Eventually, resume the Deployment rollout and observe a new ReplicaSet coming up with all the new updates:

   kubectl rollout resume deployment/nginx-deployment

   The output is similar to this:

   deployment.apps/nginx-deployment resumed

8. Watch the status of the rollout until it's done.

   kubectl get rs -w

   The output is similar to this:

   NAME               DESIRED   CURRENT   READY     AGE
   nginx-2142116321   2         2         2         2m
   nginx-3926361531   2         2         0         6s
   nginx-3926361531   2         2         1         18s
   nginx-2142116321   1         2         2         2m
   nginx-2142116321   1         2         2         2m
   nginx-3926361531   3         2         1         18s
   nginx-3926361531   3         2         1         18s
   nginx-2142116321   1         1         1         2m
   nginx-3926361531   3         3         1         18s
   nginx-3926361531   3         3         2         19s
   nginx-2142116321   0         1         1         2m
   nginx-2142116321   0         1         1         2m
   nginx-2142116321   0         0         0         2m
   nginx-3926361531   3         3         3         20s

9. Get the status of the latest rollout:

   kubectl get rs

   The output is similar to this:

   NAME               DESIRED   CURRENT   READY     AGE
   nginx-2142116321   0         0         0         2m
   nginx-3926361531   3         3         3         28s
Deployment status
A Deployment enters various states during its lifecycle. It can be progressing while rolling out a new ReplicaSet, it can be complete, or it can fail to progress.
Progressing Deployment
Kubernetes marks a Deployment as progressing when one of the following tasks is performed:
- The Deployment creates a new ReplicaSet.
- The Deployment is scaling up its newest ReplicaSet.
- The Deployment is scaling down its older ReplicaSet(s).
- New Pods become ready or available (ready for at least MinReadySeconds).
When the rollout becomes “progressing”, the Deployment controller adds a condition with the following attributes to the Deployment's .status.conditions:

- type: Progressing
- status: "True"
- reason: NewReplicaSetCreated | reason: FoundNewReplicaSet | reason: ReplicaSetUpdated

You can monitor the progress for a Deployment by using kubectl rollout status.
Complete Deployment
Kubernetes marks a Deployment as complete when it has the following characteristics:
- All of the replicas associated with the Deployment have been updated to the latest version you've specified, meaning any updates you've requested have been completed.
- All of the replicas associated with the Deployment are available.
- No old replicas for the Deployment are running.
When the rollout becomes “complete”, the Deployment controller sets a condition with the following attributes to the Deployment's .status.conditions:

- type: Progressing
- status: "True"
- reason: NewReplicaSetAvailable

This Progressing condition will retain a status value of "True" until a new rollout is initiated. The condition holds even when availability of replicas changes (which does instead affect the Available condition).

You can check if a Deployment has completed by using kubectl rollout status. If the rollout completed successfully, kubectl rollout status returns a zero exit code.
kubectl rollout status deployment/nginx-deployment
The output is similar to this:
Waiting for rollout to finish: 2 of 3 updated replicas are available...
deployment "nginx-deployment" successfully rolled out
and the exit status from kubectl rollout is 0 (success):
echo $?
0
Failed Deployment
Your Deployment may get stuck trying to deploy its newest ReplicaSet without ever completing. This can occur due to some of the following factors:
- Insufficient quota
- Readiness probe failures
- Image pull errors
- Insufficient permissions
- Limit ranges
- Application runtime misconfiguration
One way you can detect this condition is to specify a deadline parameter in your Deployment spec: .spec.progressDeadlineSeconds. .spec.progressDeadlineSeconds denotes the number of seconds the Deployment controller waits before indicating (in the Deployment status) that the Deployment progress has stalled.

The following kubectl command sets the spec with progressDeadlineSeconds to make the controller report lack of progress of a rollout for a Deployment after 10 minutes:

kubectl patch deployment/nginx-deployment -p '{"spec":{"progressDeadlineSeconds":600}}'

The output is similar to this:

deployment.apps/nginx-deployment patched

Once the deadline has been exceeded, the Deployment controller adds a DeploymentCondition with the following attributes to the Deployment's .status.conditions:

- type: Progressing
- status: "False"
- reason: ProgressDeadlineExceeded
This condition can also fail early and is then set to a status value of "False" due to reasons such as ReplicaSetCreateError. Also, the deadline is not taken into account anymore once the Deployment rollout completes.

See the Kubernetes API conventions for more information on status conditions.

Note: Kubernetes takes no action on a stalled Deployment other than to report a status condition with reason: ProgressDeadlineExceeded. Higher-level orchestrators can take advantage of it and act accordingly, for example, roll back the Deployment to its previous version.
You may experience transient errors with your Deployments, either due to a low timeout that you have set or due to any other kind of error that can be treated as transient. For example, let's suppose you have insufficient quota. If you describe the Deployment you will notice the following section:
kubectl describe deployment nginx-deployment
The output is similar to this:
<...>
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing True ReplicaSetUpdated
ReplicaFailure True FailedCreate
<...>
If you run kubectl get deployment nginx-deployment -o yaml, the Deployment status is similar to this:
status:
availableReplicas: 2
conditions:
- lastTransitionTime: 2016-10-04T12:25:39Z
lastUpdateTime: 2016-10-04T12:25:39Z
message: Replica set "nginx-deployment-4262182780" is progressing.
reason: ReplicaSetUpdated
status: "True"
type: Progressing
- lastTransitionTime: 2016-10-04T12:25:42Z
lastUpdateTime: 2016-10-04T12:25:42Z
message: Deployment has minimum availability.
reason: MinimumReplicasAvailable
status: "True"
type: Available
- lastTransitionTime: 2016-10-04T12:25:39Z
lastUpdateTime: 2016-10-04T12:25:39Z
message: 'Error creating: pods "nginx-deployment-4262182780-" is forbidden: exceeded quota:
object-counts, requested: pods=1, used: pods=3, limited: pods=2'
reason: FailedCreate
status: "True"
type: ReplicaFailure
observedGeneration: 3
replicas: 2
unavailableReplicas: 2
Eventually, once the Deployment progress deadline is exceeded, Kubernetes updates the status and the reason for the Progressing condition:
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing False ProgressDeadlineExceeded
ReplicaFailure True FailedCreate
You can address an issue of insufficient quota by scaling down your Deployment, by scaling down other controllers you may be running, or by increasing the quota in your namespace. If you satisfy the quota conditions and the Deployment controller then completes the Deployment rollout, you'll see the Deployment's status update with a successful condition (status: "True" and reason: NewReplicaSetAvailable).
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing True NewReplicaSetAvailable
Note: type: Available with status: "True" means that your Deployment has minimum availability. Minimum availability is dictated by the parameters specified in the deployment strategy. type: Progressing with status: "True" means that your Deployment is either in the middle of a rollout and it is progressing, or that it has successfully completed its progress and the minimum required new replicas are available (see the Reason of the condition for the particulars - in our case reason: NewReplicaSetAvailable means that the Deployment is complete).
You can check if a Deployment has failed to progress by using kubectl rollout status. kubectl rollout status returns a non-zero exit code if the Deployment has exceeded the progression deadline.
kubectl rollout status deployment/nginx-deployment
The output is similar to this:
Waiting for rollout to finish: 2 out of 3 new replicas have been updated...
error: deployment "nginx" exceeded its progress deadline
and the exit status from kubectl rollout is 1 (indicating an error):
echo $?
1
Operating on a failed deployment
All actions that apply to a complete Deployment also apply to a failed Deployment. You can scale it up/down, roll back to a previous revision, or even pause it if you need to apply multiple tweaks in the Deployment Pod template.
Clean up Policy
You can set the .spec.revisionHistoryLimit field in a Deployment to specify how many old ReplicaSets for this Deployment you want to retain. The rest will be garbage-collected in the background. By default, it is 10.
Canary Deployment
If you want to roll out releases to a subset of users or servers using the Deployment, you can create multiple Deployments, one for each release, following the canary pattern described in managing resources.
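As a rough sketch of that pattern (all names, labels, and images here are hypothetical, not part of the documented example): two Deployments share an app label but differ on a track label, so a Service selecting only app: my-app would send a fraction of traffic to the canary.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app-stable            # hypothetical
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
      track: stable
  template:
    metadata:
      labels:
        app: my-app
        track: stable
    spec:
      containers:
      - name: my-app
        image: registry.example/my-app:1.0   # hypothetical image
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app-canary            # hypothetical
spec:
  replicas: 1                    # a small replica count limits canary exposure
  selector:
    matchLabels:
      app: my-app
      track: canary
  template:
    metadata:
      labels:
        app: my-app
        track: canary
    spec:
      containers:
      - name: my-app
        image: registry.example/my-app:2.0   # hypothetical image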
Writing a Deployment Spec
As with all other Kubernetes configs, a Deployment needs .apiVersion, .kind, and .metadata fields. For general information about working with config files, see the deploying applications, configuring containers, and using kubectl to manage resources documents.

The name of a Deployment object must be a valid DNS subdomain name.

A Deployment also needs a .spec section.
Pod Template
The .spec.template and .spec.selector are the only required fields of the .spec.

The .spec.template is a Pod template. It has exactly the same schema as a Pod, except it is nested and does not have an apiVersion or kind.

In addition to required fields for a Pod, a Pod template in a Deployment must specify appropriate labels and an appropriate restart policy. For labels, make sure not to overlap with other controllers. See selector.

Only a .spec.template.spec.restartPolicy equal to Always is allowed, which is the default if not specified.
Replicas
.spec.replicas is an optional field that specifies the number of desired Pods. It defaults to 1.

Should you manually scale a Deployment, for example via kubectl scale deployment/nginx-deployment --replicas=X, and then update that Deployment based on a manifest (for example: by running kubectl apply -f deployment.yaml), then applying that manifest overwrites the manual scaling that you previously did.

If a HorizontalPodAutoscaler (or any similar API for horizontal scaling) is managing scaling for a Deployment, don't set .spec.replicas. Instead, allow the Kubernetes control plane to manage the .spec.replicas field automatically.
Selector
.spec.selector is a required field that specifies a label selector for the Pods targeted by this Deployment.

.spec.selector must match .spec.template.metadata.labels, or it will be rejected by the API.

In API version apps/v1, .spec.selector and .metadata.labels do not default to .spec.template.metadata.labels if not set. So they must be set explicitly. Also note that .spec.selector is immutable after creation of the Deployment in apps/v1.

A Deployment may terminate Pods whose labels match the selector if their template is different from .spec.template or if the total number of such Pods exceeds .spec.replicas. It brings up new Pods with .spec.template if the number of Pods is less than the desired number.
If you have multiple controllers that have overlapping selectors, the controllers will fight with each other and won't behave correctly.
Strategy
.spec.strategy specifies the strategy used to replace old Pods by new ones. .spec.strategy.type can be "Recreate" or "RollingUpdate". "RollingUpdate" is the default value.
Recreate Deployment
All existing Pods are killed before new ones are created when .spec.strategy.type==Recreate.
Rolling Update Deployment
The Deployment updates Pods in a rolling update fashion when .spec.strategy.type==RollingUpdate. You can specify maxUnavailable and maxSurge to control the rolling update process.
Max Unavailable
.spec.strategy.rollingUpdate.maxUnavailable is an optional field that specifies the maximum number of Pods that can be unavailable during the update process. The value can be an absolute number (for example, 5) or a percentage of desired Pods (for example, 10%). The absolute number is calculated from the percentage by rounding down. The value cannot be 0 if .spec.strategy.rollingUpdate.maxSurge is 0. The default value is 25%.
For example, when this value is set to 30%, the old ReplicaSet can be scaled down to 70% of desired Pods immediately when the rolling update starts. Once new Pods are ready, old ReplicaSet can be scaled down further, followed by scaling up the new ReplicaSet, ensuring that the total number of Pods available at all times during the update is at least 70% of the desired Pods.
Max Surge
.spec.strategy.rollingUpdate.maxSurge is an optional field that specifies the maximum number of Pods that can be created over the desired number of Pods. The value can be an absolute number (for example, 5) or a percentage of desired Pods (for example, 10%). The value cannot be 0 if MaxUnavailable is 0. The absolute number is calculated from the percentage by rounding up. The default value is 25%.
For example, when this value is set to 30%, the new ReplicaSet can be scaled up immediately when the rolling update starts, such that the total number of old and new Pods does not exceed 130% of desired Pods. Once old Pods have been killed, the new ReplicaSet can be scaled up further, ensuring that the total number of Pods running at any time during the update is at most 130% of desired Pods.
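Putting the two fields together, here is a sketch of the relevant Deployment spec fragment (the values mirror the proportional scaling example earlier: 10 replicas, maxSurge=3, maxUnavailable=2; selector and template are omitted for brevity):

spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 2   # at most 2 Pods below the desired count during the update
      maxSurge: 3         # at most 3 Pods above the desired count during the update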
Progress Deadline Seconds
.spec.progressDeadlineSeconds is an optional field that specifies the number of seconds you want to wait for your Deployment to progress before the system reports back that the Deployment has failed progressing - surfaced as a condition with type: Progressing, status: "False", and reason: ProgressDeadlineExceeded in the status of the resource. The Deployment controller will keep retrying the Deployment. This defaults to 600. In the future, once automatic rollback is implemented, the Deployment controller will roll back a Deployment as soon as it observes such a condition.

If specified, this field needs to be greater than .spec.minReadySeconds.
Min Ready Seconds
.spec.minReadySeconds is an optional field that specifies the minimum number of seconds for which a newly created Pod should be ready without any of its containers crashing, for it to be considered available. This defaults to 0 (the Pod will be considered available as soon as it is ready). To learn more about when a Pod is considered ready, see Container Probes.
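A sketch combining this field with the deadline described above (the values are illustrative; note that .spec.progressDeadlineSeconds must be greater than .spec.minReadySeconds, as stated earlier):

spec:
  minReadySeconds: 10            # illustrative value
  progressDeadlineSeconds: 600   # must exceed minReadySeconds; 600 is the default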
Revision History Limit
A Deployment's revision history is stored in the ReplicaSets it controls.

.spec.revisionHistoryLimit is an optional field that specifies the number of old ReplicaSets to retain to allow rollback. These old ReplicaSets consume resources in etcd and crowd the output of kubectl get rs. The configuration of each Deployment revision is stored in its ReplicaSets; therefore, once an old ReplicaSet is deleted, you lose the ability to roll back to that revision of the Deployment. By default, 10 old ReplicaSets will be kept; however, the ideal value depends on the frequency and stability of new Deployments.

More specifically, setting this field to zero means that all old ReplicaSets with 0 replicas will be cleaned up. In this case, a new Deployment rollout cannot be undone, since its revision history is cleaned up.
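For example, a sketch of the field in a Deployment spec (5 is an arbitrary illustrative value):

spec:
  revisionHistoryLimit: 5   # keep only the 5 most recent old ReplicaSets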
Paused
.spec.paused is an optional boolean field for pausing and resuming a Deployment. The only difference between a paused Deployment and one that is not paused is that any changes into the PodTemplateSpec of the paused Deployment will not trigger new rollouts as long as it is paused. A Deployment is not paused by default when it is created.
What's next
- Learn about Pods.
- Run a Stateless Application Using a Deployment.
- Deployment is a top-level resource in the Kubernetes REST API. Read the Deployment object definition to understand the API for deployments.
- Read about PodDisruptionBudget and how you can use it to manage application availability during disruptions.
3.4.2.2 - ReplicaSet
A ReplicaSet's purpose is to maintain a stable set of replica Pods running at any given time. As such, it is often used to guarantee the availability of a specified number of identical Pods.
How a ReplicaSet works
A ReplicaSet is defined with fields, including a selector that specifies how to identify Pods it can acquire, a number of replicas indicating how many Pods it should be maintaining, and a pod template specifying the data of new Pods it should create to meet the number of replicas criteria. A ReplicaSet then fulfills its purpose by creating and deleting Pods as needed to reach the desired number. When a ReplicaSet needs to create new Pods, it uses its Pod template.
A ReplicaSet is linked to its Pods via the Pods' metadata.ownerReferences field, which specifies what resource the current object is owned by. All Pods acquired by a ReplicaSet have their owning ReplicaSet's identifying information within their ownerReferences field. It's through this link that the ReplicaSet knows of the state of the Pods it is maintaining and plans accordingly.
A ReplicaSet identifies new Pods to acquire by using its selector. If there is a Pod that has no OwnerReference or the OwnerReference is not a Controller and it matches a ReplicaSet's selector, it will be immediately acquired by said ReplicaSet.
When to use a ReplicaSet
A ReplicaSet ensures that a specified number of pod replicas are running at any given time. However, a Deployment is a higher-level concept that manages ReplicaSets and provides declarative updates to Pods along with a lot of other useful features. Therefore, we recommend using Deployments instead of directly using ReplicaSets, unless you require custom update orchestration or don't require updates at all.
This actually means that you may never need to manipulate ReplicaSet objects: use a Deployment instead, and define your application in the spec section.
Example
apiVersion: apps/v1
kind: ReplicaSet
metadata:
name: frontend
labels:
app: guestbook
tier: frontend
spec:
# modify replicas according to your case
replicas: 3
selector:
matchLabels:
tier: frontend
template:
metadata:
labels:
tier: frontend
spec:
containers:
- name: php-redis
image: gcr.io/google_samples/gb-frontend:v3
Saving this manifest into frontend.yaml and submitting it to a Kubernetes cluster will create the defined ReplicaSet and the Pods that it manages.
kubectl apply -f https://kubernetes.io/examples/controllers/frontend.yaml
You can then get the current ReplicaSets deployed:
kubectl get rs
And see the frontend one you created:
NAME DESIRED CURRENT READY AGE
frontend 3 3 3 6s
You can also check on the state of the ReplicaSet:
kubectl describe rs/frontend
And you will see output similar to:
Name: frontend
Namespace: default
Selector: tier=frontend
Labels: app=guestbook
tier=frontend
Annotations: kubectl.kubernetes.io/last-applied-configuration:
{"apiVersion":"apps/v1","kind":"ReplicaSet","metadata":{"annotations":{},"labels":{"app":"guestbook","tier":"frontend"},"name":"frontend",...
Replicas: 3 current / 3 desired
Pods Status: 3 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
Labels: tier=frontend
Containers:
php-redis:
Image: gcr.io/google_samples/gb-frontend:v3
Port: <none>
Host Port: <none>
Environment: <none>
Mounts: <none>
Volumes: <none>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreate 117s replicaset-controller Created pod: frontend-wtsmm
Normal SuccessfulCreate 116s replicaset-controller Created pod: frontend-b2zdv
Normal SuccessfulCreate 116s replicaset-controller Created pod: frontend-vcmts
And lastly you can check for the Pods brought up:
kubectl get pods
You should see Pod information similar to:
NAME READY STATUS RESTARTS AGE
frontend-b2zdv 1/1 Running 0 6m36s
frontend-vcmts 1/1 Running 0 6m36s
frontend-wtsmm 1/1 Running 0 6m36s
You can also verify that the owner reference of these pods is set to the frontend ReplicaSet. To do this, get the yaml of one of the Pods running:
kubectl get pods frontend-b2zdv -o yaml
The output will look similar to this, with the frontend ReplicaSet's info set in the metadata's ownerReferences field:
apiVersion: v1
kind: Pod
metadata:
creationTimestamp: "2020-02-12T07:06:16Z"
generateName: frontend-
labels:
tier: frontend
name: frontend-b2zdv
namespace: default
ownerReferences:
- apiVersion: apps/v1
blockOwnerDeletion: true
controller: true
kind: ReplicaSet
name: frontend
uid: f391f6db-bb9b-4c09-ae74-6a1f77f3d5cf
...
Non-Template Pod acquisitions
While you can create bare Pods with no problems, it is strongly recommended to make sure that the bare Pods do not have labels which match the selector of one of your ReplicaSets. The reason for this is that a ReplicaSet is not limited to owning Pods specified by its template -- it can acquire other Pods in the manner specified in the previous sections.
Take the previous frontend ReplicaSet example, and the Pods specified in the following manifest:
apiVersion: v1
kind: Pod
metadata:
name: pod1
labels:
tier: frontend
spec:
containers:
- name: hello1
image: gcr.io/google-samples/hello-app:2.0
---
apiVersion: v1
kind: Pod
metadata:
name: pod2
labels:
tier: frontend
spec:
containers:
- name: hello2
image: gcr.io/google-samples/hello-app:1.0
As those Pods do not have a Controller (or any object) as their owner reference and match the selector of the frontend ReplicaSet, they will immediately be acquired by it.
Suppose you create the Pods after the frontend ReplicaSet has been deployed and has set up its initial Pod replicas to fulfill its replica count requirement:
kubectl apply -f https://kubernetes.io/examples/pods/pod-rs.yaml
The new Pods will be acquired by the ReplicaSet, and then immediately terminated as the ReplicaSet would be over its desired count.
Fetching the Pods:
kubectl get pods
The output shows that the new Pods are either already terminated, or in the process of being terminated:
NAME READY STATUS RESTARTS AGE
frontend-b2zdv 1/1 Running 0 10m
frontend-vcmts 1/1 Running 0 10m
frontend-wtsmm 1/1 Running 0 10m
pod1 0/1 Terminating 0 1s
pod2 0/1 Terminating 0 1s
If you create the Pods first:
kubectl apply -f https://kubernetes.io/examples/pods/pod-rs.yaml
And then create the ReplicaSet however:
kubectl apply -f https://kubernetes.io/examples/controllers/frontend.yaml
You will see that the ReplicaSet has acquired the Pods, and has created new ones only until the total of new and original Pods matches its desired count. Fetching the Pods:
kubectl get pods
The output shows:
NAME READY STATUS RESTARTS AGE
frontend-hmmj2 1/1 Running 0 9s
pod1 1/1 Running 0 36s
pod2 1/1 Running 0 36s
In this manner, a ReplicaSet can own a non-homogeneous set of Pods.
Writing a ReplicaSet manifest
As with all other Kubernetes API objects, a ReplicaSet needs the apiVersion, kind, and metadata fields.
For ReplicaSets, the kind is always ReplicaSet.
In Kubernetes 1.9 the API version apps/v1 on the ReplicaSet kind is the current version and is enabled by default. The API version apps/v1beta2 is deprecated.
Refer to the first lines of the frontend.yaml example for guidance.
The name of a ReplicaSet object must be a valid DNS subdomain name.
A ReplicaSet also needs a .spec section.
Pod Template
The .spec.template is a pod template which is also required to have labels in place. In our frontend.yaml example we had one label: tier: frontend.
Be careful not to overlap with the selectors of other controllers, lest they try to adopt this Pod.
For the template's restart policy field, .spec.template.spec.restartPolicy, the only allowed value is Always, which is the default.
Pod Selector
The .spec.selector field is a label selector. As discussed earlier, these are the labels used to identify potential Pods to acquire. In our frontend.yaml example, the selector was:
matchLabels:
tier: frontend
In the ReplicaSet, .spec.template.metadata.labels must match spec.selector, or it will be rejected by the API.
For two ReplicaSets specifying the same .spec.selector but different .spec.template.metadata.labels and .spec.template.spec fields, each ReplicaSet ignores the Pods created by the other ReplicaSet.
Replicas
You can specify how many Pods should run concurrently by setting .spec.replicas. The ReplicaSet will create/delete its Pods to match this number.
If you do not specify .spec.replicas, then it defaults to 1.
Working with ReplicaSets
Deleting a ReplicaSet and its Pods
To delete a ReplicaSet and all of its Pods, use kubectl delete. The garbage collector automatically deletes all of the dependent Pods by default.
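For example, the following removes the frontend ReplicaSet from the earlier example together with its Pods (assuming it is in the default namespace):
kubectl delete rs frontend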
When using the REST API or the client-go library, you must set propagationPolicy to Background or Foreground in the -d option.
For example:
kubectl proxy --port=8080
curl -X DELETE 'localhost:8080/apis/apps/v1/namespaces/default/replicasets/frontend' \
> -d '{"kind":"DeleteOptions","apiVersion":"v1","propagationPolicy":"Foreground"}' \
> -H "Content-Type: application/json"
Deleting just a ReplicaSet
You can delete a ReplicaSet without affecting any of its Pods using kubectl delete with the --cascade=orphan option.
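For example, for the frontend ReplicaSet from earlier:
kubectl delete rs frontend --cascade=orphan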
When using the REST API or the client-go library, you must set propagationPolicy to Orphan.
For example:
kubectl proxy --port=8080
curl -X DELETE 'localhost:8080/apis/apps/v1/namespaces/default/replicasets/frontend' \
> -d '{"kind":"DeleteOptions","apiVersion":"v1","propagationPolicy":"Orphan"}' \
> -H "Content-Type: application/json"
Once the original is deleted, you can create a new ReplicaSet to replace it. As long as the old and new .spec.selector are the same, then the new one will adopt the old Pods. However, it will not make any effort to make existing Pods match a new, different pod template. To update Pods to a new spec in a controlled way, use a Deployment, as ReplicaSets do not support a rolling update directly.
Isolating Pods from a ReplicaSet
You can remove Pods from a ReplicaSet by changing their labels. This technique may be used to remove Pods from service for debugging, data recovery, etc. Pods that are removed in this way will be replaced automatically (assuming that the number of replicas is not also changed).
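For example, a sketch of pulling one of the Pods from the earlier example out of the frontend ReplicaSet; the Pod name and the tier=debug replacement label are only illustrative:
kubectl label pod frontend-b2zdv tier=debug --overwrite
The Pod no longer matches the tier=frontend selector, so the ReplicaSet creates a replacement while the relabeled Pod keeps running for inspection.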
Scaling a ReplicaSet
A ReplicaSet can be easily scaled up or down by simply updating the .spec.replicas field. The ReplicaSet controller ensures that a desired number of Pods with a matching label selector are available and operational.
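For example, a sketch of scaling the frontend ReplicaSet from earlier to 5 replicas without editing its manifest:
kubectl scale rs frontend --replicas=5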
When scaling down, the ReplicaSet controller chooses which Pods to delete by ranking the available Pods, based on the following general algorithm:
- Pending (and unschedulable) pods are scaled down first
- If the controller.kubernetes.io/pod-deletion-cost annotation is set, then the pod with the lower value will come first.
- Pods on nodes with more replicas come before pods on nodes with fewer replicas.
- If the pods' creation times differ, the pod that was created more recently comes before the older pod (the creation times are bucketed on an integer log scale when the LogarithmicScaleDown feature gate is enabled).
If all of the above match, then selection is random.
Pod deletion cost
Kubernetes v1.22 [beta]
Using the controller.kubernetes.io/pod-deletion-cost annotation, users can set a preference regarding which pods to remove first when downscaling a ReplicaSet.
The annotation should be set on the pod, the range is [-2147483647, 2147483647]. It represents the cost of deleting a pod compared to other pods belonging to the same ReplicaSet. Pods with lower deletion cost are preferred to be deleted before pods with higher deletion cost.
The implicit value for this annotation for pods that don't set it is 0; negative values are permitted. Invalid values will be rejected by the API server.
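For example, a sketch of marking one Pod from the earlier example as the preferred deletion candidate on the next scale-down; the Pod name and the cost value are only illustrative:
kubectl annotate pod frontend-b2zdv controller.kubernetes.io/pod-deletion-cost=-1000 --overwrite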
This feature is beta and enabled by default. You can disable it using the feature gate PodDeletionCost in both kube-apiserver and kube-controller-manager.
- This is honored on a best-effort basis, so it does not offer any guarantees on pod deletion order.
- Users should avoid updating the annotation frequently, such as updating it based on a metric value, because doing so will generate a significant number of pod updates on the apiserver.
Example Use Case
The different pods of an application could have different utilization levels. On scale down, the application
may prefer to remove the pods with lower utilization. To avoid frequently updating the pods, the application
should update controller.kubernetes.io/pod-deletion-cost
once before issuing a scale down (setting the
annotation to a value proportional to pod utilization level). This works if the application itself controls
the down scaling; for example, the driver pod of a Spark deployment.
ReplicaSet as a Horizontal Pod Autoscaler Target
A ReplicaSet can also be a target for Horizontal Pod Autoscalers (HPA). That is, a ReplicaSet can be auto-scaled by an HPA. Here is an example HPA targeting the ReplicaSet we created in the previous example.
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
name: frontend-scaler
spec:
scaleTargetRef:
kind: ReplicaSet
name: frontend
minReplicas: 3
maxReplicas: 10
targetCPUUtilizationPercentage: 50
Saving this manifest into hpa-rs.yaml and submitting it to a Kubernetes cluster should create the defined HPA that autoscales the target ReplicaSet depending on the CPU usage of the replicated Pods.
kubectl apply -f https://k8s.io/examples/controllers/hpa-rs.yaml
Alternatively, you can use the kubectl autoscale command to accomplish the same (and it's easier!):
kubectl autoscale rs frontend --max=10 --min=3 --cpu-percent=50
Alternatives to ReplicaSet
Deployment (recommended)
Deployment is an object which can own ReplicaSets and update them and their Pods via declarative, server-side rolling updates.
While ReplicaSets can be used independently, today they're mainly used by Deployments as a mechanism to orchestrate Pod
creation, deletion and updates. When you use Deployments you don't have to worry about managing the ReplicaSets that
they create. Deployments own and manage their ReplicaSets.
As such, it is recommended to use Deployments when you want ReplicaSets.
Bare Pods
Unlike the case where a user directly created Pods, a ReplicaSet replaces Pods that are deleted or terminated for any reason, such as node failure or disruptive node maintenance (for example, a kernel upgrade). For this reason, we recommend that you use a ReplicaSet even if your application requires only a single Pod. Think of it similarly to a process supervisor, only it supervises multiple Pods across multiple nodes instead of individual processes on a single node. A ReplicaSet delegates local container restarts to some agent on the node (for example, Kubelet or Docker).
Job
Use a Job instead of a ReplicaSet for Pods that are expected to terminate on their own (that is, batch jobs).
DaemonSet
Use a DaemonSet instead of a ReplicaSet for Pods that provide a machine-level function, such as machine monitoring or machine logging. These Pods have a lifetime that is tied to a machine lifetime: the Pod needs to be running on the machine before other Pods start, and it is safe to terminate when the machine is otherwise ready to be rebooted or shut down.
ReplicationController
ReplicaSets are the successors to ReplicationControllers. The two serve the same purpose, and behave similarly, except that a ReplicationController does not support set-based selector requirements as described in the labels user guide. As such, ReplicaSets are preferred over ReplicationControllers.
What's next
- Learn about Pods.
- Learn about Deployments.
- Run a Stateless Application Using a Deployment, which relies on ReplicaSets to work.
- ReplicaSet is a top-level resource in the Kubernetes REST API. Read the ReplicaSet object definition to understand the API for replica sets.
- Read about PodDisruptionBudget and how you can use it to manage application availability during disruptions.
3.4.2.3 - StatefulSets
StatefulSet is the workload API object used to manage stateful applications.
Manages the deployment and scaling of a set of Pods, and provides guarantees about the ordering and uniqueness of these Pods.
Like a Deployment, a StatefulSet manages Pods that are based on an identical container spec. Unlike a Deployment, a StatefulSet maintains a sticky identity for each of its Pods. These Pods are created from the same spec, but are not interchangeable: each has a persistent identifier that it maintains across any rescheduling.
If you want to use storage volumes to provide persistence for your workload, you can use a StatefulSet as part of the solution. Although individual Pods in a StatefulSet are susceptible to failure, the persistent Pod identifiers make it easier to match existing volumes to the new Pods that replace any that have failed.
Using StatefulSets
StatefulSets are valuable for applications that require one or more of the following:
- Stable, unique network identifiers.
- Stable, persistent storage.
- Ordered, graceful deployment and scaling.
- Ordered, automated rolling updates.
In the above, stable is synonymous with persistence across Pod (re)scheduling. If an application doesn't require any stable identifiers or ordered deployment, deletion, or scaling, you should deploy your application using a workload object that provides a set of stateless replicas. Deployment or ReplicaSet may be better suited to your stateless needs.
Limitations
- The storage for a given Pod must either be provisioned by a PersistentVolume Provisioner based on the requested storage class, or pre-provisioned by an admin.
- Deleting and/or scaling a StatefulSet down will not delete the volumes associated with the StatefulSet. This is done to ensure data safety, which is generally more valuable than an automatic purge of all related StatefulSet resources.
- StatefulSets currently require a Headless Service to be responsible for the network identity of the Pods. You are responsible for creating this Service.
- StatefulSets do not provide any guarantees on the termination of pods when a StatefulSet is deleted. To achieve ordered and graceful termination of the pods in the StatefulSet, it is possible to scale the StatefulSet down to 0 prior to deletion.
- When using Rolling Updates with the default Pod Management Policy (OrderedReady), it's possible to get into a broken state that requires manual intervention to repair.
Components
The example below demonstrates the components of a StatefulSet.
apiVersion: v1
kind: Service
metadata:
name: nginx
labels:
app: nginx
spec:
ports:
- port: 80
name: web
clusterIP: None
selector:
app: nginx
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: web
spec:
selector:
matchLabels:
app: nginx # has to match .spec.template.metadata.labels
serviceName: "nginx"
replicas: 3 # by default is 1
minReadySeconds: 10 # by default is 0
template:
metadata:
labels:
app: nginx # has to match .spec.selector.matchLabels
spec:
terminationGracePeriodSeconds: 10
containers:
- name: nginx
image: k8s.gcr.io/nginx-slim:0.8
ports:
- containerPort: 80
name: web
volumeMounts:
- name: www
mountPath: /usr/share/nginx/html
volumeClaimTemplates:
- metadata:
name: www
spec:
accessModes: [ "ReadWriteOnce" ]
storageClassName: "my-storage-class"
resources:
requests:
storage: 1Gi
In the above example:
- A Headless Service, named nginx, is used to control the network domain.
- The StatefulSet, named web, has a Spec that indicates that 3 replicas of the nginx container will be launched in unique Pods.
- The volumeClaimTemplates will provide stable storage using PersistentVolumes provisioned by a PersistentVolume Provisioner.
The name of a StatefulSet object must be a valid DNS subdomain name.
Pod Selector
You must set the .spec.selector field of a StatefulSet to match the labels of its .spec.template.metadata.labels. In 1.8 and later versions, failing to specify a matching Pod Selector will result in a validation error during StatefulSet creation.
Volume Claim Templates
You can set the .spec.volumeClaimTemplates which can provide stable storage using PersistentVolumes provisioned by a PersistentVolume Provisioner.
Minimum ready seconds
Kubernetes v1.23 [beta]
.spec.minReadySeconds is an optional field that specifies the minimum number of seconds for which a newly created Pod should be ready, without any of its containers crashing, for it to be considered available. Note that this feature is beta and enabled by default; if you do not want it, opt out by disabling the StatefulSetMinReadySeconds feature gate. This field defaults to 0 (the Pod will be considered available as soon as it is ready). To learn more about when a Pod is considered ready, see Container Probes.
Pod Identity
StatefulSet Pods have a unique identity that is comprised of an ordinal, a stable network identity, and stable storage. The identity sticks to the Pod, regardless of which node it's (re)scheduled on.
Ordinal Index
For a StatefulSet with N replicas, each Pod in the StatefulSet will be assigned an integer ordinal, from 0 up through N-1, that is unique over the Set.
Stable Network ID
Each Pod in a StatefulSet derives its hostname from the name of the StatefulSet and the ordinal of the Pod. The pattern for the constructed hostname is $(statefulset name)-$(ordinal). The example above will create three Pods named web-0, web-1, web-2.
A StatefulSet can use a Headless Service to control the domain of its Pods. The domain managed by this Service takes the form: $(service name).$(namespace).svc.cluster.local, where "cluster.local" is the cluster domain.
As each Pod is created, it gets a matching DNS subdomain, taking the form: $(podname).$(governing service domain), where the governing service is defined by the serviceName field on the StatefulSet.
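For example, with the web StatefulSet above you could check a Pod's DNS record from a temporary Pod (a sketch; assumes the default namespace and the default cluster.local cluster domain):
kubectl run -i --rm dns-test --image=busybox:1.28 --restart=Never -- nslookup web-0.nginx.default.svc.cluster.local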
Depending on how DNS is configured in your cluster, you may not be able to look up the DNS name for a newly-run Pod immediately. This behavior can occur when other clients in the cluster have already sent queries for the hostname of the Pod before it was created. Negative caching (normal in DNS) means that the results of previous failed lookups are remembered and reused, even after the Pod is running, for at least a few seconds.
If you need to discover Pods promptly after they are created, you have a few options:
- Query the Kubernetes API directly (for example, using a watch) rather than relying on DNS lookups.
- Decrease the time of caching in your Kubernetes DNS provider (typically this means editing the config map for CoreDNS, which currently caches for 30 seconds).
As mentioned in the limitations section, you are responsible for creating the Headless Service responsible for the network identity of the pods.
Here are some examples of choices for Cluster Domain, Service name, StatefulSet name, and how that affects the DNS names for the StatefulSet's Pods.
Cluster Domain | Service (ns/name) | StatefulSet (ns/name) | StatefulSet Domain | Pod DNS | Pod Hostname |
---|---|---|---|---|---|
cluster.local | default/nginx | default/web | nginx.default.svc.cluster.local | web-{0..N-1}.nginx.default.svc.cluster.local | web-{0..N-1} |
cluster.local | foo/nginx | foo/web | nginx.foo.svc.cluster.local | web-{0..N-1}.nginx.foo.svc.cluster.local | web-{0..N-1} |
kube.local | foo/nginx | foo/web | nginx.foo.svc.kube.local | web-{0..N-1}.nginx.foo.svc.kube.local | web-{0..N-1} |
The cluster domain will be set to cluster.local unless otherwise configured.
Stable Storage
For each VolumeClaimTemplate entry defined in a StatefulSet, each Pod receives one PersistentVolumeClaim. In the nginx example above, each Pod receives a single PersistentVolume with a StorageClass of my-storage-class and 1 GiB of provisioned storage. If no StorageClass is specified, then the default StorageClass will be used. When a Pod is (re)scheduled onto a node, its volumeMounts mount the PersistentVolumes associated with its PersistentVolumeClaims. Note that the PersistentVolumes associated with the Pods' PersistentVolumeClaims are not deleted when the Pods, or the StatefulSet, are deleted. This must be done manually.
Pod Name Label
When the StatefulSet Controller creates a Pod, it adds a label, statefulset.kubernetes.io/pod-name, that is set to the name of the Pod. This label allows you to attach a Service to a specific Pod in the StatefulSet.
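For example, a sketch of a Service that targets only the first Pod of the web StatefulSet above; the Service name web-0-direct is hypothetical:
apiVersion: v1
kind: Service
metadata:
  name: web-0-direct # hypothetical name
spec:
  selector:
    statefulset.kubernetes.io/pod-name: web-0 # label added by the StatefulSet controller
  ports:
  - port: 80
    name: web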
Deployment and Scaling Guarantees
- For a StatefulSet with N replicas, when Pods are being deployed, they are created sequentially, in order from {0..N-1}.
- When Pods are being deleted, they are terminated in reverse order, from {N-1..0}.
- Before a scaling operation is applied to a Pod, all of its predecessors must be Running and Ready.
- Before a Pod is terminated, all of its successors must be completely shutdown.
The StatefulSet should not specify a pod.Spec.TerminationGracePeriodSeconds of 0. This practice is unsafe and strongly discouraged. For further explanation, please refer to force deleting StatefulSet Pods.
When the nginx example above is created, three Pods will be deployed in the order web-0, web-1, web-2. web-1 will not be deployed before web-0 is Running and Ready, and web-2 will not be deployed until web-1 is Running and Ready. If web-0 should fail, after web-1 is Running and Ready, but before web-2 is launched, web-2 will not be launched until web-0 is successfully relaunched and becomes Running and Ready.
If a user were to scale the deployed example by patching the StatefulSet such that replicas=1, web-2 would be terminated first. web-1 would not be terminated until web-2 is fully shutdown and deleted. If web-0 were to fail after web-2 has been terminated and is completely shutdown, but prior to web-1's termination, web-1 would not be terminated until web-0 is Running and Ready.
Pod Management Policies
In Kubernetes 1.7 and later, StatefulSet allows you to relax its ordering guarantees while preserving its uniqueness and identity guarantees via its .spec.podManagementPolicy field.
OrderedReady Pod Management
OrderedReady pod management is the default for StatefulSets. It implements the behavior described above.
Parallel Pod Management
Parallel pod management tells the StatefulSet controller to launch or terminate all Pods in parallel, and to not wait for Pods to become Running and Ready or completely terminated prior to launching or terminating another Pod. This option only affects the behavior for scaling operations. Updates are not affected.
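For example, a minimal sketch of the change to the web StatefulSet above (all other fields unchanged):
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  podManagementPolicy: Parallel # launch and terminate Pods without waiting on ordinal order
  ...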
Update strategies
A StatefulSet's .spec.updateStrategy field allows you to configure and disable automated rolling updates for containers, labels, resource request/limits, and annotations for the Pods in a StatefulSet. There are two possible values:
OnDelete
- When a StatefulSet's .spec.updateStrategy.type is set to OnDelete, the StatefulSet controller will not automatically update the Pods in a StatefulSet. Users must manually delete Pods to cause the controller to create new Pods that reflect modifications made to a StatefulSet's .spec.template.
RollingUpdate
- The RollingUpdate update strategy implements automated, rolling updates for the Pods in a StatefulSet. This is the default update strategy.
Rolling Updates
When a StatefulSet's .spec.updateStrategy.type is set to RollingUpdate, the StatefulSet controller will delete and recreate each Pod in the StatefulSet. It will proceed in the same order as Pod termination (from the largest ordinal to the smallest), updating each Pod one at a time.
The Kubernetes control plane waits until an updated Pod is Running and Ready prior to updating its predecessor. If you have set .spec.minReadySeconds (see Minimum Ready Seconds), the control plane additionally waits that amount of time after the Pod turns ready, before moving on.
Partitioned rolling updates
The RollingUpdate update strategy can be partitioned, by specifying a .spec.updateStrategy.rollingUpdate.partition. If a partition is specified, all Pods with an ordinal that is greater than or equal to the partition will be updated when the StatefulSet's .spec.template is updated. All Pods with an ordinal that is less than the partition will not be updated, and, even if they are deleted, they will be recreated at the previous version. If a StatefulSet's .spec.updateStrategy.rollingUpdate.partition is greater than its .spec.replicas, updates to its .spec.template will not be propagated to its Pods.
In most cases you will not need to use a partition, but they are useful if you want to stage an update, roll out a canary, or perform a phased roll out.
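For example, a sketch of a fragment of the web StatefulSet spec above that updates only Pods with an ordinal of 2 or higher when the template changes:
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 2
  ...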
Forced rollback
When using Rolling Updates with the default Pod Management Policy (OrderedReady), it's possible to get into a broken state that requires manual intervention to repair.
If you update the Pod template to a configuration that never becomes Running and Ready (for example, due to a bad binary or application-level configuration error), StatefulSet will stop the rollout and wait.
In this state, it's not enough to revert the Pod template to a good configuration. Due to a known issue, StatefulSet will continue to wait for the broken Pod to become Ready (which never happens) before it will attempt to revert it back to the working configuration.
After reverting the template, you must also delete any Pods that StatefulSet had already attempted to run with the bad configuration. StatefulSet will then begin to recreate the Pods using the reverted template.
PersistentVolumeClaim retention
Kubernetes v1.23 [alpha]
The optional .spec.persistentVolumeClaimRetentionPolicy field controls if and how PVCs are deleted during the lifecycle of a StatefulSet. You must enable the StatefulSetAutoDeletePVC feature gate to use this field. Once enabled, there are two policies you can configure for each StatefulSet:
whenDeleted
- configures the volume retention behavior that applies when the StatefulSet is deleted
whenScaled
- configures the volume retention behavior that applies when the replica count of the StatefulSet is reduced; for example, when scaling down the set.
For each policy that you can configure, you can set the value to either Delete or Retain.
Delete
- The PVCs created from the StatefulSet volumeClaimTemplate are deleted for each Pod affected by the policy. With the whenDeleted policy all PVCs from the volumeClaimTemplate are deleted after their Pods have been deleted. With the whenScaled policy, only PVCs corresponding to Pod replicas being scaled down are deleted, after their Pods have been deleted.
Retain (default)
- PVCs from the volumeClaimTemplate are not affected when their Pod is deleted. This is the behavior before this new feature.
Bear in mind that these policies only apply when Pods are being removed due to the StatefulSet being deleted or scaled down. For example, if a Pod associated with a StatefulSet fails due to node failure, and the control plane creates a replacement Pod, the StatefulSet retains the existing PVC. The existing volume is unaffected, and the cluster will attach it to the node where the new Pod is about to launch.
The default for policies is Retain, matching the StatefulSet behavior before this new feature.
Here is an example policy.
apiVersion: apps/v1
kind: StatefulSet
...
spec:
persistentVolumeClaimRetentionPolicy:
whenDeleted: Retain
whenScaled: Delete
...
The StatefulSet controller adds owner references to its PVCs, which are then deleted by the garbage collector after the Pod is terminated. This enables the Pod to cleanly unmount all volumes before the PVCs are deleted (and before the backing PV and volume are deleted, depending on the retain policy). When you set the whenDeleted policy to Delete, an owner reference to the StatefulSet instance is placed on all PVCs associated with that StatefulSet.
The whenScaled policy must delete PVCs only when a Pod is scaled down, and not when a Pod is deleted for another reason. When reconciling, the StatefulSet controller compares its desired replica count to the actual Pods present on the cluster. Any StatefulSet Pod whose ordinal is greater than or equal to the replica count is condemned and marked for deletion. If the whenScaled policy is Delete, the condemned Pods are first set as owners to the associated StatefulSet template PVCs, before the Pod is deleted. This causes the PVCs to be garbage collected only after the condemned Pods have terminated.
This means that if the controller crashes and restarts, no Pod will be deleted before its owner reference has been updated appropriately for the policy. If a condemned Pod is force-deleted while the controller is down, the owner reference may or may not have been set up, depending on when the controller crashed. It may take several reconcile loops to update the owner references, so some condemned Pods may have set up owner references and others may not. For this reason we recommend waiting for the controller to come back up, which will verify owner references before terminating Pods. If that is not possible, the operator should verify the owner references on PVCs to ensure the expected objects are deleted when Pods are force-deleted.
Replicas
.spec.replicas is an optional field that specifies the number of desired Pods. It defaults to 1.
Should you manually scale a StatefulSet, for example via kubectl scale statefulset statefulset --replicas=X, and then update that StatefulSet based on a manifest (for example: by running kubectl apply -f statefulset.yaml), then applying that manifest overwrites the manual scaling that you previously did.
If a HorizontalPodAutoscaler (or any similar API for horizontal scaling) is managing scaling for a StatefulSet, don't set .spec.replicas. Instead, allow the Kubernetes control plane to manage the .spec.replicas field automatically.
What's next
- Learn about Pods.
- Find out how to use StatefulSets.
- Follow an example of deploying a stateful application.
- Follow an example of deploying Cassandra with Stateful Sets.
- Follow an example of running a replicated stateful application.
- Learn how to scale a StatefulSet.
- Learn what's involved when you delete a StatefulSet.
- Learn how to configure a Pod to use a volume for storage.
- Learn how to configure a Pod to use a PersistentVolume for storage.
- StatefulSet is a top-level resource in the Kubernetes REST API. Read the StatefulSet object definition to understand the API for stateful sets.
- Read about PodDisruptionBudget and how you can use it to manage application availability during disruptions.
3.4.2.4 - DaemonSet
A DaemonSet ensures that all (or some) Nodes run a copy of a Pod. As nodes are added to the cluster, Pods are added to them. As nodes are removed from the cluster, those Pods are garbage collected. Deleting a DaemonSet will clean up the Pods it created.
Some typical uses of a DaemonSet are:
- running a cluster storage daemon on every node
- running a logs collection daemon on every node
- running a node monitoring daemon on every node
In a simple case, one DaemonSet, covering all nodes, would be used for each type of daemon. A more complex setup might use multiple DaemonSets for a single type of daemon, but with different flags and/or different memory and CPU requests for different hardware types.
Writing a DaemonSet Spec
Create a DaemonSet
You can describe a DaemonSet in a YAML file. For example, the daemonset.yaml file below describes a DaemonSet that runs the fluentd-elasticsearch Docker image:
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: fluentd-elasticsearch
namespace: kube-system
labels:
k8s-app: fluentd-logging
spec:
selector:
matchLabels:
name: fluentd-elasticsearch
template:
metadata:
labels:
name: fluentd-elasticsearch
spec:
tolerations:
# this toleration is to have the daemonset runnable on master nodes
# remove it if your masters can't run pods
- key: node-role.kubernetes.io/master
operator: Exists
effect: NoSchedule
containers:
- name: fluentd-elasticsearch
image: quay.io/fluentd_elasticsearch/fluentd:v2.5.2
resources:
limits:
memory: 200Mi
requests:
cpu: 100m
memory: 200Mi
volumeMounts:
- name: varlog
mountPath: /var/log
- name: varlibdockercontainers
mountPath: /var/lib/docker/containers
readOnly: true
terminationGracePeriodSeconds: 30
volumes:
- name: varlog
hostPath:
path: /var/log
- name: varlibdockercontainers
hostPath:
path: /var/lib/docker/containers
Create a DaemonSet based on the YAML file:
kubectl apply -f https://k8s.io/examples/controllers/daemonset.yaml
Required Fields
As with all other Kubernetes config, a DaemonSet needs apiVersion, kind, and metadata fields. For general information about working with config files, see running stateless applications and object management using kubectl.
The name of a DaemonSet object must be a valid DNS subdomain name.
A DaemonSet also needs a .spec section.
Pod Template
The .spec.template is one of the required fields in .spec.
The .spec.template is a pod template. It has exactly the same schema as a Pod, except it is nested and does not have an apiVersion or kind.
In addition to required fields for a Pod, a Pod template in a DaemonSet has to specify appropriate labels (see pod selector).
A Pod Template in a DaemonSet must have a RestartPolicy equal to Always, or be unspecified, which defaults to Always.
Pod Selector
The .spec.selector field is a pod selector. It works the same as the .spec.selector of a Job.
As of Kubernetes 1.8, you must specify a pod selector that matches the labels of the .spec.template. The pod selector will no longer be defaulted when left empty. Selector defaulting was not compatible with kubectl apply. Also, once a DaemonSet is created, its .spec.selector can not be mutated. Mutating the pod selector can lead to the unintentional orphaning of Pods, and it was found to be confusing to users.
The .spec.selector is an object consisting of two fields:
matchLabels
- works the same as the .spec.selector of a ReplicationController.
matchExpressions
- allows you to build more sophisticated selectors by specifying a key, a list of values, and an operator that relates the key and values.
When the two are specified the result is ANDed.
If the .spec.selector is specified, it must match the .spec.template.metadata.labels. Config with these not matching will be rejected by the API.
Running Pods on select Nodes
If you specify a .spec.template.spec.nodeSelector, then the DaemonSet controller will create Pods on nodes which match that node selector. Likewise if you specify a .spec.template.spec.affinity, then the DaemonSet controller will create Pods on nodes which match that node affinity. If you do not specify either, then the DaemonSet controller will create Pods on all nodes.
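For example, a sketch of a pod template fragment that would restrict the fluentd-elasticsearch DaemonSet above to nodes carrying a hypothetical ssd=true label:
spec:
  template:
    spec:
      nodeSelector:
        ssd: "true" # hypothetical node label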
How Daemon Pods are scheduled
Scheduled by default scheduler
Kubernetes v1.23 [stable]
A DaemonSet ensures that all eligible nodes run a copy of a Pod. Normally, the node that a Pod runs on is selected by the Kubernetes scheduler. However, DaemonSet pods are created and scheduled by the DaemonSet controller instead. That introduces the following issues:
- Inconsistent Pod behavior: Normal Pods waiting to be scheduled are created and in Pending state, but DaemonSet pods are not created in Pending state. This is confusing to the user.
- Pod preemption is handled by the default scheduler. When preemption is enabled, the DaemonSet controller will make scheduling decisions without considering pod priority and preemption.
ScheduleDaemonSetPods allows you to schedule DaemonSets using the default scheduler instead of the DaemonSet controller, by adding the NodeAffinity term to the DaemonSet pods, instead of the .spec.nodeName term. The default scheduler is then used to bind the pod to the target host. If node affinity of the DaemonSet pod already exists, it is replaced (the original node affinity was taken into account before selecting the target host). The DaemonSet controller only performs these operations when creating or modifying DaemonSet pods, and no changes are made to the spec.template of the DaemonSet.
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchFields:
- key: metadata.name
operator: In
values:
- target-host-name
In addition, the node.kubernetes.io/unschedulable:NoSchedule toleration is added automatically to DaemonSet Pods. The default scheduler ignores unschedulable Nodes when scheduling DaemonSet Pods.
Taints and Tolerations
Although Daemon Pods respect taints and tolerations, the following tolerations are added to DaemonSet Pods automatically according to the related features.
Toleration Key | Effect | Version | Description |
---|---|---|---|
node.kubernetes.io/not-ready |
NoExecute | 1.13+ | DaemonSet pods will not be evicted when there are node problems such as a network partition. |
node.kubernetes.io/unreachable |
NoExecute | 1.13+ | DaemonSet pods will not be evicted when there are node problems such as a network partition. |
node.kubernetes.io/disk-pressure |
NoSchedule | 1.8+ | DaemonSet pods tolerate disk-pressure attributes by default scheduler. |
node.kubernetes.io/memory-pressure |
NoSchedule | 1.8+ | DaemonSet pods tolerate memory-pressure attributes by default scheduler. |
node.kubernetes.io/unschedulable |
NoSchedule | 1.12+ | DaemonSet pods tolerate unschedulable attributes by default scheduler. |
node.kubernetes.io/network-unavailable |
NoSchedule | 1.12+ | DaemonSet pods that use host networking tolerate network-unavailable attributes by the default scheduler. |
Communicating with Daemon Pods
Some possible patterns for communicating with Pods in a DaemonSet are:
- Push: Pods in the DaemonSet are configured to send updates to another service, such as a stats database. They do not have clients.
- NodeIP and Known Port: Pods in the DaemonSet can use a hostPort, so that the pods are reachable via the node IPs. Clients know the list of node IPs somehow, and know the port by convention.
- DNS: Create a headless service with the same pod selector (see the sketch after this list), and then discover DaemonSets using the endpoints resource or retrieve multiple A records from DNS.
- Service: Create a service with the same Pod selector, and use the service to reach a daemon on a random node. (There is no way to reach a specific node.)
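For example, a sketch of the DNS option for the fluentd-elasticsearch DaemonSet above; the Service name and port here are hypothetical:
apiVersion: v1
kind: Service
metadata:
  name: fluentd-discovery # hypothetical name
  namespace: kube-system
spec:
  clusterIP: None # headless: DNS returns the individual daemon Pod IPs
  selector:
    name: fluentd-elasticsearch # same selector as the DaemonSet's Pods
  ports:
  - port: 24224 # hypothetical port exposed by the daemon
    name: forward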
Updating a DaemonSet
If node labels are changed, the DaemonSet will promptly add Pods to newly matching nodes and delete Pods from newly not-matching nodes.
You can modify the Pods that a DaemonSet creates. However, Pods do not allow all fields to be updated. Also, the DaemonSet controller will use the original template the next time a node (even with the same name) is created.
You can delete a DaemonSet. If you specify --cascade=orphan with kubectl, then the Pods will be left on the nodes. If you subsequently create a new DaemonSet with the same selector, the new DaemonSet adopts the existing Pods. If any Pods need replacing, the DaemonSet replaces them according to its updateStrategy.
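For example, a sketch of orphan-deleting the fluentd-elasticsearch DaemonSet above while leaving its Pods running on the nodes:
kubectl delete ds fluentd-elasticsearch -n kube-system --cascade=orphan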
You can perform a rolling update on a DaemonSet.
Alternatives to DaemonSet
Init scripts
It is certainly possible to run daemon processes by directly starting them on a node (e.g. using init, upstartd, or systemd). This is perfectly fine. However, there are several advantages to running such processes via a DaemonSet:
- Ability to monitor and manage logs for daemons in the same way as applications.
- Same config language and tools (e.g. Pod templates, kubectl) for daemons and applications.
- Running daemons in containers with resource limits increases isolation of daemons from app containers. However, this can also be accomplished by running the daemons in a container but not in a Pod (e.g. start directly via Docker).
Bare Pods
It is possible to create Pods directly which specify a particular node to run on. However, a DaemonSet replaces Pods that are deleted or terminated for any reason, such as in the case of node failure or disruptive node maintenance, such as a kernel upgrade. For this reason, you should use a DaemonSet rather than creating individual Pods.
Static Pods
It is possible to create Pods by writing a file to a certain directory watched by Kubelet. These are called static pods. Unlike DaemonSet, static Pods cannot be managed with kubectl or other Kubernetes API clients. Static Pods do not depend on the apiserver, making them useful in cluster bootstrapping cases. Also, static Pods may be deprecated in the future.
Deployments
DaemonSets are similar to Deployments in that they both create Pods, and those Pods have processes which are not expected to terminate (e.g. web servers, storage servers).
Use a Deployment for stateless services, like frontends, where scaling up and down the number of replicas and rolling out updates are more important than controlling exactly which host the Pod runs on. Use a DaemonSet when it is important that a copy of a Pod always run on all or certain hosts, if the DaemonSet provides node-level functionality that allows other Pods to run correctly on that particular node.
For example, network plugins often include a component that runs as a DaemonSet. The DaemonSet component makes sure that the node where it's running has working cluster networking.
What's next
- Learn about Pods.
- Learn about static Pods, which are useful for running Kubernetes control plane components.
- Find out how to use DaemonSets.
- Perform a rolling update on a DaemonSet.
- Perform a rollback on a DaemonSet (for example, if a roll out didn't work how you expected).
- Understand how Kubernetes assigns Pods to Nodes.
- Learn about device plugins and add ons, which often run as DaemonSets.
- DaemonSet is a top-level resource in the Kubernetes REST API. Read the DaemonSet object definition to understand the API for daemon sets.
3.4.2.5 - Jobs
A Job creates one or more Pods and will continue to retry execution of the Pods until a specified number of them successfully terminate. As pods successfully complete, the Job tracks the successful completions. When a specified number of successful completions is reached, the task (i.e., the Job) is complete. Deleting a Job will clean up the Pods it created. Suspending a Job will delete its active Pods until the Job is resumed again.
A simple case is to create one Job object in order to reliably run one Pod to completion. The Job object will start a new Pod if the first Pod fails or is deleted (for example due to a node hardware failure or a node reboot).
You can also use a Job to run multiple Pods in parallel.
If you want to run a Job (either a single task, or several in parallel) on a schedule, see CronJob.
Running an example Job
Here is an example Job config. It computes π to 2000 places and prints it out. It takes around 10s to complete.
apiVersion: batch/v1
kind: Job
metadata:
name: pi
spec:
template:
spec:
containers:
- name: pi
image: perl
command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
restartPolicy: Never
backoffLimit: 4
You can run the example with this command:
kubectl apply -f https://kubernetes.io/examples/controllers/job.yaml
The output is similar to this:
job.batch/pi created
Check on the status of the Job with kubectl:
kubectl describe jobs/pi
The output is similar to this:
Name: pi
Namespace: default
Selector: controller-uid=c9948307-e56d-4b5d-8302-ae2d7b7da67c
Labels: controller-uid=c9948307-e56d-4b5d-8302-ae2d7b7da67c
job-name=pi
Annotations: kubectl.kubernetes.io/last-applied-configuration:
{"apiVersion":"batch/v1","kind":"Job","metadata":{"annotations":{},"name":"pi","namespace":"default"},"spec":{"backoffLimit":4,"template":...
Parallelism: 1
Completions: 1
Start Time: Mon, 02 Dec 2019 15:20:11 +0200
Completed At: Mon, 02 Dec 2019 15:21:16 +0200
Duration: 65s
Pods Statuses: 0 Running / 1 Succeeded / 0 Failed
Pod Template:
Labels: controller-uid=c9948307-e56d-4b5d-8302-ae2d7b7da67c
job-name=pi
Containers:
pi:
Image: perl
Port: <none>
Host Port: <none>
Command:
perl
-Mbignum=bpi
-wle
print bpi(2000)
Environment: <none>
Mounts: <none>
Volumes: <none>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreate 14m job-controller Created pod: pi-5rwd7
To view completed Pods of a Job, use kubectl get pods.
To list all the Pods that belong to a Job in a machine readable form, you can use a command like this:
pods=$(kubectl get pods --selector=job-name=pi --output=jsonpath='{.items[*].metadata.name}')
echo $pods
The output is similar to this:
pi-5rwd7
Here, the selector is the same as the selector for the Job. The --output=jsonpath option specifies an expression with the name from each Pod in the returned list.
View the standard output of one of the pods:
kubectl logs $pods
The output is similar to this:
3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679821480865132823066470938446095505822317253594081284811174502841027019385211055596446229489549303819644288109756659334461284756482337867831652712019091456485669234603486104543266482133936072602491412737245870066063155881748815209209628292540917153643678925903600113305305488204665213841469519415116094330572703657595919530921861173819326117931051185480744623799627495673518857527248912279381830119491298336733624406566430860213949463952247371907021798609437027705392171762931767523846748184676694051320005681271452635608277857713427577896091736371787214684409012249534301465495853710507922796892589235420199561121290219608640344181598136297747713099605187072113499999983729780499510597317328160963185950244594553469083026425223082533446850352619311881710100031378387528865875332083814206171776691473035982534904287554687311595628638823537875937519577818577805321712268066130019278766111959092164201989380952572010654858632788659361533818279682303019520353018529689957736225994138912497217752834791315155748572424541506959508295331168617278558890750983817546374649393192550604009277016711390098488240128583616035637076601047101819429555961989467678374494482553797747268471040475346462080466842590694912933136770289891521047521620569660240580381501935112533824300355876402474964732639141992726042699227967823547816360093417216412199245863150302861829745557067498385054945885869269956909272107975093029553211653449872027559602364806654991198818347977535663698074265425278625518184175746728909777727938000816470600161452491921732172147723501414419735685481613611573525521334757418494684385233239073941433345477624168625189835694855620992192221842725502542568876717904946016534668049886272327917860857843838279679766814541009538837863609506800642251252051173929848960841284886269456042419652850222106611863067442786220391949450471237137869609563643719172874677646575739624138908658326459958133904780275901
Writing a Job spec
As with all other Kubernetes config, a Job needs apiVersion, kind, and metadata fields. Its name must be a valid DNS subdomain name.
A Job also needs a .spec section.
Pod Template
The .spec.template is the only required field of the .spec.
The .spec.template is a pod template. It has exactly the same schema as a Pod, except it is nested and does not have an apiVersion or kind.
In addition to required fields for a Pod, a pod template in a Job must specify appropriate labels (see pod selector) and an appropriate restart policy. Only a RestartPolicy equal to Never or OnFailure is allowed.
Pod selector
The .spec.selector field is optional. In almost all cases you should not specify it. See the section on specifying your own pod selector.
Parallel execution for Jobs
There are three main types of task suitable to run as a Job:
- Non-parallel Jobs
  - normally, only one Pod is started, unless the Pod fails.
  - the Job is complete as soon as its Pod terminates successfully.
- Parallel Jobs with a fixed completion count:
  - specify a non-zero positive value for .spec.completions.
  - the Job represents the overall task, and is complete when there are .spec.completions successful Pods.
  - when using .spec.completionMode="Indexed", each Pod gets a different index in the range 0 to .spec.completions-1.
- Parallel Jobs with a work queue:
  - do not specify .spec.completions, default to .spec.parallelism.
  - the Pods must coordinate amongst themselves or an external service to determine what each should work on. For example, a Pod might fetch a batch of up to N items from the work queue.
  - each Pod is independently capable of determining whether or not all its peers are done, and thus that the entire Job is done.
  - when any Pod from the Job terminates with success, no new Pods are created.
  - once at least one Pod has terminated with success and all Pods are terminated, then the Job is completed with success.
  - once any Pod has exited with success, no other Pod should still be doing any work for this task or writing any output. They should all be in the process of exiting.
For a non-parallel Job, you can leave both .spec.completions and .spec.parallelism unset. When both are unset, both are defaulted to 1.
For a fixed completion count Job, you should set .spec.completions to the number of completions needed. You can set .spec.parallelism, or leave it unset and it will default to 1.
For a work queue Job, you must leave .spec.completions unset, and set .spec.parallelism to a non-negative integer.
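For example, a minimal sketch of a work queue Job; the name, image, and queue logic are hypothetical placeholders:
apiVersion: batch/v1
kind: Job
metadata:
  name: queue-worker # hypothetical name
spec:
  parallelism: 3 # .spec.completions is deliberately left unset
  template:
    spec:
      containers:
      - name: worker
        image: example.com/queue-worker:1.0 # hypothetical image that pulls items from a shared queue
      restartPolicy: OnFailure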
For more information about how to make use of the different types of job, see the job patterns section.
Controlling parallelism
The requested parallelism (.spec.parallelism) can be set to any non-negative value. If it is unspecified, it defaults to 1. If it is specified as 0, then the Job is effectively paused until it is increased.
Actual parallelism (number of pods running at any instant) may be more or less than requested parallelism, for a variety of reasons:
- For fixed completion count Jobs, the actual number of pods running in parallel will not exceed the number of remaining completions. Higher values of .spec.parallelism are effectively ignored.
- For work queue Jobs, no new Pods are started after any Pod has succeeded; remaining Pods are allowed to complete, however.
- If the Job Controller has not had time to react.
- If the Job controller failed to create Pods for any reason (lack of ResourceQuota, lack of permission, etc.), then there may be fewer pods than requested.
- The Job controller may throttle new Pod creation due to excessive previous pod failures in the same Job.
- When a Pod is gracefully shut down, it takes time to stop.
- When a Pod is gracefully shut down, it takes time to stop.
Completion mode
Kubernetes v1.22 [beta]
Jobs with fixed completion count - that is, jobs that have non-null .spec.completions - can have a completion mode that is specified in .spec.completionMode:
- NonIndexed (default): the Job is considered complete when there have been .spec.completions successfully completed Pods. In other words, each Pod completion is homologous to each other. Note that Jobs that have null .spec.completions are implicitly NonIndexed.
- Indexed: the Pods of a Job get an associated completion index from 0 to .spec.completions-1. The index is available through three mechanisms:
  - The Pod annotation batch.kubernetes.io/job-completion-index.
  - As part of the Pod hostname, following the pattern $(job-name)-$(index). When you use an Indexed Job in combination with a Service, Pods within the Job can use the deterministic hostnames to address each other via DNS.
  - From the containerized task, in the environment variable JOB_COMPLETION_INDEX.
  The Job is considered complete when there is one successfully completed Pod for each index. For more information about how to use this mode, see Indexed Job for Parallel Processing with Static Work Assignment. Note that, although rare, more than one Pod could be started for the same index, but only one of them will count towards the completion count.
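For example, a minimal sketch of an Indexed Job whose Pods simply print their own completion index (the Job name is illustrative):
apiVersion: batch/v1
kind: Job
metadata:
  name: indexed-demo # illustrative name
spec:
  completions: 5
  parallelism: 2
  completionMode: Indexed
  template:
    spec:
      containers:
      - name: worker
        image: busybox:1.28
        command: ["sh", "-c", "echo My completion index is $JOB_COMPLETION_INDEX"]
      restartPolicy: Never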
Handling Pod and container failures
A container in a Pod may fail for a number of reasons, such as because the process in it exited with a non-zero exit code, or the container was killed for exceeding a memory limit, etc. If this happens, and the .spec.template.spec.restartPolicy = "OnFailure", then the Pod stays on the node, but the container is re-run. Therefore, your program needs to handle the case when it is restarted locally, or else specify .spec.template.spec.restartPolicy = "Never". See pod lifecycle for more information on restartPolicy.
An entire Pod can also fail, for a number of reasons, such as when the pod is kicked off the node (node is upgraded, rebooted, deleted, etc.), or if a container of the Pod fails and the .spec.template.spec.restartPolicy = "Never". When a Pod fails, then the Job controller starts a new Pod. This means that your application needs to handle the case when it is restarted in a new pod. In particular, it needs to handle temporary files, locks, incomplete output and the like caused by previous runs.
Note that even if you specify .spec.parallelism = 1 and .spec.completions = 1 and .spec.template.spec.restartPolicy = "Never", the same program may sometimes be started twice.
If you do specify .spec.parallelism and .spec.completions both greater than 1, then there may be multiple pods running at once. Therefore, your pods must also be tolerant of concurrency.
Pod backoff failure policy
There are situations where you want to fail a Job after some amount of retries due to a logical error in configuration etc. To do so, set .spec.backoffLimit to specify the number of retries before considering a Job as failed. The back-off limit is set by default to 6. Failed Pods associated with the Job are recreated by the Job controller with an exponential back-off delay (10s, 20s, 40s ...) capped at six minutes. The back-off count is reset when a Job's Pod is deleted or successful without any other Pods for the Job failing around that time.
If your Job has restartPolicy = "OnFailure", keep in mind that your Pod running the Job will be terminated once the job backoff limit has been reached. This can make debugging the Job's executable more difficult. We suggest setting restartPolicy = "Never" when debugging the Job or using a logging system to ensure output from failed Jobs is not lost inadvertently.
Job termination and cleanup
When a Job completes, no more Pods are created, but the Pods are usually not deleted either. Keeping them around allows you to still view the logs of completed pods to check for errors, warnings, or other diagnostic output. The job object also remains after it is completed so that you can view its status. It is up to the user to delete old jobs after noting their status. Delete the job with kubectl (e.g. kubectl delete jobs/pi or kubectl delete -f ./job.yaml). When you delete the job using kubectl, all the pods it created are deleted too.
By default, a Job will run uninterrupted unless a Pod fails (restartPolicy=Never) or a Container exits in error (restartPolicy=OnFailure), at which point the Job defers to the .spec.backoffLimit described above. Once .spec.backoffLimit has been reached the Job will be marked as failed and any running Pods will be terminated.
Another way to terminate a Job is by setting an active deadline. Do this by setting the .spec.activeDeadlineSeconds field of the Job to a number of seconds. The activeDeadlineSeconds applies to the duration of the job, no matter how many Pods are created. Once a Job reaches activeDeadlineSeconds, all of its running Pods are terminated and the Job status will become type: Failed with reason: DeadlineExceeded.
Note that a Job's .spec.activeDeadlineSeconds takes precedence over its .spec.backoffLimit. Therefore, a Job that is retrying one or more failed Pods will not deploy additional Pods once it reaches the time limit specified by activeDeadlineSeconds, even if the backoffLimit is not yet reached.
Example:
apiVersion: batch/v1
kind: Job
metadata:
name: pi-with-timeout
spec:
backoffLimit: 5
activeDeadlineSeconds: 100
template:
spec:
containers:
- name: pi
image: perl
command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
restartPolicy: Never
Note that both the Job spec and the Pod template spec within the Job have an activeDeadlineSeconds field. Ensure that you set this field at the proper level.
Keep in mind that the restartPolicy applies to the Pod, and not to the Job itself: there is no automatic Job restart once the Job status is type: Failed. That is, the Job termination mechanisms activated with .spec.activeDeadlineSeconds and .spec.backoffLimit result in a permanent Job failure that requires manual intervention to resolve.
Clean up finished jobs automatically
Finished Jobs are usually no longer needed in the system. Keeping them around in the system will put pressure on the API server. If the Jobs are managed directly by a higher level controller, such as CronJobs, the Jobs can be cleaned up by CronJobs based on the specified capacity-based cleanup policy.
TTL mechanism for finished Jobs
Kubernetes v1.23 [stable]
Another way to clean up finished Jobs (either Complete or Failed) automatically is to use a TTL mechanism provided by a TTL controller for finished resources, by specifying the .spec.ttlSecondsAfterFinished field of the Job.
When the TTL controller cleans up the Job, it deletes the Job in a cascading fashion, i.e. its dependent objects, such as Pods, are deleted together with the Job. Note that when the Job is deleted, its lifecycle guarantees, such as finalizers, will be honored.
For example:
apiVersion: batch/v1
kind: Job
metadata:
name: pi-with-ttl
spec:
ttlSecondsAfterFinished: 100
template:
spec:
containers:
- name: pi
image: perl
command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
restartPolicy: Never
The Job pi-with-ttl will be eligible to be automatically deleted, 100 seconds after it finishes.
If the field is set to 0, the Job will be eligible to be automatically deleted immediately after it finishes. If the field is unset, this Job won't be cleaned up by the TTL controller after it finishes.
It is recommended to set the ttlSecondsAfterFinished field, because unmanaged jobs (Jobs that you created directly, and not indirectly through other workload APIs such as CronJob) have a default deletion policy of orphanDependents, causing Pods created by an unmanaged Job to be left around after that Job is fully deleted.
Even though the control plane eventually garbage collects the Pods from a deleted Job after they either fail or complete, sometimes those lingering pods may cause cluster performance degradation or, in the worst case, cause the cluster to go offline due to this degradation.
You can use LimitRanges and ResourceQuotas to place a cap on the amount of resources that a particular namespace can consume.
Job patterns
The Job object can be used to support reliable parallel execution of Pods. The Job object is not designed to support closely-communicating parallel processes, as commonly found in scientific computing. It does support parallel processing of a set of independent but related work items. These might be emails to be sent, frames to be rendered, files to be transcoded, ranges of keys in a NoSQL database to scan, and so on.
In a complex system, there may be multiple different sets of work items. Here we are just considering one set of work items that the user wants to manage together — a batch job.
There are several different patterns for parallel computation, each with strengths and weaknesses. The tradeoffs are:
- One Job object for each work item, vs. a single Job object for all work items. The latter is better for large numbers of work items. The former creates some overhead for the user and for the system to manage large numbers of Job objects.
- Number of pods created equals number of work items, vs. each Pod can process multiple work items. The former typically requires less modification to existing code and containers. The latter is better for large numbers of work items, for similar reasons to the previous bullet.
- Several approaches use a work queue. This requires running a queue service, and modifications to the existing program or container to make it use the work queue. Other approaches are easier to adapt to an existing containerised application.
The tradeoffs are summarized here, with columns 2 to 4 corresponding to the above tradeoffs. The pattern names are also links to examples and more detailed description.
Pattern | Single Job object | Fewer pods than work items? | Use app unmodified? |
---|---|---|---|
Queue with Pod Per Work Item | ✓ | | sometimes |
Queue with Variable Pod Count | ✓ | ✓ | |
Indexed Job with Static Work Assignment | ✓ | | ✓ |
Job Template Expansion | | | ✓ |
When you specify completions with .spec.completions, each Pod created by the Job controller has an identical spec. This means that all pods for a task will have the same command line and the same image, the same volumes, and (almost) the same environment variables. These patterns are different ways to arrange for pods to work on different things.
This table shows the required settings for .spec.parallelism and .spec.completions for each of the patterns. Here, W is the number of work items.
Pattern | .spec.completions | .spec.parallelism |
---|---|---|
Queue with Pod Per Work Item | W | any |
Queue with Variable Pod Count | null | any |
Indexed Job with Static Work Assignment | W | any |
Job Template Expansion | 1 | should be 1 |
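As a concrete illustration of the Indexed Job with Static Work Assignment pattern, here is a minimal sketch of an Indexed Job; the Job name, image, and command are placeholders. Kubernetes exposes each Pod's index through the JOB_COMPLETION_INDEX environment variable, which the Pod can use to select its work item:
apiVersion: batch/v1
kind: Job
metadata:
  name: indexed-demo
spec:
  completions: 5        # W = 5 work items
  parallelism: 3
  completionMode: Indexed
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: busybox:1.28
        # Each Pod handles the work item identified by its completion index.
        command: ["sh", "-c", "echo processing work item $JOB_COMPLETION_INDEX"]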
Advanced usage
Suspending a Job
Kubernetes v1.22 [beta]
When a Job is created, the Job controller will immediately begin creating Pods to satisfy the Job's requirements and will continue to do so until the Job is complete. However, you may want to temporarily suspend a Job's execution and resume it later, or start Jobs in suspended state and have a custom controller decide later when to start them.
To suspend a Job, you can update the .spec.suspend field of the Job to true; later, when you want to resume it again, update it to false. Creating a Job with .spec.suspend set to true will create it in the suspended state.
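For example, you can toggle suspension from the command line with kubectl patch; myjob here is a placeholder Job name:
kubectl patch job/myjob --type=strategic --patch '{"spec":{"suspend":true}}'
# Later, resume the Job:
kubectl patch job/myjob --type=strategic --patch '{"spec":{"suspend":false}}'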
When a Job is resumed from suspension, its .status.startTime field will be reset to the current time. This means that the .spec.activeDeadlineSeconds timer will be stopped and reset when a Job is suspended and resumed.
Remember that suspending a Job will delete all active Pods. When the Job is suspended, your Pods will be terminated with a SIGTERM signal. The Pod's graceful termination period will be honored and your Pod must handle this signal in this period. This may involve saving progress for later or undoing changes. Pods terminated this way will not count towards the Job's completions count.
An example Job definition in the suspended state can be like so:
kubectl get job myjob -o yaml
apiVersion: batch/v1
kind: Job
metadata:
name: myjob
spec:
suspend: true
parallelism: 1
completions: 5
template:
spec:
...
The Job's status can be used to determine if a Job is suspended or has been suspended in the past:
kubectl get jobs/myjob -o yaml
apiVersion: batch/v1
kind: Job
# .metadata and .spec omitted
status:
conditions:
- lastProbeTime: "2021-02-05T13:14:33Z"
lastTransitionTime: "2021-02-05T13:14:33Z"
status: "True"
type: Suspended
startTime: "2021-02-05T13:13:48Z"
The Job condition of type "Suspended" with status "True" means the Job is
suspended; the lastTransitionTime
field can be used to determine how long the
Job has been suspended for. If the status of that condition is "False", then the
Job was previously suspended and is now running. If such a condition does not
exist in the Job's status, the Job has never been stopped.
Events are also created when the Job is suspended and resumed:
kubectl describe jobs/myjob
Name: myjob
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreate 12m job-controller Created pod: myjob-hlrpl
Normal SuccessfulDelete 11m job-controller Deleted pod: myjob-hlrpl
Normal Suspended 11m job-controller Job suspended
Normal SuccessfulCreate 3s job-controller Created pod: myjob-jvb44
Normal Resumed 3s job-controller Job resumed
The last four events, particularly the "Suspended" and "Resumed" events, are
directly a result of toggling the .spec.suspend
field. In the time between
these two events, we see that no Pods were created, but Pod creation restarted
as soon as the Job was resumed.
Mutable Scheduling Directives
Kubernetes v1.23 [beta]
To use this behavior, you must enable the JobMutableNodeSchedulingDirectives feature gate on the API server. It is enabled by default.
In most cases a parallel job will want the pods to run with constraints, like all in the same zone, or all either on GPU model x or y but not a mix of both.
The suspend field is the first step towards achieving those semantics. Suspend allows a custom queue controller to decide when a job should start; however, once a job is unsuspended, a custom queue controller has no influence on where the pods of a job will actually land.
This feature allows updating a Job's scheduling directives before it starts, which gives custom queue controllers the ability to influence pod placement while at the same time offloading actual pod-to-node assignment to kube-scheduler. This is allowed only for suspended Jobs that have never been unsuspended before.
The fields in a Job's pod template that can be updated are node affinity, node selector, tolerations, labels and annotations.
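For instance, a queue controller might update the node selector of a still-suspended Job before unsuspending it. The following sketch assumes the JobMutableNodeSchedulingDirectives feature gate is enabled; the Job name and zone value are placeholders:
# Constrain the (suspended, never-unsuspended) Job's pods to one zone:
kubectl patch job/myjob --type=strategic --patch '{"spec":{"template":{"spec":{"nodeSelector":{"topology.kubernetes.io/zone":"zone-a"}}}}}'
# Then let the Job start:
kubectl patch job/myjob --type=strategic --patch '{"spec":{"suspend":false}}'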
Specifying your own Pod selector
Normally, when you create a Job object, you do not specify .spec.selector. The system defaulting logic adds this field when the Job is created. It picks a selector value that will not overlap with any other jobs.
However, in some cases, you might need to override this automatically set selector. To do this, you can specify the .spec.selector of the Job.
Be very careful when doing this. If you specify a label selector which is not unique to the pods of that Job, and which matches unrelated Pods, then pods of the unrelated job may be deleted, or this Job may count other Pods as completing it, or one or both Jobs may refuse to create Pods or run to completion. If a non-unique selector is chosen, then other controllers (e.g. ReplicationController) and their Pods may behave in unpredictable ways too. Kubernetes will not stop you from making a mistake when specifying .spec.selector.
Here is an example of a case when you might want to use this feature.
Say Job old is already running. You want existing Pods to keep running, but you want the rest of the Pods it creates to use a different pod template and for the Job to have a new name. You cannot update the Job because these fields are not updatable. Therefore, you delete Job old but leave its pods running, using kubectl delete jobs/old --cascade=orphan.
Before deleting it, you make a note of what selector it uses:
kubectl get job old -o yaml
The output is similar to this:
kind: Job
metadata:
name: old
...
spec:
selector:
matchLabels:
controller-uid: a8f3d00d-c6d2-11e5-9f87-42010af00002
...
Then you create a new Job with name new and you explicitly specify the same selector. Since the existing Pods have label controller-uid=a8f3d00d-c6d2-11e5-9f87-42010af00002, they are controlled by Job new as well.
You need to specify manualSelector: true in the new Job since you are not using the selector that the system normally generates for you automatically.
kind: Job
metadata:
name: new
...
spec:
manualSelector: true
selector:
matchLabels:
controller-uid: a8f3d00d-c6d2-11e5-9f87-42010af00002
...
The new Job itself will have a different uid from a8f3d00d-c6d2-11e5-9f87-42010af00002. Setting manualSelector: true tells the system that you know what you are doing and to allow this mismatch.
Job tracking with finalizers
Kubernetes v1.23 [beta]
In order to use this behavior, you must enable the JobTrackingWithFinalizers
feature gate
on the API server
and the controller manager.
It is enabled by default.
When enabled, the control plane tracks new Jobs using the behavior described below. Jobs created before the feature was enabled are unaffected. As a user, the only difference you would see is that the control plane tracking of Job completion is more accurate.
When this feature isn't enabled, the Job Controller relies on counting the Pods that exist in the cluster to track the Job status, that is, to keep the counters for succeeded and failed Pods.
However, Pods can be removed for a number of reasons, including:
- The garbage collector that removes orphan Pods when a Node goes down.
- The garbage collector that removes finished Pods (in Succeeded or Failed phase) after a threshold.
- Human intervention to delete Pods belonging to a Job.
- An external controller (not provided as part of Kubernetes) that removes or replaces Pods.
If you enable the JobTrackingWithFinalizers
feature for your cluster, the
control plane keeps track of the Pods that belong to any Job and notices if any
such Pod is removed from the API server. To do that, the Job controller creates Pods with
the finalizer batch.kubernetes.io/job-tracking
. The controller removes the
finalizer only after the Pod has been accounted for in the Job status, allowing
the Pod to be removed by other controllers or users.
The Job controller uses the new algorithm for new Jobs only. Jobs created
before the feature is enabled are unaffected. You can determine if the Job
controller is tracking a Job using Pod finalizers by checking if the Job has the
annotation batch.kubernetes.io/job-tracking
. You should not manually add
or remove this annotation from Jobs.
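For example, one way to check for the annotation (myjob is a placeholder Job name; an empty result means the annotation is absent):
kubectl get job myjob -o jsonpath='{.metadata.annotations.batch\.kubernetes\.io/job-tracking}'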
Alternatives
Bare Pods
When the node that a Pod is running on reboots or fails, the pod is terminated and will not be restarted. However, a Job will create new Pods to replace terminated ones. For this reason, we recommend that you use a Job rather than a bare Pod, even if your application requires only a single Pod.
Replication Controller
Jobs are complementary to Replication Controllers. A Replication Controller manages Pods which are not expected to terminate (e.g. web servers), and a Job manages Pods that are expected to terminate (e.g. batch tasks).
As discussed in Pod Lifecycle, Job is only appropriate for pods with RestartPolicy equal to OnFailure or Never. (Note: If RestartPolicy is not set, the default value is Always.)
Single Job starts controller Pod
Another pattern is for a single Job to create a Pod which then creates other Pods, acting as a sort of custom controller for those Pods. This allows the most flexibility, but may be somewhat complicated to get started with and offers less integration with Kubernetes.
One example of this pattern would be a Job which starts a Pod which runs a script that in turn starts a Spark master controller (see spark example), runs a spark driver, and then cleans up.
An advantage of this approach is that the overall process gets the completion guarantee of a Job object, but maintains complete control over what Pods are created and how work is assigned to them.
What's next
- Learn about Pods.
- Read about different ways of running Jobs:
- Coarse Parallel Processing Using a Work Queue
- Fine Parallel Processing Using a Work Queue
- Use an indexed Job for parallel processing with static work assignment (beta)
- Create multiple Jobs based on a template: Parallel Processing using Expansions
- Follow the links within Clean up finished jobs automatically to learn more about how your cluster can clean up completed and / or failed tasks.
- Job is part of the Kubernetes REST API. Read the Job object definition to understand the API for jobs.
- Read about CronJob, which you can use to define a series of Jobs that will run based on a schedule, similar to the Unix tool cron.
3.4.2.6 - Automatic Clean-up for Finished Jobs
Kubernetes v1.23 [stable]
The TTL-after-finished controller provides a TTL (time to live) mechanism to limit the lifetime of resource objects that have finished execution. The TTL-after-finished controller only handles Jobs.
TTL-after-finished Controller
The TTL-after-finished controller is only supported for Jobs. A cluster operator can use this feature to clean up finished Jobs (either Complete or Failed) automatically by specifying the .spec.ttlSecondsAfterFinished field of a Job, as in this example.
The TTL-after-finished controller will assume that a job is eligible to be cleaned up
TTL seconds after the job has finished, in other words, when the TTL has expired. When the
TTL-after-finished controller cleans up a job, it will delete it cascadingly, that is to say it will delete
its dependent objects together with it. Note that when the job is deleted,
its lifecycle guarantees, such as finalizers, will be honored.
The TTL seconds can be set at any time. Here are some examples for setting the
.spec.ttlSecondsAfterFinished
field of a Job:
- Specify this field in the job manifest, so that a Job can be cleaned up automatically some time after it finishes.
- Set this field of existing, already finished jobs, to adopt this new feature (see the example patch after this list).
- Use a mutating admission webhook to set this field dynamically at job creation time. Cluster administrators can use this to enforce a TTL policy for finished jobs.
- Use a mutating admission webhook to set this field dynamically after the job has finished, and choose different TTL values based on job status, labels, etc.
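For example, adopting the feature on an existing Job could look like the following sketch; the Job name and the one-hour TTL are placeholders:
# A finished Job with this field set becomes eligible for deletion once the TTL expires.
kubectl patch job/old-job --type=strategic --patch '{"spec":{"ttlSecondsAfterFinished":3600}}'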
Caveat
Updating TTL Seconds
Note that the TTL period, e.g. the .spec.ttlSecondsAfterFinished field of Jobs, can be modified after the job is created or has finished. However, once the Job becomes eligible to be deleted (when the TTL has expired), the system won't guarantee that the Jobs will be kept, even if an update to extend the TTL returns a successful API response.
Time Skew
Because the TTL-after-finished controller uses timestamps stored in the Kubernetes jobs to determine whether the TTL has expired or not, this feature is sensitive to time skew in the cluster, which may cause the TTL-after-finished controller to clean up job objects at the wrong time.
Clocks aren't always correct, but the difference should be very small. Please be aware of this risk when setting a non-zero TTL.
What's next
3.4.2.7 - CronJob
Kubernetes v1.21 [stable]
A CronJob creates Jobs on a repeating schedule.
One CronJob object is like one line of a crontab (cron table) file. It runs a job periodically on a given schedule, written in Cron format.
All CronJob schedule: times are based on the timezone of the kube-controller-manager.
If your control plane runs the kube-controller-manager in Pods or bare containers, the timezone set for the kube-controller-manager container determines the timezone that the cron job controller uses.
The v1 CronJob API does not officially support setting timezone as explained above.
Setting variables such as CRON_TZ or TZ is not officially supported by the Kubernetes project. CRON_TZ or TZ is an implementation detail of the internal library being used for parsing and calculating the next Job creation time. Any usage of it is not recommended in a production cluster.
When creating the manifest for a CronJob resource, make sure the name you provide is a valid DNS subdomain name. The name must be no longer than 52 characters. This is because the CronJob controller will automatically append 11 characters to the job name provided and there is a constraint that the maximum length of a Job name is no more than 63 characters.
CronJob
CronJobs are meant for performing regular scheduled actions such as backups, report generation, and so on. Each of those tasks should be configured to recur indefinitely (for example: once a day / week / month); you can define the point in time within that interval when the job should start.
Example
This example CronJob manifest prints the current time and a hello message every minute:
apiVersion: batch/v1
kind: CronJob
metadata:
name: hello
spec:
schedule: "* * * * *"
jobTemplate:
spec:
template:
spec:
containers:
- name: hello
image: busybox:1.28
imagePullPolicy: IfNotPresent
command:
- /bin/sh
- -c
- date; echo Hello from the Kubernetes cluster
restartPolicy: OnFailure
(Running Automated Tasks with a CronJob takes you through this example in more detail).
Cron schedule syntax
# ┌───────────── minute (0 - 59)
# │ ┌───────────── hour (0 - 23)
# │ │ ┌───────────── day of the month (1 - 31)
# │ │ │ ┌───────────── month (1 - 12)
# │ │ │ │ ┌───────────── day of the week (0 - 6) (Sunday to Saturday;
# │ │ │ │ │ 7 is also Sunday on some systems)
# │ │ │ │ │ OR sun, mon, tue, wed, thu, fri, sat
# │ │ │ │ │
# * * * * *
Entry | Description | Equivalent to |
---|---|---|
@yearly (or @annually) | Run once a year at midnight of 1 January | 0 0 1 1 * |
@monthly | Run once a month at midnight of the first day of the month | 0 0 1 * * |
@weekly | Run once a week at midnight on Sunday morning | 0 0 * * 0 |
@daily (or @midnight) | Run once a day at midnight | 0 0 * * * |
@hourly | Run once an hour at the beginning of the hour | 0 * * * * |
For example, the line below states that the task must be started every Friday at midnight, as well as on the 13th of each month at midnight:
0 0 13 * 5
To generate CronJob schedule expressions, you can also use web tools like crontab.guru.
CronJob limitations
A cron job creates a job object about once per execution time of its schedule. We say "about" because there are certain circumstances where two jobs might be created, or no job might be created. We attempt to make these rare, but do not completely prevent them. Therefore, jobs should be idempotent.
If startingDeadlineSeconds is set to a large value or left unset (the default) and if concurrencyPolicy is set to Allow, the jobs will always run at least once.
If startingDeadlineSeconds is set to a value less than 10 seconds, the CronJob may not be scheduled. This is because the CronJob controller checks things every 10 seconds.
For every CronJob, the CronJob Controller checks how many schedules it missed in the duration from its last scheduled time until now. If there are more than 100 missed schedules, then it does not start the job and logs the error
Cannot determine if job needs to be started. Too many missed start time (> 100). Set or decrease .spec.startingDeadlineSeconds or check clock skew.
It is important to note that if the startingDeadlineSeconds field is set (not nil), the controller counts how many missed jobs occurred from the value of startingDeadlineSeconds until now rather than from the last scheduled time until now. For example, if startingDeadlineSeconds is 200, the controller counts how many missed jobs occurred in the last 200 seconds.
A CronJob is counted as missed if it has failed to be created at its scheduled time. For example, if concurrencyPolicy is set to Forbid and a CronJob was attempted to be scheduled when there was a previous schedule still running, then it would count as missed.
For example, suppose a CronJob is set to schedule a new Job every one minute beginning at 08:30:00, and its startingDeadlineSeconds field is not set. If the CronJob controller happens to be down from 08:29:00 to 10:21:00, the job will not start, as the number of missed jobs which missed their schedule is greater than 100.
To illustrate this concept further, suppose a CronJob is set to schedule a new Job every one minute beginning at 08:30:00, and its startingDeadlineSeconds is set to 200 seconds. If the CronJob controller happens to be down for the same period as the previous example (08:29:00 to 10:21:00), the Job will still start at 10:22:00. This happens as the controller now checks how many missed schedules happened in the last 200 seconds (i.e. 3 missed schedules), rather than from the last scheduled time until now.
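To make these fields concrete, here is a minimal sketch of a CronJob that combines them; the name, schedule, and image are placeholders:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: report
spec:
  schedule: "*/5 * * * *"        # every five minutes
  startingDeadlineSeconds: 200   # count misses only over the last 200 seconds
  concurrencyPolicy: Forbid      # skip a run if the previous Job is still active
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: report
            image: busybox:1.28
            command: ["sh", "-c", "date; echo generating report"]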
The CronJob is only responsible for creating Jobs that match its schedule, and the Job in turn is responsible for the management of the Pods it represents.
Controller version
Starting with Kubernetes v1.21 the second version of the CronJob controller is the default implementation. To disable the default CronJob controller and use the original CronJob controller instead, pass the CronJobControllerV2 feature gate flag to the kube-controller-manager, and set this flag to false. For example:
--feature-gates="CronJobControllerV2=false"
What's next
- Learn about Pods and Jobs, two concepts that CronJobs rely upon.
- Read about the format of CronJob .spec.schedule fields.
- For instructions on creating and working with CronJobs, and for an example of a CronJob manifest, see Running automated tasks with CronJobs.
- For instructions to clean up failed or completed jobs automatically, see Clean up Jobs automatically.
CronJob is part of the Kubernetes REST API. Read the CronJob object definition to understand the API for Kubernetes cron jobs.
3.4.2.8 - ReplicationController
Note: A Deployment that configures a ReplicaSet is now the recommended way to set up replication.
A ReplicationController ensures that a specified number of pod replicas are running at any one time. In other words, a ReplicationController makes sure that a pod or a homogeneous set of pods is always up and available.
How a ReplicationController Works
If there are too many pods, the ReplicationController terminates the extra pods. If there are too few, the ReplicationController starts more pods. Unlike manually created pods, the pods maintained by a ReplicationController are automatically replaced if they fail, are deleted, or are terminated. For example, your pods are re-created on a node after disruptive maintenance such as a kernel upgrade. For this reason, you should use a ReplicationController even if your application requires only a single pod. A ReplicationController is similar to a process supervisor, but instead of supervising individual processes on a single node, the ReplicationController supervises multiple pods across multiple nodes.
ReplicationController is often abbreviated to "rc" in discussion, and as a shortcut in kubectl commands.
A simple case is to create one ReplicationController object to reliably run one instance of a Pod indefinitely. A more complex use case is to run several identical replicas of a replicated service, such as web servers.
Running an example ReplicationController
This example ReplicationController config runs three copies of the nginx web server.
apiVersion: v1
kind: ReplicationController
metadata:
name: nginx
spec:
replicas: 3
selector:
app: nginx
template:
metadata:
name: nginx
labels:
app: nginx
spec:
containers:
- name: nginx
image: nginx
ports:
- containerPort: 80
Run the example ReplicationController by downloading the example file and then running this command:
kubectl apply -f https://k8s.io/examples/controllers/replication.yaml
The output is similar to this:
replicationcontroller/nginx created
Check on the status of the ReplicationController using this command:
kubectl describe replicationcontrollers/nginx
The output is similar to this:
Name: nginx
Namespace: default
Selector: app=nginx
Labels: app=nginx
Annotations: <none>
Replicas: 3 current / 3 desired
Pods Status: 0 Running / 3 Waiting / 0 Succeeded / 0 Failed
Pod Template:
Labels: app=nginx
Containers:
nginx:
Image: nginx
Port: 80/TCP
Environment: <none>
Mounts: <none>
Volumes: <none>
Events:
FirstSeen LastSeen Count From SubobjectPath Type Reason Message
--------- -------- ----- ---- ------------- ---- ------ -------
20s 20s 1 {replication-controller } Normal SuccessfulCreate Created pod: nginx-qrm3m
20s 20s 1 {replication-controller } Normal SuccessfulCreate Created pod: nginx-3ntk0
20s 20s 1 {replication-controller } Normal SuccessfulCreate Created pod: nginx-4ok8v
Here, three pods are created, but none is running yet, perhaps because the image is being pulled. A little later, the same command may show:
Pods Status: 3 Running / 0 Waiting / 0 Succeeded / 0 Failed
To list all the pods that belong to the ReplicationController in a machine readable form, you can use a command like this:
pods=$(kubectl get pods --selector=app=nginx --output=jsonpath={.items..metadata.name})
echo $pods
The output is similar to this:
nginx-3ntk0 nginx-4ok8v nginx-qrm3m
Here, the selector is the same as the selector for the ReplicationController (seen in the kubectl describe output), and in a different form in replication.yaml. The --output=jsonpath option specifies an expression with the name from each pod in the returned list.
Writing a ReplicationController Spec
As with all other Kubernetes config, a ReplicationController needs apiVersion
, kind
, and metadata
fields.
The name of a ReplicationController object must be a valid
DNS subdomain name.
For general information about working with configuration files, see object management.
A ReplicationController also needs a .spec
section.
Pod Template
The .spec.template is the only required field of the .spec.
The .spec.template is a pod template. It has exactly the same schema as a Pod, except it is nested and does not have an apiVersion or kind.
In addition to required fields for a Pod, a pod template in a ReplicationController must specify appropriate labels and an appropriate restart policy. For labels, make sure not to overlap with other controllers. See pod selector.
Only a .spec.template.spec.restartPolicy equal to Always is allowed, which is the default if not specified.
For local container restarts, ReplicationControllers delegate to an agent on the node, for example the Kubelet or Docker.
Labels on the ReplicationController
The ReplicationController can itself have labels (.metadata.labels
). Typically, you
would set these the same as the .spec.template.metadata.labels
; if .metadata.labels
is not specified
then it defaults to .spec.template.metadata.labels
. However, they are allowed to be
different, and the .metadata.labels
do not affect the behavior of the ReplicationController.
Pod Selector
The .spec.selector
field is a label selector. A ReplicationController
manages all the pods with labels that match the selector. It does not distinguish
between pods that it created or deleted and pods that another person or process created or
deleted. This allows the ReplicationController to be replaced without affecting the running pods.
If specified, the .spec.template.metadata.labels
must be equal to the .spec.selector
, or it will
be rejected by the API. If .spec.selector
is unspecified, it will be defaulted to
.spec.template.metadata.labels
.
Also you should not normally create any pods whose labels match this selector, either directly, with another ReplicationController, or with another controller such as Job. If you do so, the ReplicationController thinks that it created the other pods. Kubernetes does not stop you from doing this.
If you do end up with multiple controllers that have overlapping selectors, you will have to manage the deletion yourself (see below).
Multiple Replicas
You can specify how many pods should run concurrently by setting .spec.replicas
to the number
of pods you would like to have running concurrently. The number running at any time may be higher
or lower, such as if the replicas were just increased or decreased, or if a pod is gracefully
shutdown, and a replacement starts early.
If you do not specify .spec.replicas
, then it defaults to 1.
Working with ReplicationControllers
Deleting a ReplicationController and its Pods
To delete a ReplicationController and all its pods, use kubectl delete
. Kubectl will scale the ReplicationController to zero and wait
for it to delete each pod before deleting the ReplicationController itself. If this kubectl
command is interrupted, it can be restarted.
When using the REST API or Go client library, you need to do the steps explicitly (scale replicas to 0, wait for pod deletions, then delete the ReplicationController).
Deleting only a ReplicationController
You can delete a ReplicationController without affecting any of its pods.
Using kubectl, specify the --cascade=orphan
option to kubectl delete
.
When using the REST API or Go client library, you can delete the ReplicationController object.
Once the original is deleted, you can create a new ReplicationController to replace it. As long
as the old and new .spec.selector
are the same, then the new one will adopt the old pods.
However, it will not make any effort to make existing pods match a new, different pod template.
To update pods to a new spec in a controlled way, use a rolling update.
Isolating pods from a ReplicationController
Pods may be removed from a ReplicationController's target set by changing their labels. This technique may be used to remove pods from service for debugging and data recovery. Pods that are removed in this way will be replaced automatically (assuming that the number of replicas is not also changed).
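For example, using the nginx ReplicationController from earlier, you could detach one of its pods for debugging by overwriting the label that the selector matches (the pod name comes from the earlier output):
# The pod no longer matches app=nginx, so the ReplicationController
# creates a replacement; the original pod keeps running for inspection.
kubectl label pod nginx-qrm3m app=debug --overwrite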
Common usage patterns
Rescheduling
As mentioned above, whether you have 1 pod you want to keep running, or 1000, a ReplicationController will ensure that the specified number of pods exists, even in the event of node failure or pod termination (for example, due to an action by another control agent).
Scaling
The ReplicationController enables scaling the number of replicas up or down, either manually or by an auto-scaling control agent, by updating the replicas
field.
Rolling updates
The ReplicationController is designed to facilitate rolling updates to a service by replacing pods one-by-one.
As explained in #1353, the recommended approach is to create a new ReplicationController with 1 replica, scale the new (+1) and old (-1) controllers one by one, and then delete the old controller after it reaches 0 replicas. This predictably updates the set of pods regardless of unexpected failures.
Ideally, the rolling update controller would take application readiness into account, and would ensure that a sufficient number of pods were productively serving at any given time.
The two ReplicationControllers would need to create pods with at least one differentiating label, such as the image tag of the primary container of the pod, since it is typically image updates that motivate rolling updates.
Multiple release tracks
In addition to running multiple releases of an application while a rolling update is in progress, it's common to run multiple releases for an extended period of time, or even continuously, using multiple release tracks. The tracks would be differentiated by labels.
For instance, a service might target all pods with tier in (frontend), environment in (prod). Now say you have 10 replicated pods that make up this tier. But you want to be able to 'canary' a new version of this component. You could set up a ReplicationController with replicas set to 9 for the bulk of the replicas, with labels tier=frontend, environment=prod, track=stable, and another ReplicationController with replicas set to 1 for the canary, with labels tier=frontend, environment=prod, track=canary. Now the service is covering both the canary and non-canary pods. But you can mess with the ReplicationControllers separately to test things out, monitor the results, etc.
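A minimal sketch of the canary ReplicationController described above; the names and image are placeholders, and the stable controller would mirror this with replicas: 9 and track: stable:
apiVersion: v1
kind: ReplicationController
metadata:
  name: frontend-canary
spec:
  replicas: 1
  selector:
    tier: frontend
    environment: prod
    track: canary
  template:
    metadata:
      labels:
        tier: frontend
        environment: prod
        track: canary
    spec:
      containers:
      - name: frontend
        image: nginx   # placeholder; the canary would run the new release
        ports:
        - containerPort: 80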
Using ReplicationControllers with Services
Multiple ReplicationControllers can sit behind a single service, so that, for example, some traffic goes to the old version, and some goes to the new version.
A ReplicationController will never terminate on its own, but it isn't expected to be as long-lived as services. Services may be composed of pods controlled by multiple ReplicationControllers, and it is expected that many ReplicationControllers may be created and destroyed over the lifetime of a service (for instance, to perform an update of pods that run the service). Both services themselves and their clients should remain oblivious to the ReplicationControllers that maintain the pods of the services.
Writing programs for Replication
Pods created by a ReplicationController are intended to be fungible and semantically identical, though their configurations may become heterogeneous over time. This is an obvious fit for replicated stateless servers, but ReplicationControllers can also be used to maintain availability of master-elected, sharded, and worker-pool applications. Such applications should use dynamic work assignment mechanisms, such as the RabbitMQ work queues, as opposed to static/one-time customization of the configuration of each pod, which is considered an anti-pattern. Any pod customization performed, such as vertical auto-sizing of resources (for example, cpu or memory), should be performed by another online controller process, not unlike the ReplicationController itself.
Responsibilities of the ReplicationController
The ReplicationController ensures that the desired number of pods matches its label selector and are operational. Currently, only terminated pods are excluded from its count. In the future, readiness and other information available from the system may be taken into account, we may add more controls over the replacement policy, and we plan to emit events that could be used by external clients to implement arbitrarily sophisticated replacement and/or scale-down policies.
The ReplicationController is forever constrained to this narrow responsibility. It itself will not perform readiness nor liveness probes. Rather than performing auto-scaling, it is intended to be controlled by an external auto-scaler (as discussed in #492), which would change its replicas
field. We will not add scheduling policies (for example, spreading) to the ReplicationController. Nor should it verify that the pods controlled match the currently specified template, as that would obstruct auto-sizing and other automated processes. Similarly, completion deadlines, ordering dependencies, configuration expansion, and other features belong elsewhere. We even plan to factor out the mechanism for bulk pod creation (#170).
The ReplicationController is intended to be a composable building-block primitive. We expect higher-level APIs and/or tools to be built on top of it and other complementary primitives for user convenience in the future. The "macro" operations currently supported by kubectl (run, scale) are proof-of-concept examples of this. For instance, we could imagine something like Asgard managing ReplicationControllers, auto-scalers, services, scheduling policies, canaries, etc.
API Object
Replication controller is a top-level resource in the Kubernetes REST API. More details about the API object can be found at: ReplicationController API object.
Alternatives to ReplicationController
ReplicaSet
ReplicaSet
is the next-generation ReplicationController that supports the new set-based label selector.
It's mainly used by Deployment as a mechanism to orchestrate pod creation, deletion and updates.
Note that we recommend using Deployments instead of directly using Replica Sets, unless you require custom update orchestration or don't require updates at all.
Deployment (Recommended)
Deployment
is a higher-level API object that updates its underlying Replica Sets and their Pods. Deployments are recommended if you want the rolling update functionality, because they are declarative, server-side, and have additional features.
Bare Pods
Unlike in the case where a user directly created pods, a ReplicationController replaces pods that are deleted or terminated for any reason, such as in the case of node failure or disruptive node maintenance, such as a kernel upgrade. For this reason, we recommend that you use a ReplicationController even if your application requires only a single pod. Think of it similarly to a process supervisor, only it supervises multiple pods across multiple nodes instead of individual processes on a single node. A ReplicationController delegates local container restarts to some agent on the node (for example, Kubelet or Docker).
Job
Use a Job
instead of a ReplicationController for pods that are expected to terminate on their own
(that is, batch jobs).
DaemonSet
Use a DaemonSet
instead of a ReplicationController for pods that provide a
machine-level function, such as machine monitoring or machine logging. These pods have a lifetime that is tied
to a machine lifetime: the pod needs to be running on the machine before other pods start, and are
safe to terminate when the machine is otherwise ready to be rebooted/shutdown.
What's next
- Learn about Pods.
- Learn about Deployment, the replacement for ReplicationController.
ReplicationController
is part of the Kubernetes REST API. Read the ReplicationController object definition to understand the API for replication controllers.
3.5 - Services, Load Balancing, and Networking
The Kubernetes network model
Every Pod
gets its own IP address.
This means you do not need to explicitly create links between Pods
and you
almost never need to deal with mapping container ports to host ports.
This creates a clean, backwards-compatible model where Pods
can be treated
much like VMs or physical hosts from the perspectives of port allocation,
naming, service discovery, load balancing, application configuration,
and migration.
Kubernetes imposes the following fundamental requirements on any networking implementation (barring any intentional network segmentation policies):
- pods on a node can communicate with all pods on all nodes without NAT
- agents on a node (e.g. system daemons, kubelet) can communicate with all pods on that node
Note: For those platforms that support Pods
running in the host network (e.g.
Linux):
- pods in the host network of a node can communicate with all pods on all nodes without NAT
This model is not only less complex overall, but it is principally compatible with the desire for Kubernetes to enable low-friction porting of apps from VMs to containers. If your job previously ran in a VM, your VM had an IP and could talk to other VMs in your project. This is the same basic model.
Kubernetes IP addresses exist at the Pod
scope - containers within a Pod
share their network namespaces - including their IP address and MAC address.
This means that containers within a Pod
can all reach each other's ports on
localhost
. This also means that containers within a Pod
must coordinate port
usage, but this is no different from processes in a VM. This is called the
"IP-per-pod" model.
How this is implemented is a detail of the particular container runtime in use.
It is possible to request ports on the Node
itself which forward to your Pod
(called host ports), but this is a very niche operation. How that forwarding is
implemented is also a detail of the container runtime. The Pod
itself is
blind to the existence or non-existence of host ports.
Kubernetes networking addresses four concerns:
- Containers within a Pod use networking to communicate via loopback.
- Cluster networking provides communication between different Pods.
- The Service resource lets you expose an application running in Pods to be reachable from outside your cluster.
- You can also use Services to publish services only for consumption inside your cluster.
3.5.1 - Service
An abstract way to expose an application running on a set of Pods as a network service. With Kubernetes you don't need to modify your application to use an unfamiliar service discovery mechanism. Kubernetes gives Pods their own IP addresses and a single DNS name for a set of Pods, and can load-balance across them.
Motivation
Kubernetes Pods are created and destroyed to match the state of your cluster. Pods are nonpermanent resources. If you use a Deployment to run your app, it can create and destroy Pods dynamically.
Each Pod gets its own IP address, however in a Deployment, the set of Pods running in one moment in time could be different from the set of Pods running that application a moment later.
This leads to a problem: if some set of Pods (call them "backends") provides functionality to other Pods (call them "frontends") inside your cluster, how do the frontends find out and keep track of which IP address to connect to, so that the frontend can use the backend part of the workload?
Enter Services.
Service resources
In Kubernetes, a Service is an abstraction which defines a logical set of Pods and a policy by which to access them (sometimes this pattern is called a micro-service). The set of Pods targeted by a Service is usually determined by a selector. To learn about other ways to define Service endpoints, see Services without selectors.
For example, consider a stateless image-processing backend which is running with 3 replicas. Those replicas are fungible—frontends do not care which backend they use. While the actual Pods that compose the backend set may change, the frontend clients should not need to be aware of that, nor should they need to keep track of the set of backends themselves.
The Service abstraction enables this decoupling.
Cloud-native service discovery
If you're able to use Kubernetes APIs for service discovery in your application, you can query the API server for Endpoints, that get updated whenever the set of Pods in a Service changes.
For non-native applications, Kubernetes offers ways to place a network port or load balancer in between your application and the backend Pods.
Defining a Service
A Service in Kubernetes is a REST object, similar to a Pod. Like all of the
REST objects, you can POST
a Service definition to the API server to create
a new instance.
The name of a Service object must be a valid
RFC 1035 label name.
For example, suppose you have a set of Pods where each listens on TCP port 9376
and contains a label app=MyApp
:
apiVersion: v1
kind: Service
metadata:
name: my-service
spec:
selector:
app: MyApp
ports:
- protocol: TCP
port: 80
targetPort: 9376
This specification creates a new Service object named "my-service", which
targets TCP port 9376 on any Pod with the app=MyApp
label.
Kubernetes assigns this Service an IP address (sometimes called the "cluster IP"), which is used by the Service proxies (see Virtual IPs and service proxies below).
The controller for the Service selector continuously scans for Pods that match its selector, and then POSTs any updates to an Endpoint object also named "my-service".
A Service can map any incoming port to a targetPort. By default and for convenience, the targetPort is set to the same value as the port field.
Port definitions in Pods have names, and you can reference these names in the
targetPort
attribute of a Service. For example, we can bind the targetPort
of the Service to the Pod port in the following way:
apiVersion: v1
kind: Pod
metadata:
name: nginx
labels:
app.kubernetes.io/name: proxy
spec:
containers:
- name: nginx
image: nginx:1.14.2
ports:
- containerPort: 80
name: http-web-svc
---
apiVersion: v1
kind: Service
metadata:
name: nginx-service
spec:
selector:
app.kubernetes.io/name: proxy
ports:
- name: name-of-service-port
protocol: TCP
port: 80
targetPort: http-web-svc
This works even if there is a mixture of Pods in the Service using a single configured name, with the same network protocol available via different port numbers. This offers a lot of flexibility for deploying and evolving your Services. For example, you can change the port numbers that Pods expose in the next version of your backend software, without breaking clients.
The default protocol for Services is TCP; you can also use any other supported protocol.
As many Services need to expose more than one port, Kubernetes supports multiple
port definitions on a Service object.
Each port definition can have the same protocol
, or a different one.
Services without selectors
Services most commonly abstract access to Kubernetes Pods, but they can also abstract other kinds of backends. For example:
- You want to have an external database cluster in production, but in your test environment you use your own databases.
- You want to point your Service to a Service in a different Namespace or on another cluster.
- You are migrating a workload to Kubernetes. While evaluating the approach, you run only a portion of your backends in Kubernetes.
In any of these scenarios you can define a Service without a Pod selector. For example:
apiVersion: v1
kind: Service
metadata:
name: my-service
spec:
ports:
- protocol: TCP
port: 80
targetPort: 9376
Because this Service has no selector, the corresponding Endpoints object is not created automatically. You can manually map the Service to the network address and port where it's running, by adding an Endpoints object manually:
apiVersion: v1
kind: Endpoints
metadata:
name: my-service
subsets:
- addresses:
- ip: 192.0.2.42
ports:
- port: 9376
The name of the Endpoints object must be a valid DNS subdomain name.
The endpoint IPs must not be: loopback (127.0.0.0/8 for IPv4, ::1/128 for IPv6), or link-local (169.254.0.0/16 and 224.0.0.0/24 for IPv4, fe80::/64 for IPv6).
Endpoint IP addresses cannot be the cluster IPs of other Kubernetes Services, because kube-proxy doesn't support virtual IPs as a destination.
Accessing a Service without a selector works the same as if it had a selector.
In the example above, traffic is routed to the single endpoint defined in
the YAML: 192.0.2.42:9376
(TCP).
The Kubernetes API server does not allow proxying to endpoints that are not mapped to pods. Actions such as kubectl proxy <service-name> where the service has no selector will fail due to this constraint. This prevents the Kubernetes API server from being used as a proxy to endpoints the caller may not be authorized to access.
An ExternalName Service is a special case of Service that does not have selectors and uses DNS names instead. For more information, see the ExternalName section later in this document.
Over Capacity Endpoints
If an Endpoints resource has more than 1000 endpoints then a Kubernetes v1.22 (or later)
cluster annotates that Endpoints with endpoints.kubernetes.io/over-capacity: truncated
.
This annotation indicates that the affected Endpoints object is over capacity and that
the endpoints controller has truncated the number of endpoints to 1000.
EndpointSlices
Kubernetes v1.21 [stable]
EndpointSlices are an API resource that can provide a more scalable alternative to Endpoints. Although conceptually quite similar to Endpoints, EndpointSlices allow for distributing network endpoints across multiple resources. By default, an EndpointSlice is considered "full" once it reaches 100 endpoints, at which point additional EndpointSlices will be created to store any additional endpoints.
EndpointSlices provide additional attributes and functionality which is described in detail in EndpointSlices.
Application protocol
Kubernetes v1.20 [stable]
The appProtocol
field provides a way to specify an application protocol for
each Service port. The value of this field is mirrored by the corresponding
Endpoints and EndpointSlice objects.
This field follows standard Kubernetes label syntax. Values should either be
IANA standard service names or
domain prefixed names such as mycompany.com/my-custom-protocol
.
Virtual IPs and service proxies
Every node in a Kubernetes cluster runs a kube-proxy
. kube-proxy
is
responsible for implementing a form of virtual IP for Services
of type other
than ExternalName
.
Why not use round-robin DNS?
A question that pops up every now and then is why Kubernetes relies on proxying to forward inbound traffic to backends. What about other approaches? For example, would it be possible to configure DNS records that have multiple A values (or AAAA for IPv6), and rely on round-robin name resolution?
There are a few reasons for using proxying for Services:
- There is a long history of DNS implementations not respecting record TTLs, and caching the results of name lookups after they should have expired.
- Some apps do DNS lookups only once and cache the results indefinitely.
- Even if apps and libraries did proper re-resolution, the low or zero TTLs on the DNS records could impose a high load on DNS that then becomes difficult to manage.
Later in this page you can read about how various kube-proxy implementations work. Overall, you should note that, when running kube-proxy, kernel level rules may be modified (for example, iptables rules might get created), which won't get cleaned up, in some cases until you reboot. Thus, running kube-proxy is something that should only be done by an administrator who understands the consequences of having a low level, privileged network proxying service on a computer. Although the kube-proxy executable supports a cleanup function, this function is not an official feature and thus is only available to use as-is.
Configuration
Note that the kube-proxy starts up in different modes, which are determined by its configuration.
- The kube-proxy's configuration is done via a ConfigMap, and the ConfigMap for kube-proxy effectively deprecates the behaviour for almost all of the flags for the kube-proxy.
- The ConfigMap for the kube-proxy does not support live reloading of configuration.
- The ConfigMap parameters for the kube-proxy cannot all be validated and verified on startup. For example, if your operating system doesn't allow you to run iptables commands, the standard kernel kube-proxy implementation will not work. Likewise, if you have an operating system which doesn't support netsh, it will not run in Windows userspace mode.
User space proxy mode
In this (legacy) mode, kube-proxy watches the Kubernetes control plane for the addition and
removal of Service and Endpoint objects. For each Service it opens a
port (randomly chosen) on the local node. Any connections to this "proxy port"
are proxied to one of the Service's backend Pods (as reported via
Endpoints). kube-proxy takes the SessionAffinity
setting of the Service into
account when deciding which backend Pod to use.
Lastly, the user-space proxy installs iptables rules which capture traffic to
the Service's clusterIP
(which is virtual) and port
. The rules
redirect that traffic to the proxy port which proxies the backend Pod.
By default, kube-proxy in userspace mode chooses a backend via a round-robin algorithm.
iptables
proxy mode
In this mode, kube-proxy watches the Kubernetes control plane for the addition and
removal of Service and Endpoint objects. For each Service, it installs
iptables rules, which capture traffic to the Service's clusterIP
and port
,
and redirect that traffic to one of the Service's
backend sets. For each Endpoint object, it installs iptables rules which
select a backend Pod.
By default, kube-proxy in iptables mode chooses a backend at random.
Using iptables to handle traffic has a lower system overhead, because traffic is handled by Linux netfilter without the need to switch between userspace and the kernel space. This approach is also likely to be more reliable.
If kube-proxy is running in iptables mode and the first Pod that's selected does not respond, the connection fails. This is different from userspace mode: in that scenario, kube-proxy would detect that the connection to the first Pod had failed and would automatically retry with a different backend Pod.
You can use Pod readiness probes to verify that backend Pods are working OK, so that kube-proxy in iptables mode only sees backends that test out as healthy. Doing this means you avoid having traffic sent via kube-proxy to a Pod that's known to have failed.
IPVS proxy mode
Kubernetes v1.11 [stable]
In ipvs
mode, kube-proxy watches Kubernetes Services and Endpoints,
calls netlink
interface to create IPVS rules accordingly and synchronizes
IPVS rules with Kubernetes Services and Endpoints periodically.
This control loop ensures that IPVS status matches the desired
state.
When accessing a Service, IPVS directs traffic to one of the backend Pods.
The IPVS proxy mode is based on netfilter hook function that is similar to iptables mode, but uses a hash table as the underlying data structure and works in the kernel space. That means kube-proxy in IPVS mode redirects traffic with lower latency than kube-proxy in iptables mode, with much better performance when synchronising proxy rules. Compared to the other proxy modes, IPVS mode also supports a higher throughput of network traffic.
IPVS provides more options for balancing traffic to backend Pods; these are:
rr
: round-robinlc
: least connection (smallest number of open connections)dh
: destination hashingsh
: source hashingsed
: shortest expected delaynq
: never queue
To run kube-proxy in IPVS mode, you must make IPVS available on the node before starting kube-proxy.
When kube-proxy starts in IPVS proxy mode, it verifies whether IPVS kernel modules are available. If the IPVS kernel modules are not detected, then kube-proxy falls back to running in iptables proxy mode.
In these proxy models, the traffic bound for the Service's IP:Port is proxied to an appropriate backend without the clients knowing anything about Kubernetes or Services or Pods.
If you want to make sure that connections from a particular client are passed to the same Pod each time, you can select the session affinity based on the client's IP addresses by setting service.spec.sessionAffinity to "ClientIP" (the default is "None"). You can also set the maximum session sticky time by setting service.spec.sessionAffinityConfig.clientIP.timeoutSeconds appropriately (the default value is 10800, which works out to be 3 hours).
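As a sketch, here is the Service from the earlier example with client-IP session affinity enabled and the default three-hour timeout stated explicitly:
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: MyApp
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800   # maximum session sticky time (3 hours)
  ports:
  - protocol: TCP
    port: 80
    targetPort: 9376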
Multi-Port Services
For some Services, you need to expose more than one port. Kubernetes lets you configure multiple port definitions on a Service object. When using multiple ports for a Service, you must give all of your ports names so that these are unambiguous. For example:
apiVersion: v1
kind: Service
metadata:
name: my-service
spec:
selector:
app: MyApp
ports:
- name: http
protocol: TCP
port: 80
targetPort: 9376
- name: https
protocol: TCP
port: 443
targetPort: 9377
As with Kubernetes names in general, names for ports must only contain lowercase alphanumeric characters and -. Port names must also start and end with an alphanumeric character.
For example, the names 123-abc and web are valid, but 123_abc and -web are not.
Choosing your own IP address
You can specify your own cluster IP address as part of a Service creation request. To do this, set the .spec.clusterIP field. For example, if you already have an existing DNS entry that you wish to reuse, or legacy systems that are configured for a specific IP address and difficult to re-configure.
The IP address that you choose must be a valid IPv4 or IPv6 address from within the service-cluster-ip-range CIDR range that is configured for the API server. If you try to create a Service with an invalid clusterIP address value, the API server will return a 422 HTTP status code to indicate that there's a problem.
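For example, a sketch that pins the earlier Service to a specific cluster IP; the address shown is an assumption and must fall inside your cluster's configured service-cluster-ip-range:
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: MyApp
  clusterIP: 10.96.0.50   # must be within the service-cluster-ip-range CIDR
  ports:
  - protocol: TCP
    port: 80
    targetPort: 9376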
Traffic policies
External traffic policy
You can set the spec.externalTrafficPolicy field to control how traffic from external sources is routed. Valid values are Cluster and Local. Set the field to Cluster to route external traffic to all ready endpoints and Local to only route to ready node-local endpoints. If the traffic policy is Local and there are no node-local endpoints, the kube-proxy does not forward any traffic for the relevant Service.
Kubernetes v1.22 [alpha]
If you enable the ProxyTerminatingEndpoints
feature gate
for the kube-proxy, the kube-proxy checks if the node
has local endpoints and whether or not all the local endpoints are marked as terminating.
If there are local endpoints and all of those are terminating, then the kube-proxy ignores
any external traffic policy of Local
. Instead, whilst the node-local endpoints remain as all
terminating, the kube-proxy forwards traffic for that Service to healthy endpoints elsewhere,
as if the external traffic policy were set to Cluster
.
This forwarding behavior for terminating endpoints exists to allow external load balancers to
gracefully drain connections that are backed by NodePort
Services, even when the health check
node port starts to fail. Otherwise, traffic can be lost between the time a node is still in the node pool of a load
balancer and traffic is being dropped during the termination period of a pod.
Internal traffic policy
Kubernetes v1.22 [beta]
You can set the spec.internalTrafficPolicy field to control how traffic from internal sources is routed. Valid values are Cluster and Local. Set the field to Cluster to route internal traffic to all ready endpoints and Local to only route to ready node-local endpoints. If the traffic policy is Local and there are no node-local endpoints, traffic is dropped by kube-proxy.
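For example, a sketch of a Service that keeps internal traffic on the same node as the client; externalTrafficPolicy is set in the same way:
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: MyApp
  internalTrafficPolicy: Local
  ports:
  - protocol: TCP
    port: 80
    targetPort: 9376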
Discovering services
Kubernetes supports 2 primary modes of finding a Service - environment variables and DNS.
Environment variables
When a Pod is run on a Node, the kubelet adds a set of environment variables
for each active Service. It adds {SVCNAME}_SERVICE_HOST
and {SVCNAME}_SERVICE_PORT
variables, where the Service name is upper-cased and dashes are converted to underscores. It also supports variables (see makeLinkVariables) that are compatible with Docker Engine's "legacy container links" feature.
For example, the Service redis-master, which exposes TCP port 6379 and has been allocated cluster IP address 10.0.0.11, produces the following environment variables:
REDIS_MASTER_SERVICE_HOST=10.0.0.11
REDIS_MASTER_SERVICE_PORT=6379
REDIS_MASTER_PORT=tcp://10.0.0.11:6379
REDIS_MASTER_PORT_6379_TCP=tcp://10.0.0.11:6379
REDIS_MASTER_PORT_6379_TCP_PROTO=tcp
REDIS_MASTER_PORT_6379_TCP_PORT=6379
REDIS_MASTER_PORT_6379_TCP_ADDR=10.0.0.11
When you have a Pod that needs to access a Service, and you are using the environment variable method to publish the port and cluster IP to the client Pods, you must create the Service before the client Pods come into existence. Otherwise, those client Pods won't have their environment variables populated.
If you only use DNS to discover the cluster IP for a Service, you don't need to worry about this ordering issue.
DNS
You can (and almost always should) set up a DNS service for your Kubernetes cluster using an add-on.
A cluster-aware DNS server, such as CoreDNS, watches the Kubernetes API for new Services and creates a set of DNS records for each one. If DNS has been enabled throughout your cluster then all Pods should automatically be able to resolve Services by their DNS name.
For example, if you have a Service called my-service in a Kubernetes namespace my-ns, the control plane and the DNS Service acting together create a DNS record for my-service.my-ns. Pods in the my-ns namespace should be able to find the service by doing a name lookup for my-service (my-service.my-ns would also work).
Pods in other namespaces must qualify the name as my-service.my-ns. These names will resolve to the cluster IP assigned for the Service.
Kubernetes also supports DNS SRV (Service) records for named ports. If the my-service.my-ns Service has a port named http with the protocol set to TCP, you can do a DNS SRV query for _http._tcp.my-service.my-ns to discover the port number for http, as well as the IP address.
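For example, from a Pod that has the dig utility available (an assumption; the busybox images used elsewhere in this page do not ship it), you could run such an SRV query like this:
dig SRV _http._tcp.my-service.my-ns.svc.cluster.local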
The Kubernetes DNS server is the only way to access ExternalName Services. You can find more information about ExternalName resolution in DNS Pods and Services.
Headless Services
Sometimes you don't need load-balancing and a single Service IP. In this case, you can create what are termed "headless" Services, by explicitly specifying "None" for the cluster IP (.spec.clusterIP).
You can use a headless Service to interface with other service discovery mechanisms, without being tied to Kubernetes' implementation.
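As a minimal sketch (reusing the assumed MyApp selector from the earlier examples), a headless Service looks like an ordinary Service with the cluster IP set to None:
apiVersion: v1
kind: Service
metadata:
  name: my-headless-service
spec:
  clusterIP: None # no cluster IP is allocated; DNS returns the backing Pod IPs directly
  selector:
    app: MyApp
  ports:
    - protocol: TCP
      port: 80
      targetPort: 9376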
For headless Services, a cluster IP is not allocated, kube-proxy does not handle these Services, and there is no load balancing or proxying done by the platform for them. How DNS is automatically configured depends on whether the Service has selectors defined:
With selectors
For headless Services that define selectors, the endpoints controller creates Endpoints records in the API, and modifies the DNS configuration to return A records (IP addresses) that point directly to the Pods backing the Service.
Without selectors
For headless Services that do not define selectors, the endpoints controller does not create Endpoints records. However, the DNS system looks for and configures either:
- CNAME records for ExternalName-type Services.
- A records for any Endpoints that share a name with the Service, for all other types.
Publishing Services (ServiceTypes)
For some parts of your application (for example, frontends) you may want to expose a Service onto an external IP address, one that's outside of your cluster.
Kubernetes ServiceTypes allow you to specify what kind of Service you want. The default is ClusterIP.
Type values and their behaviors are:
- ClusterIP: Exposes the Service on a cluster-internal IP. Choosing this value makes the Service only reachable from within the cluster. This is the default ServiceType.
- NodePort: Exposes the Service on each Node's IP at a static port (the NodePort). A ClusterIP Service, to which the NodePort Service routes, is automatically created. You'll be able to contact the NodePort Service, from outside the cluster, by requesting <NodeIP>:<NodePort>.
- LoadBalancer: Exposes the Service externally using a cloud provider's load balancer. NodePort and ClusterIP Services, to which the external load balancer routes, are automatically created.
- ExternalName: Maps the Service to the contents of the externalName field (e.g. foo.bar.example.com), by returning a CNAME record with its value. No proxying of any kind is set up. Note: You need either kube-dns version 1.7 or CoreDNS version 0.0.8 or higher to use the ExternalName type.
You can also use Ingress to expose your Service. Ingress is not a Service type, but it acts as the entry point for your cluster. It lets you consolidate your routing rules into a single resource as it can expose multiple services under the same IP address.
Type NodePort
If you set the type field to NodePort, the Kubernetes control plane allocates a port from a range specified by the --service-node-port-range flag (default: 30000-32767). Each node proxies that port (the same port number on every Node) into your Service. Your Service reports the allocated port in its .spec.ports[*].nodePort field.
If you want to specify particular IP(s) to proxy the port, you can set the --nodeport-addresses flag for kube-proxy or the equivalent nodePortAddresses field of the kube-proxy configuration file to particular IP block(s).
This flag takes a comma-delimited list of IP blocks (e.g. 10.0.0.0/8, 192.0.2.0/25) to specify IP address ranges that kube-proxy should consider as local to this node.
For example, if you start kube-proxy with the --nodeport-addresses=127.0.0.0/8 flag, kube-proxy only selects the loopback interface for NodePort Services. The default for --nodeport-addresses is an empty list. This means that kube-proxy should consider all available network interfaces for NodePort. (That's also compatible with earlier Kubernetes releases.)
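As a hedged sketch, the equivalent setting in a kube-proxy configuration file might look like this; the CIDR blocks are the illustrative values from above:
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
# kube-proxy only answers NodePort traffic arriving on addresses within these blocks
nodePortAddresses:
  - 10.0.0.0/8
  - 192.0.2.0/25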
If you want a specific port number, you can specify a value in the nodePort
field. The control plane will either allocate you that port or report that
the API transaction failed.
This means that you need to take care of possible port collisions yourself.
You also have to use a valid port number, one that's inside the range configured
for NodePort use.
Using a NodePort gives you the freedom to set up your own load balancing solution, to configure environments that are not fully supported by Kubernetes, or even to expose one or more nodes' IPs directly.
Note that this Service is visible as <NodeIP>:spec.ports[*].nodePort and .spec.clusterIP:spec.ports[*].port. If the --nodeport-addresses flag for kube-proxy or the equivalent field in the kube-proxy configuration file is set, <NodeIP> would be the filtered node IP(s).
For example:
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  type: NodePort
  selector:
    app: MyApp
  ports:
    # By default and for convenience, the `targetPort` is set to the same value as the `port` field.
    - port: 80
      targetPort: 80
      # Optional field
      # By default and for convenience, the Kubernetes control plane will allocate a port from a range (default: 30000-32767)
      nodePort: 30007
Type LoadBalancer
On cloud providers which support external load balancers, setting the type field to LoadBalancer provisions a load balancer for your Service. The actual creation of the load balancer happens asynchronously, and information about the provisioned balancer is published in the Service's .status.loadBalancer field.
For example:
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: MyApp
  ports:
    - protocol: TCP
      port: 80
      targetPort: 9376
  clusterIP: 10.0.171.239
  type: LoadBalancer
status:
  loadBalancer:
    ingress:
      - ip: 192.0.2.127
Traffic from the external load balancer is directed at the backend Pods. The cloud provider decides how it is load balanced.
Some cloud providers allow you to specify the loadBalancerIP. In those cases, the load balancer is created with the user-specified loadBalancerIP. If the loadBalancerIP field is not specified, the load balancer is set up with an ephemeral IP address. If you specify a loadBalancerIP but your cloud provider does not support the feature, the loadBalancerIP field that you set is ignored.
On Azure, if you want to use a user-specified public type loadBalancerIP, you first need to create a static type public IP address resource. This public IP address resource should be in the same resource group as the other automatically created resources of the cluster, for example MC_myResourceGroup_myAKSCluster_eastus.
Specify the assigned IP address as loadBalancerIP. Ensure that you have updated the securityGroupName in the cloud provider configuration file. For information about troubleshooting CreatingLoadBalancerFailed permission issues, see Use a static IP address with the Azure Kubernetes Service (AKS) load balancer or CreatingLoadBalancerFailed on AKS cluster with advanced networking.
Load balancers with mixed protocol types
Kubernetes v1.20 [alpha]
By default, for LoadBalancer type Services, when there is more than one port defined, all ports must have the same protocol, and the protocol must be one which is supported by the cloud provider. If the MixedProtocolLBService feature gate is enabled for the kube-apiserver, different protocols are allowed when there is more than one port defined.
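A hedged sketch of what that enables, assuming a workload (here a hypothetical DNS backend) that serves the same port over both TCP and UDP:
apiVersion: v1
kind: Service
metadata:
  name: mixed-protocol-lb # hypothetical name
spec:
  type: LoadBalancer
  selector:
    app: dns-backend # assumed selector
  ports:
    - name: dns-tcp
      protocol: TCP
      port: 53
      targetPort: 53
    - name: dns-udp
      protocol: UDP
      port: 53
      targetPort: 53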
Disabling load balancer NodePort allocation
Kubernetes v1.22 [beta]
You can optionally disable node port allocation for a Service of type=LoadBalancer, by setting the field spec.allocateLoadBalancerNodePorts to false. This should only be used for load balancer implementations that route traffic directly to pods as opposed to using node ports. By default, spec.allocateLoadBalancerNodePorts is true and type LoadBalancer Services will continue to allocate node ports. If spec.allocateLoadBalancerNodePorts is set to false on an existing Service with allocated node ports, those node ports will not be de-allocated automatically. You must explicitly remove the nodePorts entry in every Service port to de-allocate those node ports.
Your cluster must have the ServiceLBNodePortControl feature gate enabled to use this field. For Kubernetes v1.23, this feature gate is enabled by default, and you can use the spec.allocateLoadBalancerNodePorts field. For clusters running other versions of Kubernetes, check the documentation for that release.
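A minimal sketch, assuming a load balancer implementation that routes to Pods directly:
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  type: LoadBalancer
  allocateLoadBalancerNodePorts: false # skip NodePort allocation for this Service
  selector:
    app: MyApp
  ports:
    - protocol: TCP
      port: 80
      targetPort: 9376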
Specifying class of load balancer implementation
Kubernetes v1.22 [beta]
spec.loadBalancerClass enables you to use a load balancer implementation other than the cloud provider default.
Your cluster must have the ServiceLoadBalancerClass feature gate enabled to use this field. For Kubernetes v1.23, this feature gate is enabled by default. For clusters running other versions of Kubernetes, check the documentation for that release.
By default, spec.loadBalancerClass is nil and a LoadBalancer type of Service uses the cloud provider's default load balancer implementation if the cluster is configured with a cloud provider using the --cloud-provider component flag. If spec.loadBalancerClass is specified, it is assumed that a load balancer implementation that matches the specified class is watching for Services. Any default load balancer implementation (for example, the one provided by the cloud provider) will ignore Services that have this field set. spec.loadBalancerClass can be set on a Service of type LoadBalancer only. Once set, it cannot be changed. The value of spec.loadBalancerClass must be a label-style identifier, with an optional prefix such as "internal-vip" or "example.com/internal-vip". Unprefixed names are reserved for end-users.
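For illustration, a hedged sketch using the example.com/internal-vip class mentioned above; the class name only has an effect if a matching load balancer controller is installed in your cluster:
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  type: LoadBalancer
  loadBalancerClass: example.com/internal-vip # handled only by a controller watching this class
  selector:
    app: MyApp
  ports:
    - protocol: TCP
      port: 80
      targetPort: 9376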
Internal load balancer
In a mixed environment it is sometimes necessary to route traffic from Services inside the same (virtual) network address block.
In a split-horizon DNS environment you would need two Services to be able to route both external and internal traffic to your endpoints.
To set an internal load balancer, add one of the following annotations to your Service depending on the cloud Service provider you're using.
# GCP
[...]
metadata:
  name: my-service
  annotations:
    cloud.google.com/load-balancer-type: "Internal"
[...]
# AWS
[...]
metadata:
  name: my-service
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-internal: "true"
[...]
# Azure
[...]
metadata:
  name: my-service
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
[...]
# IBM Cloud
[...]
metadata:
  name: my-service
  annotations:
    service.kubernetes.io/ibm-load-balancer-cloud-provider-ip-type: "private"
[...]
# OpenStack
[...]
metadata:
  name: my-service
  annotations:
    service.beta.kubernetes.io/openstack-internal-load-balancer: "true"
[...]
# Baidu Cloud
[...]
metadata:
  name: my-service
  annotations:
    service.beta.kubernetes.io/cce-load-balancer-internal-vpc: "true"
[...]
# Tencent Cloud
[...]
metadata:
  annotations:
    service.kubernetes.io/qcloud-loadbalancer-internal-subnetid: subnet-xxxxx
[...]
# Alibaba Cloud
[...]
metadata:
  annotations:
    service.beta.kubernetes.io/alibaba-cloud-loadbalancer-address-type: "intranet"
[...]
TLS support on AWS
For partial TLS / SSL support on clusters running on AWS, you can add three
annotations to a LoadBalancer
service:
metadata:
  name: my-service
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-ssl-cert: arn:aws:acm:us-east-1:123456789012:certificate/12345678-1234-1234-1234-123456789012
The first specifies the ARN of the certificate to use. It can be either a certificate from a third party issuer that was uploaded to IAM or one created within AWS Certificate Manager.
metadata:
  name: my-service
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: (https|http|ssl|tcp)
The second annotation specifies which protocol a Pod speaks. For HTTPS and SSL, the ELB expects the Pod to authenticate itself over the encrypted connection, using a certificate.
HTTP and HTTPS selects layer 7 proxying: the ELB terminates the connection with the user, parses headers, and injects the X-Forwarded-For header with the user's IP address (Pods only see the IP address of the ELB at the other end of its connection) when forwarding requests. TCP and SSL selects layer 4 proxying: the ELB forwards traffic without modifying the headers.
In a mixed-use environment where some ports are secured and others are left unencrypted, you can use the following annotations:
metadata:
  name: my-service
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: http
    service.beta.kubernetes.io/aws-load-balancer-ssl-ports: "443,8443"
In the above example, if the Service contained three ports, 80, 443, and 8443, then 443 and 8443 would use the SSL certificate, but 80 would be proxied HTTP.
From Kubernetes v1.9 onwards you can use predefined AWS SSL policies with HTTPS or SSL listeners for your Services.
To see which policies are available for use, you can use the aws
command line tool:
aws elb describe-load-balancer-policies --query 'PolicyDescriptions[].PolicyName'
You can then specify any one of those policies using the service.beta.kubernetes.io/aws-load-balancer-ssl-negotiation-policy annotation; for example:
metadata:
  name: my-service
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-ssl-negotiation-policy: "ELBSecurityPolicy-TLS-1-2-2017-01"
PROXY protocol support on AWS
To enable PROXY protocol support for clusters running on AWS, you can use the following service annotation:
metadata:
  name: my-service
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-proxy-protocol: "*"
Since version 1.3.0, the use of this annotation applies to all ports proxied by the ELB and cannot be configured otherwise.
ELB Access Logs on AWS
There are several annotations to manage access logs for ELB Services on AWS.
The annotation service.beta.kubernetes.io/aws-load-balancer-access-log-enabled
controls whether access logs are enabled.
The annotation service.beta.kubernetes.io/aws-load-balancer-access-log-emit-interval
controls the interval in minutes for publishing the access logs. You can specify
an interval of either 5 or 60 minutes.
The annotation service.beta.kubernetes.io/aws-load-balancer-access-log-s3-bucket-name
controls the name of the Amazon S3 bucket where load balancer access logs are
stored.
The annotation service.beta.kubernetes.io/aws-load-balancer-access-log-s3-bucket-prefix
specifies the logical hierarchy you created for your Amazon S3 bucket.
metadata:
  name: my-service
  annotations:
    # Specifies whether access logs are enabled for the load balancer
    service.beta.kubernetes.io/aws-load-balancer-access-log-enabled: "true"
    # The interval for publishing the access logs. You can specify an interval of either 5 or 60 (minutes).
    service.beta.kubernetes.io/aws-load-balancer-access-log-emit-interval: "60"
    # The name of the Amazon S3 bucket where the access logs are stored
    service.beta.kubernetes.io/aws-load-balancer-access-log-s3-bucket-name: "my-bucket"
    # The logical hierarchy you created for your Amazon S3 bucket, for example `my-bucket-prefix/prod`
    service.beta.kubernetes.io/aws-load-balancer-access-log-s3-bucket-prefix: "my-bucket-prefix/prod"
Connection Draining on AWS
Connection draining for Classic ELBs can be managed with the annotation service.beta.kubernetes.io/aws-load-balancer-connection-draining-enabled set to the value of "true". The annotation service.beta.kubernetes.io/aws-load-balancer-connection-draining-timeout can also be used to set the maximum time, in seconds, to keep the existing connections open before deregistering the instances.
metadata:
  name: my-service
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-connection-draining-enabled: "true"
    service.beta.kubernetes.io/aws-load-balancer-connection-draining-timeout: "60"
Other ELB annotations
There are other annotations to manage Classic Elastic Load Balancers that are described below.
metadata:
  name: my-service
  annotations:
    # The time, in seconds, that the connection is allowed to be idle (no data has been
    # sent over the connection) before it is closed by the load balancer
    service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "60"
    # Specifies whether cross-zone load balancing is enabled for the load balancer
    service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
    # A comma-separated list of key-value pairs which will be recorded as
    # additional tags in the ELB.
    service.beta.kubernetes.io/aws-load-balancer-additional-resource-tags: "environment=prod,owner=devops"
    # The number of successive successful health checks required for a backend to
    # be considered healthy for traffic. Defaults to 2, must be between 2 and 10
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-healthy-threshold: ""
    # The number of unsuccessful health checks required for a backend to be
    # considered unhealthy for traffic. Defaults to 6, must be between 2 and 10
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-unhealthy-threshold: "3"
    # The approximate interval, in seconds, between health checks of an
    # individual instance. Defaults to 10, must be between 5 and 300
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval: "20"
    # The amount of time, in seconds, during which no response means a failed
    # health check. This value must be less than the service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval
    # value. Defaults to 5, must be between 2 and 60
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-timeout: "5"
    # A list of existing security groups to be configured on the ELB created. Unlike the annotation
    # service.beta.kubernetes.io/aws-load-balancer-extra-security-groups, this replaces all other
    # security groups previously assigned to the ELB and also overrides the creation of a uniquely
    # generated security group for this ELB.
    # The first security group ID on this list is used as a source to permit incoming traffic to
    # target worker nodes (service traffic and health checks).
    # If multiple ELBs are configured with the same security group ID, only a single permit line
    # will be added to the worker node security groups; that means if you delete any of those ELBs,
    # it will remove the single permit line and block access for all ELBs that shared the same
    # security group ID. This can cause a cross-service outage if not used properly.
    service.beta.kubernetes.io/aws-load-balancer-security-groups: "sg-53fae93f"
    # A list of additional security groups to be added to the created ELB. This leaves the uniquely
    # generated security group in place, which ensures that every ELB has a unique security group ID
    # and a matching permit line to allow traffic to the target worker nodes (service traffic and
    # health checks). Security groups defined here can be shared between services.
    service.beta.kubernetes.io/aws-load-balancer-extra-security-groups: "sg-53fae93f,sg-42efd82e"
    # A comma separated list of key-value pairs which are used
    # to select the target nodes for the load balancer
    service.beta.kubernetes.io/aws-load-balancer-target-node-labels: "ingress-gw,gw-name=public-api"
Network Load Balancer support on AWS
Kubernetes v1.15 [beta]
To use a Network Load Balancer on AWS, use the annotation service.beta.kubernetes.io/aws-load-balancer-type with the value set to nlb.
metadata:
  name: my-service
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
Unlike Classic Elastic Load Balancers, Network Load Balancers (NLBs) forward the client's IP address through to the node. If a Service's .spec.externalTrafficPolicy is set to Cluster, the client's IP address is not propagated to the end Pods.
By setting .spec.externalTrafficPolicy to Local, the client IP address is propagated to the end Pods, but this could result in uneven distribution of traffic. Nodes without any Pods for a particular LoadBalancer Service will fail the NLB Target Group's health check on the auto-assigned .spec.healthCheckNodePort and not receive any traffic.
In order to achieve even traffic, either use a DaemonSet or specify a pod anti-affinity so that the Pods do not land on the same node.
You can also use NLB Services with the internal load balancer annotation.
In order for client traffic to reach instances behind an NLB, the Node security groups are modified with the following IP rules:
Rule | Protocol | Port(s) | IpRange(s) | IpRange Description
---|---|---|---|---
Health Check | TCP | NodePort(s) (.spec.healthCheckNodePort for .spec.externalTrafficPolicy = Local) | Subnet CIDR | kubernetes.io/rule/nlb/health=<loadBalancerName>
Client Traffic | TCP | NodePort(s) | .spec.loadBalancerSourceRanges (defaults to 0.0.0.0/0) | kubernetes.io/rule/nlb/client=<loadBalancerName>
MTU Discovery | ICMP | 3,4 | .spec.loadBalancerSourceRanges (defaults to 0.0.0.0/0) | kubernetes.io/rule/nlb/mtu=<loadBalancerName>
In order to limit which client IPs can access the Network Load Balancer, specify loadBalancerSourceRanges:
spec:
  loadBalancerSourceRanges:
    - "143.231.0.0/16"
If .spec.loadBalancerSourceRanges is not set, Kubernetes allows traffic from 0.0.0.0/0 to the Node Security Group(s). If nodes have public IP addresses, be aware that non-NLB traffic can also reach all instances in those modified security groups.
Further documentation on annotations for Elastic IPs and other common use-cases may be found in the AWS Load Balancer Controller documentation.
Other CLB annotations on Tencent Kubernetes Engine (TKE)
There are other annotations for managing Cloud Load Balancers on TKE as shown below.
metadata:
  name: my-service
  annotations:
    # Bind load balancers with specified nodes
    service.kubernetes.io/qcloud-loadbalancer-backends-label: key in (value1, value2)
    # ID of an existing load balancer
    service.kubernetes.io/tke-existed-lbid: lb-6swtxxxx
    # Custom parameters for the load balancer (LB); does not support modification of LB type yet
    service.kubernetes.io/service.extensiveParameters: ""
    # Custom parameters for the LB listener
    service.kubernetes.io/service.listenerParameters: ""
    # Specifies the type of load balancer;
    # valid values: classic (Classic Cloud Load Balancer) or application (Application Cloud Load Balancer)
    service.kubernetes.io/loadbalance-type: xxxxx
    # Specifies the public network bandwidth billing method;
    # valid values: TRAFFIC_POSTPAID_BY_HOUR (bill-by-traffic) and BANDWIDTH_POSTPAID_BY_HOUR (bill-by-bandwidth)
    service.kubernetes.io/qcloud-loadbalancer-internet-charge-type: xxxxxx
    # Specifies the bandwidth value (value range: [1,2000] Mbps)
    service.kubernetes.io/qcloud-loadbalancer-internet-max-bandwidth-out: "10"
    # When this annotation is set, the load balancers will only register nodes
    # with Pods running on them; otherwise all nodes will be registered
    service.kubernetes.io/local-svc-only-bind-node-with-pod: true
Type ExternalName
Services of type ExternalName map a Service to a DNS name, not to a typical selector such as my-service or cassandra. You specify these Services with the spec.externalName parameter.
This Service definition, for example, maps the my-service Service in the prod namespace to my.database.example.com:
apiVersion: v1
kind: Service
metadata:
  name: my-service
  namespace: prod
spec:
  type: ExternalName
  externalName: my.database.example.com
When looking up the host my-service.prod.svc.cluster.local, the cluster DNS Service returns a CNAME record with the value my.database.example.com. Accessing my-service works in the same way as other Services, but with the crucial difference that redirection happens at the DNS level rather than via proxying or forwarding. Should you later decide to move your database into your cluster, you can start its Pods, add appropriate selectors or endpoints, and change the Service's type.
You may have trouble using ExternalName for some common protocols, including HTTP and HTTPS. If you use ExternalName then the hostname used by clients inside your cluster is different from the name that the ExternalName references. For protocols that use hostnames this difference may lead to errors or unexpected responses. HTTP requests will have a Host: header that the origin server does not recognize; TLS servers will not be able to provide a certificate matching the hostname that the client connected to.
External IPs
If there are external IPs that route to one or more cluster nodes, Kubernetes Services can be exposed on those externalIPs. Traffic that ingresses into the cluster with the external IP (as destination IP), on the Service port, will be routed to one of the Service endpoints. externalIPs are not managed by Kubernetes and are the responsibility of the cluster administrator.
In the Service spec, externalIPs can be specified along with any of the ServiceTypes. In the example below, "my-service" can be accessed by clients on "80.11.12.10:80" (externalIP:port).
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: MyApp
  ports:
    - name: http
      protocol: TCP
      port: 80
      targetPort: 9376
  externalIPs:
    - 80.11.12.10
Shortcomings
Using the userspace proxy for VIPs works at small to medium scale, but will not scale to very large clusters with thousands of Services. The original design proposal for portals has more details on this.
Using the userspace proxy obscures the source IP address of a packet accessing a Service. This makes some kinds of network filtering (firewalling) impossible. The iptables proxy mode does not obscure in-cluster source IPs, but it does still impact clients coming through a load balancer or node-port.
The Type field is designed as nested functionality - each level adds to the previous. This is not strictly required on all cloud providers (e.g. Google Compute Engine does not need to allocate a NodePort to make LoadBalancer work, but AWS does) but the current API requires it.
Virtual IP implementation
The previous information should be sufficient for many people who want to use Services. However, there is a lot going on behind the scenes that may be worth understanding.
Avoiding collisions
One of the primary philosophies of Kubernetes is that you should not be exposed to situations that could cause your actions to fail through no fault of your own. For the design of the Service resource, this means not making you choose your own port number if that choice might collide with someone else's choice. That is an isolation failure.
In order to allow you to choose a port number for your Services, we must ensure that no two Services can collide. Kubernetes does that by allocating each Service its own IP address.
To ensure each Service receives a unique IP, an internal allocator atomically updates a global allocation map in etcd prior to creating each Service. The map object must exist in the registry for Services to get IP address assignments, otherwise creations will fail with a message indicating an IP address could not be allocated.
In the control plane, a background controller is responsible for creating that map (needed to support migrating from older versions of Kubernetes that used in-memory locking). Kubernetes also uses controllers to check for invalid assignments (eg due to administrator intervention) and for cleaning up allocated IP addresses that are no longer used by any Services.
Service IP addresses
Unlike Pod IP addresses, which actually route to a fixed destination, Service IPs are not actually answered by a single host. Instead, kube-proxy uses iptables (packet processing logic in Linux) to define virtual IP addresses which are transparently redirected as needed. When clients connect to the VIP, their traffic is automatically transported to an appropriate endpoint. The environment variables and DNS for Services are actually populated in terms of the Service's virtual IP address (and port).
kube-proxy supports three proxy modes—userspace, iptables and IPVS—which each operate slightly differently.
Userspace
As an example, consider the image processing application described above. When the backend Service is created, the Kubernetes master assigns a virtual IP address, for example 10.0.0.1. Assuming the Service port is 1234, the Service is observed by all of the kube-proxy instances in the cluster. When a proxy sees a new Service, it opens a new random port, establishes an iptables redirect from the virtual IP address to this new port, and starts accepting connections on it.
When a client connects to the Service's virtual IP address, the iptables rule kicks in, and redirects the packets to the proxy's own port. The "Service proxy" chooses a backend, and starts proxying traffic from the client to the backend.
This means that Service owners can choose any port they want without risk of collision. Clients can connect to an IP and port, without being aware of which Pods they are actually accessing.
iptables
Again, consider the image processing application described above. When the backend Service is created, the Kubernetes control plane assigns a virtual IP address, for example 10.0.0.1. Assuming the Service port is 1234, the Service is observed by all of the kube-proxy instances in the cluster. When a proxy sees a new Service, it installs a series of iptables rules which redirect from the virtual IP address to per-Service rules. The per-Service rules link to per-Endpoint rules which redirect traffic (using destination NAT) to the backends.
When a client connects to the Service's virtual IP address the iptables rule kicks in. A backend is chosen (either based on session affinity or randomly) and packets are redirected to the backend. Unlike the userspace proxy, packets are never copied to userspace, the kube-proxy does not have to be running for the virtual IP address to work, and Nodes see traffic arriving from the unaltered client IP address.
This same basic flow executes when traffic comes in through a node-port or through a load-balancer, though in those cases the client IP does get altered.
IPVS
iptables operations slow down dramatically in a large-scale cluster, e.g., one with 10,000 Services. IPVS is designed for load balancing and based on in-kernel hash tables, so an IPVS-based kube-proxy gives consistent performance even with a large number of Services. An IPVS-based kube-proxy also supports more sophisticated load balancing algorithms (least connections, locality, weighted, persistence).
API Object
Service is a top-level resource in the Kubernetes REST API. You can find more details about the API object at: Service API object.
Supported protocols
TCP
You can use TCP for any kind of Service, and it's the default network protocol.
UDP
You can use UDP for most Services. For type=LoadBalancer Services, UDP support depends on the cloud provider offering this facility.
SCTP
Kubernetes v1.20 [stable]
When using a network plugin that supports SCTP traffic, you can use SCTP for most Services. For type=LoadBalancer Services, SCTP support depends on the cloud provider offering this facility. (Most do not).
Warnings
Support for multihomed SCTP associations
The support of multihomed SCTP associations requires that the CNI plugin can support the assignment of multiple interfaces and IP addresses to a Pod.
NAT for multihomed SCTP associations requires special logic in the corresponding kernel modules.
Windows
Note: SCTP is not supported on Windows based nodes.
Userspace kube-proxy
Note: The kube-proxy does not support the management of SCTP associations when it is in userspace mode.
HTTP
If your cloud provider supports it, you can use a Service in LoadBalancer mode to set up external HTTP / HTTPS reverse proxying, forwarded to the Endpoints of the Service.
PROXY protocol
If your cloud provider supports it, you can use a Service in LoadBalancer mode to configure a load balancer outside of Kubernetes itself, that will forward connections prefixed with PROXY protocol.
The load balancer will send an initial series of octets describing the incoming connection, similar to this example
PROXY TCP4 192.0.2.202 10.0.42.7 12345 7\r\n
followed by the data from the client.
What's next
- Read Connecting Applications with Services
- Read about Ingress
- Read about EndpointSlices
3.5.2 - Topology-aware traffic routing with topology keys
Kubernetes v1.21 [deprecated]
The topologyKeys API is deprecated since Kubernetes v1.21. Topology Aware Hints, introduced in Kubernetes v1.21, provide similar functionality.
Service Topology enables a service to route traffic based upon the Node topology of the cluster. For example, a service can specify that traffic be preferentially routed to endpoints that are on the same Node as the client, or in the same availability zone.
Topology-aware traffic routing
By default, traffic sent to a ClusterIP or NodePort Service may be routed to any backend address for the Service. Kubernetes 1.7 made it possible to route "external" traffic to the Pods running on the same Node that received the traffic. For ClusterIP Services, the equivalent same-node preference for routing wasn't possible; nor could you configure your cluster to favor routing to endpoints within the same zone.
By setting topologyKeys on a Service, you're able to define a policy for routing traffic based upon the Node labels for the originating and destination Nodes.
The label matching between the source and destination lets you, as a cluster operator, designate sets of Nodes that are "closer" and "farther" from one another. You can define labels to represent whatever metric makes sense for your own requirements. In public clouds, for example, you might prefer to keep network traffic within the same zone, because interzonal traffic has a cost associated with it (and intrazonal traffic typically does not). Other common needs include being able to route traffic to a local Pod managed by a DaemonSet, or directing traffic to Nodes connected to the same top-of-rack switch for the lowest latency.
Using Service Topology
If your cluster has the ServiceTopology feature gate enabled, you can control Service traffic routing by specifying the topologyKeys field on the Service spec. This field is a preference-order list of Node labels which will be used to sort endpoints when accessing this Service. Traffic will be directed to a Node whose value for the first label matches the originating Node's value for that label. If there is no backend for the Service on a matching Node, then the second label will be considered, and so forth, until no labels remain.
If no match is found, the traffic will be rejected, as if there were no backends for the Service at all. That is, endpoints are chosen based on the first topology key with available backends. If this field is specified and all entries have no backends that match the topology of the client, the service has no backends for that client and connections should fail. The special value "*" may be used to mean "any topology". This catch-all value, if used, only makes sense as the last value in the list.
If topologyKeys is not specified or empty, no topology constraints will be applied.
Consider a cluster with Nodes that are labeled with their hostname, zone name, and region name. Then you can set the topologyKeys values of a service to direct traffic as follows.
- Only to endpoints on the same node, failing if no endpoint exists on the node: ["kubernetes.io/hostname"].
- Preferentially to endpoints on the same node, falling back to endpoints in the same zone, followed by the same region, and failing otherwise: ["kubernetes.io/hostname", "topology.kubernetes.io/zone", "topology.kubernetes.io/region"]. This may be useful, for example, in cases where data locality is critical.
- Preferentially to the same zone, but falling back on any available endpoint if none are available within this zone: ["topology.kubernetes.io/zone", "*"].
Constraints
- Service topology is not compatible with externalTrafficPolicy=Local, and therefore a Service cannot use both of these features. It is possible to use both features in the same cluster on different Services, only not on the same Service.
- Valid topology keys are currently limited to kubernetes.io/hostname, topology.kubernetes.io/zone, and topology.kubernetes.io/region, but will be generalized to other node labels in the future.
- Topology keys must be valid label keys and at most 16 keys may be specified.
- The catch-all value, "*", must be the last value in the topology keys, if it is used.
Examples
The following are common examples of using the Service Topology feature.
Only Node Local Endpoints
A Service that only routes to node local endpoints. If no endpoints exist on the node, traffic is dropped:
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: my-app
  ports:
    - protocol: TCP
      port: 80
      targetPort: 9376
  topologyKeys:
    - "kubernetes.io/hostname"
Prefer Node Local Endpoints
A Service that prefers node local Endpoints but falls back to cluster wide endpoints if node local endpoints do not exist:
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: my-app
  ports:
    - protocol: TCP
      port: 80
      targetPort: 9376
  topologyKeys:
    - "kubernetes.io/hostname"
    - "*"
Only Zonal or Regional Endpoints
A Service that prefers zonal then regional endpoints. If no endpoints exist in either, traffic is dropped.
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: my-app
  ports:
    - protocol: TCP
      port: 80
      targetPort: 9376
  topologyKeys:
    - "topology.kubernetes.io/zone"
    - "topology.kubernetes.io/region"
Prefer Node Local, Zonal, then Regional Endpoints
A Service that prefers node local, zonal, then regional endpoints but falls back to cluster wide endpoints.
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: my-app
  ports:
    - protocol: TCP
      port: 80
      targetPort: 9376
  topologyKeys:
    - "kubernetes.io/hostname"
    - "topology.kubernetes.io/zone"
    - "topology.kubernetes.io/region"
    - "*"
What's next
- Read about enabling Service Topology
- Read Connecting Applications with Services
3.5.3 - DNS for Services and Pods
Kubernetes creates DNS records for services and pods. You can contact services with consistent DNS names instead of IP addresses.
Introduction
Kubernetes DNS schedules a DNS Pod and Service on the cluster, and configures the kubelets to tell individual containers to use the DNS Service's IP to resolve DNS names.
Every Service defined in the cluster (including the DNS server itself) is assigned a DNS name. By default, a client Pod's DNS search list includes the Pod's own namespace and the cluster's default domain.
Namespaces of Services
A DNS query may return different results based on the namespace of the pod making it. DNS queries that don't specify a namespace are limited to the pod's namespace. Access services in other namespaces by specifying it in the DNS query.
For example, consider a pod in a test namespace, and a data service in the prod namespace. A query for data returns no results, because it uses the pod's test namespace. A query for data.prod returns the intended result, because it specifies the namespace.
DNS queries may be expanded using the pod's /etc/resolv.conf. Kubelet sets this file for each pod. For example, a query for just data may be expanded to data.test.svc.cluster.local. The values of the search option are used to expand queries. To learn more about DNS queries, see the resolv.conf manual page.
nameserver 10.32.0.10
search <namespace>.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
In summary, a pod in the test namespace can successfully resolve either data.prod or data.prod.svc.cluster.local.
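You can check this behavior from inside a pod; the pod name below is a placeholder for any pod running in the test namespace:
kubectl exec -ti <pod-in-test-namespace> -n test -- nslookup data.prod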
DNS Records
What objects get DNS records?
- Services
- Pods
The following sections detail the supported DNS record types and layout. Any other layout or names or queries that happen to work are considered implementation details and are subject to change without warning. For a more up-to-date specification, see Kubernetes DNS-Based Service Discovery.
Services
A/AAAA records
"Normal" (not headless) Services are assigned a DNS A or AAAA record,
depending on the IP family of the service, for a name of the form
my-svc.my-namespace.svc.cluster-domain.example
. This resolves to the cluster IP
of the Service.
"Headless" (without a cluster IP) Services are also assigned a DNS A or AAAA record,
depending on the IP family of the service, for a name of the form
my-svc.my-namespace.svc.cluster-domain.example
. Unlike normal
Services, this resolves to the set of IPs of the pods selected by the Service.
Clients are expected to consume the set or else use standard round-robin
selection from the set.
SRV records
SRV Records are created for named ports that are part of normal or Headless Services. For each named port, the SRV record has the form _my-port-name._my-port-protocol.my-svc.my-namespace.svc.cluster-domain.example. For a regular service, this resolves to the port number and the domain name: my-svc.my-namespace.svc.cluster-domain.example. For a headless service, this resolves to multiple answers, one for each pod that is backing the service, and contains the port number and the domain name of the pod of the form auto-generated-name.my-svc.my-namespace.svc.cluster-domain.example.
Pods
A/AAAA records
In general a pod has the following DNS resolution: pod-ip-address.my-namespace.pod.cluster-domain.example.
For example, if a pod in the default namespace has the IP address 172.17.0.3, and the domain name for your cluster is cluster.local, then the Pod has the DNS name 172-17-0-3.default.pod.cluster.local.
Any pods exposed by a Service have the following DNS resolution available: pod-ip-address.service-name.my-namespace.svc.cluster-domain.example.
Pod's hostname and subdomain fields
Currently when a pod is created, its hostname is the Pod's metadata.name value.
The Pod spec has an optional hostname field, which can be used to specify the Pod's hostname. When specified, it takes precedence over the Pod's name to be the hostname of the pod. For example, given a Pod with hostname set to "my-host", the Pod will have its hostname set to "my-host".
The Pod spec also has an optional subdomain field which can be used to specify its subdomain. For example, a Pod with hostname set to "foo", and subdomain set to "bar", in namespace "my-namespace", will have the fully qualified domain name (FQDN) "foo.bar.my-namespace.svc.cluster-domain.example".
Example:
apiVersion: v1
kind: Service
metadata:
  name: default-subdomain
spec:
  selector:
    name: busybox
  clusterIP: None
  ports:
    - name: foo # Actually, no port is needed.
      port: 1234
      targetPort: 1234
---
apiVersion: v1
kind: Pod
metadata:
  name: busybox1
  labels:
    name: busybox
spec:
  hostname: busybox-1
  subdomain: default-subdomain
  containers:
    - image: busybox:1.28
      command:
        - sleep
        - "3600"
      name: busybox
---
apiVersion: v1
kind: Pod
metadata:
  name: busybox2
  labels:
    name: busybox
spec:
  hostname: busybox-2
  subdomain: default-subdomain
  containers:
    - image: busybox:1.28
      command:
        - sleep
        - "3600"
      name: busybox
If there exists a headless service in the same namespace as the pod and with the same name as the subdomain, the cluster's DNS Server also returns an A or AAAA record for the Pod's fully qualified hostname. For example, given a Pod with the hostname set to "busybox-1" and the subdomain set to "default-subdomain", and a headless Service named "default-subdomain" in the same namespace, the pod will see its own FQDN as "busybox-1.default-subdomain.my-namespace.svc.cluster-domain.example". DNS serves an A or AAAA record at that name, pointing to the Pod's IP. Both pods "busybox1" and "busybox2" can have their distinct A or AAAA records.
The Endpoints object can specify the hostname for any endpoint addresses, along with its IP. The hostname is required for the Pod's A or AAAA record to be created. A Pod with no hostname but with subdomain will only create the A or AAAA record for the headless service (default-subdomain.my-namespace.svc.cluster-domain.example), pointing to the Pod's IP address. Also, the Pod needs to become ready in order to have a record, unless publishNotReadyAddresses=True is set on the Service.
Pod's setHostnameAsFQDN field
Kubernetes v1.22 [stable]
When a Pod is configured to have a fully qualified domain name (FQDN), its hostname is still the short hostname. For example, if you have a Pod with the fully qualified domain name busybox-1.default-subdomain.my-namespace.svc.cluster-domain.example, then by default the hostname command inside that Pod returns busybox-1 and the hostname --fqdn command returns the FQDN.
When you set setHostnameAsFQDN: true in the Pod spec, the kubelet writes the Pod's FQDN into the hostname for that Pod's namespace. In this case, both hostname and hostname --fqdn return the Pod's FQDN.
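A minimal sketch, reusing the busybox naming from the earlier example (the name busybox3 is an assumption):
apiVersion: v1
kind: Pod
metadata:
  name: busybox3 # assumed name
spec:
  hostname: busybox-3
  subdomain: default-subdomain
  setHostnameAsFQDN: true # hostname inside the Pod reports the FQDN
  containers:
    - name: busybox
      image: busybox:1.28
      command:
        - sleep
        - "3600"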
In Linux, the hostname field of the kernel (the nodename field of struct utsname) is limited to 64 characters. If a Pod enables this feature and its FQDN is longer than 64 characters, it will fail to start. The Pod will remain in Pending status (ContainerCreating as seen by kubectl), generating error events such as "Failed to construct FQDN from pod hostname and cluster domain, FQDN long-FQDN is too long (64 characters is the max, 70 characters requested)". One way of improving the user experience for this scenario is to create an admission webhook controller to control FQDN size when users create top-level objects, for example, Deployment.
Pod's DNS Policy
DNS policies can be set on a per-pod basis. Currently Kubernetes supports the following pod-specific DNS policies. These policies are specified in the dnsPolicy field of a Pod Spec.
- "Default": The Pod inherits the name resolution configuration from the node that the pods run on. See related discussion for more details.
- "ClusterFirst": Any DNS query that does not match the configured cluster domain suffix, such as "www.kubernetes.io", is forwarded to the upstream nameserver inherited from the node. Cluster administrators may have extra stub-domain and upstream DNS servers configured. See related discussion for details on how DNS queries are handled in those cases.
- "ClusterFirstWithHostNet": For Pods running with hostNetwork, you should explicitly set the DNS policy to "ClusterFirstWithHostNet".
- "None": It allows a Pod to ignore DNS settings from the Kubernetes environment. All DNS settings are supposed to be provided using the dnsConfig field in the Pod Spec. See Pod's DNS config subsection below.
If dnsPolicy is not explicitly specified, then "ClusterFirst" is used.
The example below shows a Pod with its DNS policy set to "ClusterFirstWithHostNet" because it has hostNetwork set to true.
apiVersion: v1
kind: Pod
metadata:
  name: busybox
  namespace: default
spec:
  containers:
    - image: busybox:1.28
      command:
        - sleep
        - "3600"
      imagePullPolicy: IfNotPresent
      name: busybox
  restartPolicy: Always
  hostNetwork: true
  dnsPolicy: ClusterFirstWithHostNet
Pod's DNS Config
Kubernetes v1.14 [stable]
Pod's DNS Config allows users more control over the DNS settings for a Pod.
The dnsConfig field is optional and it can work with any dnsPolicy settings. However, when a Pod's dnsPolicy is set to "None", the dnsConfig field has to be specified.
Below are the properties a user can specify in the dnsConfig field:
- nameservers: a list of IP addresses that will be used as DNS servers for the Pod. There can be at most 3 IP addresses specified. When the Pod's dnsPolicy is set to "None", the list must contain at least one IP address, otherwise this property is optional. The servers listed will be combined to the base nameservers generated from the specified DNS policy with duplicate addresses removed.
- searches: a list of DNS search domains for hostname lookup in the Pod. This property is optional. When specified, the provided list will be merged into the base search domain names generated from the chosen DNS policy. Duplicate domain names are removed. Kubernetes allows for at most 6 search domains.
- options: an optional list of objects where each object may have a name property (required) and a value property (optional). The contents in this property will be merged to the options generated from the specified DNS policy. Duplicate entries are removed.
The following is an example Pod with custom DNS settings:
apiVersion: v1
kind: Pod
metadata:
  namespace: default
  name: dns-example
spec:
  containers:
    - name: test
      image: nginx
  dnsPolicy: "None"
  dnsConfig:
    nameservers:
      - 1.2.3.4
    searches:
      - ns1.svc.cluster-domain.example
      - my.dns.search.suffix
    options:
      - name: ndots
        value: "2"
      - name: edns0
When the Pod above is created, the container test gets the following contents in its /etc/resolv.conf file:
nameserver 1.2.3.4
search ns1.svc.cluster-domain.example my.dns.search.suffix
options ndots:2 edns0
For an IPv6 setup, the search path and name server should be set up like this:
kubectl exec -it dns-example -- cat /etc/resolv.conf
The output is similar to this:
nameserver fd00:79:30::a
search default.svc.cluster-domain.example svc.cluster-domain.example cluster-domain.example
options ndots:5
Expanded DNS Configuration
Kubernetes 1.22 [alpha]
By default, for a Pod's DNS Config, Kubernetes allows at most 6 search domains and a list of search domains of up to 256 characters. If the feature gate ExpandedDNSConfig is enabled for the kube-apiserver and the kubelet, Kubernetes allows at most 32 search domains and a list of search domains of up to 2048 characters.
Feature availability
The availability of Pod DNS Config and DNS Policy "None" is shown below.
k8s version | Feature support
---|---
1.14 | Stable
1.10 | Beta (on by default)
1.9 | Alpha
What's next
For guidance on administering DNS configurations, check Configure DNS Service.
3.5.4 - Connecting Applications with Services
The Kubernetes model for connecting containers
Now that you have a continuously running, replicated application you can expose it on a network.
Kubernetes assumes that pods can communicate with other pods, regardless of which host they land on. Kubernetes gives every pod its own cluster-private IP address, so you do not need to explicitly create links between pods or map container ports to host ports. This means that containers within a Pod can all reach each other's ports on localhost, and all pods in a cluster can see each other without NAT. The rest of this document elaborates on how you can run reliable services on such a networking model.
This guide uses a simple nginx server to demonstrate proof of concept.
Exposing pods to the cluster
We did this in a previous example, but let's do it once again and focus on the networking perspective. Create an nginx Pod, and note that it has a container port specification:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-nginx
spec:
  selector:
    matchLabels:
      run: my-nginx
  replicas: 2
  template:
    metadata:
      labels:
        run: my-nginx
    spec:
      containers:
        - name: my-nginx
          image: nginx
          ports:
            - containerPort: 80
This makes it accessible from any node in your cluster. Check the nodes the Pod is running on:
kubectl apply -f ./run-my-nginx.yaml
kubectl get pods -l run=my-nginx -o wide
NAME READY STATUS RESTARTS AGE IP NODE
my-nginx-3800858182-jr4a2 1/1 Running 0 13s 10.244.3.4 kubernetes-minion-905m
my-nginx-3800858182-kna2y 1/1 Running 0 13s 10.244.2.5 kubernetes-minion-ljyd
Check your pods' IPs:
kubectl get pods -l run=my-nginx -o yaml | grep podIP
podIP: 10.244.3.4
podIP: 10.244.2.5
You should be able to ssh into any node in your cluster and use a tool such as curl to make queries against both IPs. Note that the containers are not using port 80 on the node, nor are there any special NAT rules to route traffic to the pod. This means you can run multiple nginx pods on the same node all using the same containerPort, and access them from any other pod or node in your cluster using the Pod's assigned IP address. If you want to arrange for a specific port on the host Node to be forwarded to backing Pods, you can - but the networking model should mean that you do not need to do so.
You can read more about the Kubernetes Networking Model if you're curious.
Creating a Service
So we have pods running nginx in a flat, cluster wide, address space. In theory, you could talk to these pods directly, but what happens when a node dies? The pods die with it, and the Deployment will create new ones, with different IPs. This is the problem a Service solves.
A Kubernetes Service is an abstraction which defines a logical set of Pods running somewhere in your cluster, that all provide the same functionality. When created, each Service is assigned a unique IP address (also called clusterIP). This address is tied to the lifespan of the Service, and will not change while the Service is alive. Pods can be configured to talk to the Service, and know that communication to the Service will be automatically load-balanced out to some pod that is a member of the Service.
You can create a Service for your 2 nginx replicas with kubectl expose
:
kubectl expose deployment/my-nginx
service/my-nginx exposed
This is equivalent to kubectl apply -f
the following yaml:
apiVersion: v1
kind: Service
metadata:
  name: my-nginx
  labels:
    run: my-nginx
spec:
  ports:
    - port: 80
      protocol: TCP
  selector:
    run: my-nginx
This specification will create a Service which targets TCP port 80 on any Pod with the run: my-nginx label, and expose it on an abstracted Service port (targetPort: is the port the container accepts traffic on, port: is the abstracted Service port, which can be any port other pods use to access the Service). View the Service API object to see the list of supported fields in the service definition.
Check your Service:
kubectl get svc my-nginx
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
my-nginx ClusterIP 10.0.162.149 <none> 80/TCP 21s
As mentioned previously, a Service is backed by a group of Pods. These Pods are exposed through endpoints. The Service's selector will be evaluated continuously and the results will be POSTed to an Endpoints object also named my-nginx. When a Pod dies, it is automatically removed from the endpoints, and new Pods matching the Service's selector will automatically get added to the endpoints. Check the endpoints, and note that the IPs are the same as the Pods created in the first step:
kubectl describe svc my-nginx
Name: my-nginx
Namespace: default
Labels: run=my-nginx
Annotations: <none>
Selector: run=my-nginx
Type: ClusterIP
IP: 10.0.162.149
Port: <unset> 80/TCP
Endpoints: 10.244.2.5:80,10.244.3.4:80
Session Affinity: None
Events: <none>
kubectl get ep my-nginx
NAME ENDPOINTS AGE
my-nginx 10.244.2.5:80,10.244.3.4:80 1m
You should now be able to curl the nginx Service on <CLUSTER-IP>:<PORT> from any node in your cluster. Note that the Service IP is completely virtual; it never hits the wire. If you're curious about how this works, you can read more about the service proxy.
Accessing the Service
Kubernetes supports 2 primary modes of finding a Service - environment variables and DNS. The former works out of the box while the latter requires the CoreDNS cluster addon.
If you don't want the service environment variables injected (for example, because they might clash with your application's own variables, or you only use DNS for discovery), you can disable this mode by setting the enableServiceLinks flag to false on the pod spec.
Environment Variables
When a Pod runs on a Node, the kubelet adds a set of environment variables for each active Service. This introduces an ordering problem. To see why, inspect the environment of your running nginx Pods (your Pod name will be different):
kubectl exec my-nginx-3800858182-jr4a2 -- printenv | grep SERVICE
KUBERNETES_SERVICE_HOST=10.0.0.1
KUBERNETES_SERVICE_PORT=443
KUBERNETES_SERVICE_PORT_HTTPS=443
Note there's no mention of your Service. This is because you created the replicas before the Service. Another disadvantage of doing this is that the scheduler might put both Pods on the same machine, which will take your entire Service down if it dies. We can do this the right way by killing the 2 Pods and waiting for the Deployment to recreate them. This time around the Service exists before the replicas. This will give you scheduler-level Service spreading of your Pods (provided all your nodes have equal capacity), as well as the right environment variables:
kubectl scale deployment my-nginx --replicas=0; kubectl scale deployment my-nginx --replicas=2;
kubectl get pods -l run=my-nginx -o wide
NAME READY STATUS RESTARTS AGE IP NODE
my-nginx-3800858182-e9ihh 1/1 Running 0 5s 10.244.2.7 kubernetes-minion-ljyd
my-nginx-3800858182-j4rm4 1/1 Running 0 5s 10.244.3.8 kubernetes-minion-905m
You may notice that the pods have different names, since they are killed and recreated.
kubectl exec my-nginx-3800858182-e9ihh -- printenv | grep SERVICE
KUBERNETES_SERVICE_PORT=443
MY_NGINX_SERVICE_HOST=10.0.162.149
KUBERNETES_SERVICE_HOST=10.0.0.1
MY_NGINX_SERVICE_PORT=80
KUBERNETES_SERVICE_PORT_HTTPS=443
DNS
Kubernetes offers a DNS cluster addon Service that automatically assigns DNS names to other Services. You can check if it's running on your cluster:
kubectl get services kube-dns --namespace=kube-system
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kube-dns ClusterIP 10.0.0.10 <none> 53/UDP,53/TCP 8m
The rest of this section will assume you have a Service with a long-lived IP (my-nginx), and a DNS server that has assigned a name to that IP. Here we use the CoreDNS cluster addon (application name kube-dns), so you can talk to the Service from any pod in your cluster using standard methods (e.g. gethostbyname()). If CoreDNS isn't running, you can enable it by referring to the CoreDNS README or Installing CoreDNS. Let's run another curl application to test this:
kubectl run curl --image=radial/busyboxplus:curl -i --tty
Waiting for pod default/curl-131556218-9fnch to be running, status is Pending, pod ready: false
Hit enter for command prompt
Then, hit enter and run nslookup my-nginx:
[ root@curl-131556218-9fnch:/ ]$ nslookup my-nginx
Server: 10.0.0.10
Address 1: 10.0.0.10
Name: my-nginx
Address 1: 10.0.162.149
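From a pod in another namespace, the short name my-nginx would not resolve; assuming the Service is in the default namespace and your cluster uses the default cluster.local domain, you can use the fully qualified name instead:
[ root@curl-131556218-9fnch:/ ]$ nslookup my-nginx.default.svc.cluster.local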
Securing the Service
So far we have only accessed the nginx server from within the cluster. Before exposing the Service to the internet, you want to make sure the communication channel is secure. For this, you will need:
- Self-signed certificates for HTTPS (unless you already have an identity certificate)
- An nginx server configured to use the certificates
- A secret that makes the certificates accessible to pods
You can acquire all these from the nginx https example. This requires having go and make tools installed. If you don't want to install those, then follow the manual steps later. In short:
make keys KEY=/tmp/nginx.key CERT=/tmp/nginx.crt
kubectl create secret tls nginxsecret --key /tmp/nginx.key --cert /tmp/nginx.crt
secret/nginxsecret created
kubectl get secrets
NAME TYPE DATA AGE
default-token-il9rc kubernetes.io/service-account-token 1 1d
nginxsecret kubernetes.io/tls 2 1m
And also the configmap:
kubectl create configmap nginxconfigmap --from-file=default.conf
configmap/nginxconfigmap created
kubectl get configmaps
NAME DATA AGE
nginxconfigmap 1 114s
Following are the manual steps to follow in case you run into problems running make (on Windows, for example):
# Create a public/private key pair
openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout /d/tmp/nginx.key -out /d/tmp/nginx.crt -subj "/CN=my-nginx/O=my-nginx"
# Convert the keys to base64 encoding
cat /d/tmp/nginx.crt | base64
cat /d/tmp/nginx.key | base64
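Depending on your platform, base64 may wrap long output across lines. A hedged alternative that keeps each value on a single line (the -w 0 flag is GNU coreutils; BSD/macOS base64 does not wrap by default):
base64 -w 0 /d/tmp/nginx.crt
base64 -w 0 /d/tmp/nginx.key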
Use the output from the previous commands to create a yaml file as follows. The base64 encoded value should all be on a single line.
apiVersion: "v1"
kind: "Secret"
metadata:
name: "nginxsecret"
namespace: "default"
type: kubernetes.io/tls
data:
tls.crt: "LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSURIekNDQWdlZ0F3SUJBZ0lKQUp5M3lQK0pzMlpJTUEwR0NTcUdTSWIzRFFFQkJRVUFNQ1l4RVRBUEJnTlYKQkFNVENHNW5hVzU0YzNaak1SRXdEd1lEVlFRS0V3aHVaMmx1ZUhOMll6QWVGdzB4TnpFd01qWXdOekEzTVRKYQpGdzB4T0RFd01qWXdOekEzTVRKYU1DWXhFVEFQQmdOVkJBTVRDRzVuYVc1NGMzWmpNUkV3RHdZRFZRUUtFd2h1CloybHVlSE4yWXpDQ0FTSXdEUVlKS29aSWh2Y05BUUVCQlFBRGdnRVBBRENDQVFvQ2dnRUJBSjFxSU1SOVdWM0IKMlZIQlRMRmtobDRONXljMEJxYUhIQktMSnJMcy8vdzZhU3hRS29GbHlJSU94NGUrMlN5ajBFcndCLzlYTnBwbQppeW1CL3JkRldkOXg5UWhBQUxCZkVaTmNiV3NsTVFVcnhBZW50VWt1dk1vLzgvMHRpbGhjc3paenJEYVJ4NEo5Ci82UVRtVVI3a0ZTWUpOWTVQZkR3cGc3dlVvaDZmZ1Voam92VG42eHNVR0M2QURVODBpNXFlZWhNeVI1N2lmU2YKNHZpaXdIY3hnL3lZR1JBRS9mRTRqakxCdmdONjc2SU90S01rZXV3R0ljNDFhd05tNnNTSzRqYUNGeGpYSnZaZQp2by9kTlEybHhHWCtKT2l3SEhXbXNhdGp4WTRaNVk3R1ZoK0QrWnYvcW1mMFgvbVY0Rmo1NzV3ajFMWVBocWtsCmdhSXZYRyt4U1FVQ0F3RUFBYU5RTUU0d0hRWURWUjBPQkJZRUZPNG9OWkI3YXc1OUlsYkROMzhIYkduYnhFVjcKTUI4R0ExVWRJd1FZTUJhQUZPNG9OWkI3YXc1OUlsYkROMzhIYkduYnhFVjdNQXdHQTFVZEV3UUZNQU1CQWY4dwpEUVlKS29aSWh2Y05BUUVGQlFBRGdnRUJBRVhTMW9FU0lFaXdyMDhWcVA0K2NwTHI3TW5FMTducDBvMm14alFvCjRGb0RvRjdRZnZqeE04Tzd2TjB0clcxb2pGSW0vWDE4ZnZaL3k4ZzVaWG40Vm8zc3hKVmRBcStNZC9jTStzUGEKNmJjTkNUekZqeFpUV0UrKzE5NS9zb2dmOUZ3VDVDK3U2Q3B5N0M3MTZvUXRUakViV05VdEt4cXI0Nk1OZWNCMApwRFhWZmdWQTRadkR4NFo3S2RiZDY5eXM3OVFHYmg5ZW1PZ05NZFlsSUswSGt0ejF5WU4vbVpmK3FqTkJqbWZjCkNnMnlwbGQ0Wi8rUUNQZjl3SkoybFIrY2FnT0R4elBWcGxNSEcybzgvTHFDdnh6elZPUDUxeXdLZEtxaUMwSVEKQ0I5T2wwWW5scE9UNEh1b2hSUzBPOStlMm9KdFZsNUIyczRpbDlhZ3RTVXFxUlU9Ci0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0K"
tls.key: "LS0tLS1CRUdJTiBQUklWQVRFIEtFWS0tLS0tCk1JSUV2UUlCQURBTkJna3Foa2lHOXcwQkFRRUZBQVNDQktjd2dnU2pBZ0VBQW9JQkFRQ2RhaURFZlZsZHdkbFIKd1V5eFpJWmVEZWNuTkFhbWh4d1NpeWF5N1AvOE9ta3NVQ3FCWmNpQ0RzZUh2dGtzbzlCSzhBZi9WemFhWm9zcApnZjYzUlZuZmNmVUlRQUN3WHhHVFhHMXJKVEVGSzhRSHA3VkpMcnpLUC9QOUxZcFlYTE0yYzZ3MmtjZUNmZitrCkU1bEVlNUJVbUNUV09UM3c4S1lPNzFLSWVuNEZJWTZMMDUrc2JGQmd1Z0ExUE5JdWFubm9UTWtlZTRuMG4rTDQKb3NCM01ZUDhtQmtRQlAzeE9JNHl3YjREZXUraURyU2pKSHJzQmlIT05Xc0RadXJFaXVJMmdoY1kxeWIyWHI2UAozVFVOcGNSbC9pVG9zQngxcHJHclk4V09HZVdPeGxZZmcvbWIvNnBuOUYvNWxlQlkrZStjSTlTMkQ0YXBKWUdpCkwxeHZzVWtGQWdNQkFBRUNnZ0VBZFhCK0xkbk8ySElOTGo5bWRsb25IUGlHWWVzZ294RGQwci9hQ1Zkank4dlEKTjIwL3FQWkUxek1yall6Ry9kVGhTMmMwc0QxaTBXSjdwR1lGb0xtdXlWTjltY0FXUTM5SjM0VHZaU2FFSWZWNgo5TE1jUHhNTmFsNjRLMFRVbUFQZytGam9QSFlhUUxLOERLOUtnNXNrSE5pOWNzMlY5ckd6VWlVZWtBL0RBUlBTClI3L2ZjUFBacDRuRWVBZmI3WTk1R1llb1p5V21SU3VKdlNyblBESGtUdW1vVlVWdkxMRHRzaG9reUxiTWVtN3oKMmJzVmpwSW1GTHJqbGtmQXlpNHg0WjJrV3YyMFRrdWtsZU1jaVlMbjk4QWxiRi9DSmRLM3QraTRoMTVlR2ZQegpoTnh3bk9QdlVTaDR2Q0o3c2Q5TmtEUGJvS2JneVVHOXBYamZhRGR2UVFLQmdRRFFLM01nUkhkQ1pKNVFqZWFKClFGdXF4cHdnNzhZTjQyL1NwenlUYmtGcVFoQWtyczJxWGx1MDZBRzhrZzIzQkswaHkzaE9zSGgxcXRVK3NHZVAKOWRERHBsUWV0ODZsY2FlR3hoc0V0L1R6cEdtNGFKSm5oNzVVaTVGZk9QTDhPTm1FZ3MxMVRhUldhNzZxelRyMgphRlpjQ2pWV1g0YnRSTHVwSkgrMjZnY0FhUUtCZ1FEQmxVSUUzTnNVOFBBZEYvL25sQVB5VWs1T3lDdWc3dmVyClUycXlrdXFzYnBkSi9hODViT1JhM05IVmpVM25uRGpHVHBWaE9JeXg5TEFrc2RwZEFjVmxvcG9HODhXYk9lMTAKMUdqbnkySmdDK3JVWUZiRGtpUGx1K09IYnRnOXFYcGJMSHBzUVpsMGhucDBYSFNYVm9CMUliQndnMGEyOFVadApCbFBtWmc2d1BRS0JnRHVIUVV2SDZHYTNDVUsxNFdmOFhIcFFnMU16M2VvWTBPQm5iSDRvZUZKZmcraEppSXlnCm9RN3hqWldVR3BIc3AyblRtcHErQWlSNzdyRVhsdlhtOElVU2FsbkNiRGlKY01Pc29RdFBZNS9NczJMRm5LQTQKaENmL0pWb2FtZm1nZEN0ZGtFMXNINE9MR2lJVHdEbTRpb0dWZGIwMllnbzFyb2htNUpLMUI3MkpBb0dBUW01UQpHNDhXOTVhL0w1eSt5dCsyZ3YvUHM2VnBvMjZlTzRNQ3lJazJVem9ZWE9IYnNkODJkaC8xT2sybGdHZlI2K3VuCnc1YytZUXRSTHlhQmd3MUtpbGhFZDBKTWU3cGpUSVpnQWJ0LzVPbnlDak9OVXN2aDJjS2lrQ1Z2dTZsZlBjNkQKckliT2ZIaHhxV0RZK2Q1TGN1YSt2NzJ0RkxhenJsSlBsRzlOZHhrQ2dZRUF5elIzT3UyMDNRVVV6bUlCRkwzZAp4Wm5XZ0JLSEo3TnNxcGFWb2RjL0d5aGVycjFDZzE2MmJaSjJDV2RsZkI0VEdtUjZZdmxTZEFOOFRwUWhFbUtKCnFBLzVzdHdxNWd0WGVLOVJmMWxXK29xNThRNTBxMmk1NVdUTThoSDZhTjlaMTltZ0FGdE5VdGNqQUx2dFYxdEYKWSs4WFJkSHJaRnBIWll2NWkwVW1VbGc9Ci0tLS0tRU5EIFBSSVZBVEUgS0VZLS0tLS0K"
Now create the secrets using the file:
kubectl apply -f nginxsecrets.yaml
kubectl get secrets
NAME TYPE DATA AGE
default-token-il9rc kubernetes.io/service-account-token 1 1d
nginxsecret kubernetes.io/tls 2 1m
Now modify your nginx replicas to start an https server using the certificate in the secret, and the Service, to expose both ports (80 and 443):
apiVersion: v1
kind: Service
metadata:
name: my-nginx
labels:
run: my-nginx
spec:
type: NodePort
ports:
- port: 8080
targetPort: 80
protocol: TCP
name: http
- port: 443
protocol: TCP
name: https
selector:
run: my-nginx
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-nginx
spec:
selector:
matchLabels:
run: my-nginx
replicas: 1
template:
metadata:
labels:
run: my-nginx
spec:
volumes:
- name: secret-volume
secret:
secretName: nginxsecret
- name: configmap-volume
configMap:
name: nginxconfigmap
containers:
- name: nginxhttps
image: bprashanth/nginxhttps:1.0
ports:
- containerPort: 443
- containerPort: 80
volumeMounts:
- mountPath: /etc/nginx/ssl
name: secret-volume
- mountPath: /etc/nginx/conf.d
name: configmap-volume
Noteworthy points about the nginx-secure-app manifest:
- It contains both Deployment and Service specification in the same file.
- The nginx server serves HTTP traffic on port 80 and HTTPS traffic on 443, and the nginx Service exposes both ports.
- Each container has access to the keys through a volume mounted at /etc/nginx/ssl. This is set up before the nginx server is started.
kubectl delete deployments,svc my-nginx; kubectl create -f ./nginx-secure-app.yaml
At this point you can reach the nginx server from any node.
kubectl get pods -o yaml | grep -i podip
podIP: 10.244.3.5
node $ curl -k https://10.244.3.5
...
<h1>Welcome to nginx!</h1>
Note how we supplied the -k parameter to curl in the last step; this is because we don't know anything about the pods running nginx at certificate generation time, so we have to tell curl to ignore the CName mismatch. By creating a Service, we linked the CName used in the certificate with the actual DNS name used by pods during Service lookup.
Let's test this from a pod (the same secret is being reused for simplicity, the pod only needs nginx.crt to access the Service):
apiVersion: apps/v1
kind: Deployment
metadata:
name: curl-deployment
spec:
selector:
matchLabels:
app: curlpod
replicas: 1
template:
metadata:
labels:
app: curlpod
spec:
volumes:
- name: secret-volume
secret:
secretName: nginxsecret
containers:
- name: curlpod
command:
- sh
- -c
- while true; do sleep 1; done
image: radial/busyboxplus:curl
volumeMounts:
- mountPath: /etc/nginx/ssl
name: secret-volume
kubectl apply -f ./curlpod.yaml
kubectl get pods -l app=curlpod
NAME READY STATUS RESTARTS AGE
curl-deployment-1515033274-1410r 1/1 Running 0 1m
kubectl exec curl-deployment-1515033274-1410r -- curl https://my-nginx --cacert /etc/nginx/ssl/tls.crt
...
<title>Welcome to nginx!</title>
...
Exposing the Service
For some parts of your applications you may want to expose a Service onto an external IP address. Kubernetes supports two ways of doing this: NodePorts and LoadBalancers. The Service created in the last section already used NodePort, so your nginx HTTPS replica is ready to serve traffic on the internet if your node has a public IP.
kubectl get svc my-nginx -o yaml | grep nodePort -C 5
uid: 07191fb3-f61a-11e5-8ae5-42010af00002
spec:
clusterIP: 10.0.162.149
ports:
- name: http
nodePort: 31704
port: 8080
protocol: TCP
targetPort: 80
- name: https
nodePort: 32453
port: 443
protocol: TCP
targetPort: 443
selector:
run: my-nginx
kubectl get nodes -o yaml | grep ExternalIP -C 1
- address: 104.197.41.11
type: ExternalIP
allocatable:
--
- address: 23.251.152.56
type: ExternalIP
allocatable:
...
$ curl https://<EXTERNAL-IP>:<NODE-PORT> -k
...
<h1>Welcome to nginx!</h1>
Let's now recreate the Service to use a cloud load balancer. Change the Type of the my-nginx Service from NodePort to LoadBalancer:
kubectl edit svc my-nginx
kubectl get svc my-nginx
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
my-nginx LoadBalancer 10.0.162.149 xx.xxx.xxx.xxx 8080:30163/TCP 21s
curl https://<EXTERNAL-IP> -k
...
<title>Welcome to nginx!</title>
The IP address in the EXTERNAL-IP column is the one that is available on the public internet. The CLUSTER-IP is only available inside your cluster/private cloud network.
Note that on AWS, type LoadBalancer creates an ELB, which uses a (long) hostname, not an IP. It's too long to fit in the standard kubectl get svc output, so you'll need to run kubectl describe service my-nginx to see it. You'll see something like this:
kubectl describe service my-nginx
...
LoadBalancer Ingress: a320587ffd19711e5a37606cf4a74574-1142138393.us-east-1.elb.amazonaws.com
...
What's next
- Learn more about Using a Service to Access an Application in a Cluster
- Learn more about Connecting a Front End to a Back End Using a Service
- Learn more about Creating an External Load Balancer
3.5.5 - Ingress
Kubernetes v1.19 [stable]
An API object that manages external access to the services in a cluster, typically HTTP.
Ingress may provide load balancing, SSL termination and name-based virtual hosting.
Terminology
For clarity, this guide defines the following terms:
- Node: A worker machine in Kubernetes, part of a cluster.
- Cluster: A set of Nodes that run containerized applications managed by Kubernetes. For this example, and in most common Kubernetes deployments, nodes in the cluster are not part of the public internet.
- Edge router: A router that enforces the firewall policy for your cluster. This could be a gateway managed by a cloud provider or a physical piece of hardware.
- Cluster network: A set of links, logical or physical, that facilitate communication within a cluster according to the Kubernetes networking model.
- Service: A Kubernetes Service that identifies a set of Pods using label selectors. Unless mentioned otherwise, Services are assumed to have virtual IPs only routable within the cluster network.
What is Ingress?
Ingress exposes HTTP and HTTPS routes from outside the cluster to services within the cluster. Traffic routing is controlled by rules defined on the Ingress resource.
Here is a simple example where an Ingress sends all its traffic to one Service:
An Ingress may be configured to give Services externally-reachable URLs, load balance traffic, terminate SSL / TLS, and offer name-based virtual hosting. An Ingress controller is responsible for fulfilling the Ingress, usually with a load balancer, though it may also configure your edge router or additional frontends to help handle the traffic.
An Ingress does not expose arbitrary ports or protocols. Exposing services other than HTTP and HTTPS to the internet typically uses a service of type Service.Type=NodePort or Service.Type=LoadBalancer.
Prerequisites
You must have an Ingress controller to satisfy an Ingress. Only creating an Ingress resource has no effect.
You may need to deploy an Ingress controller such as ingress-nginx. You can choose from a number of Ingress controllers.
Ideally, all Ingress controllers should fit the reference specification. In reality, the various Ingress controllers operate slightly differently.
The Ingress resource
A minimal Ingress resource example:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: minimal-ingress
annotations:
nginx.ingress.kubernetes.io/rewrite-target: /
spec:
ingressClassName: nginx-example
rules:
- http:
paths:
- path: /testpath
pathType: Prefix
backend:
service:
name: test
port:
number: 80
As with all other Kubernetes resources, an Ingress needs apiVersion, kind, and metadata fields. The name of an Ingress object must be a valid DNS subdomain name.
For general information about working with config files, see deploying applications, configuring containers, managing resources.
Ingress frequently uses annotations to configure some options depending on the Ingress controller, an example of which is the rewrite-target annotation. Different Ingress controllers support different annotations. Review the documentation for your choice of Ingress controller to learn which annotations are supported.
The Ingress spec has all the information needed to configure a load balancer or proxy server. Most importantly, it contains a list of rules matched against all incoming requests. The Ingress resource only supports rules for directing HTTP(S) traffic.
If the ingressClassName is omitted, a default Ingress class should be defined.
There are some ingress controllers that work without the definition of a default IngressClass. For example, the Ingress-NGINX controller can be configured with the flag --watch-ingress-without-class. It is recommended, though, to specify the default IngressClass as shown below.
Ingress rules
Each HTTP rule contains the following information:
- An optional host. In this example, no host is specified, so the rule applies to all inbound HTTP traffic through the IP address specified. If a host is provided (for example, foo.bar.com), the rules apply to that host.
- A list of paths (for example, /testpath), each of which has an associated backend defined with a service.name and a service.port.name or service.port.number. Both the host and path must match the content of an incoming request before the load balancer directs traffic to the referenced Service.
- A backend is a combination of Service and port names as described in the Service doc or a custom resource backend by way of a CRD. HTTP (and HTTPS) requests to the Ingress that match the host and path of the rule are sent to the listed backend.
A defaultBackend is often configured in an Ingress controller to service any requests that do not match a path in the spec.
DefaultBackend
An Ingress with no rules sends all traffic to a single default backend and .spec.defaultBackend is the backend that should handle requests in that case. The defaultBackend is conventionally a configuration option of the Ingress controller and is not specified in your Ingress resources. If no .spec.rules are specified, .spec.defaultBackend must be specified. If defaultBackend is not set, the handling of requests that do not match any of the rules will be up to the ingress controller (consult the documentation for your ingress controller to find out how it handles this case).
If none of the hosts or paths match the HTTP request in the Ingress objects, the traffic is routed to your default backend.
Resource backends
A Resource backend is an ObjectRef to another Kubernetes resource within the same namespace as the Ingress object. A Resource is a mutually exclusive setting with Service, and will fail validation if both are specified. A common usage for a Resource backend is to ingress data to an object storage backend with static assets.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: ingress-resource-backend
spec:
defaultBackend:
resource:
apiGroup: k8s.example.com
kind: StorageBucket
name: static-assets
rules:
- http:
paths:
- path: /icons
pathType: ImplementationSpecific
backend:
resource:
apiGroup: k8s.example.com
kind: StorageBucket
name: icon-assets
After creating the Ingress above, you can view it with the following command:
kubectl describe ingress ingress-resource-backend
Name: ingress-resource-backend
Namespace: default
Address:
Default backend: APIGroup: k8s.example.com, Kind: StorageBucket, Name: static-assets
Rules:
Host Path Backends
---- ---- --------
*
/icons APIGroup: k8s.example.com, Kind: StorageBucket, Name: icon-assets
Annotations: <none>
Events: <none>
Path types
Each path in an Ingress is required to have a corresponding path type. Paths that do not include an explicit pathType will fail validation. There are three supported path types:
- ImplementationSpecific: With this path type, matching is up to the IngressClass. Implementations can treat this as a separate pathType or treat it identically to Prefix or Exact path types.
- Exact: Matches the URL path exactly and with case sensitivity.
- Prefix: Matches based on a URL path prefix split by /. Matching is case sensitive and done on a path element by element basis. A path element refers to the list of labels in the path split by the / separator. A request matches path p if every element of p is, element by element, a prefix of the request path. Note: If the last element of the path is a substring of the last element in the request path, it is not a match (for example: /foo/bar matches /foo/bar/baz, but does not match /foo/barbaz).
Examples
Kind | Path(s) | Request path(s) | Matches?
---|---|---|---
Prefix | / | (all paths) | Yes
Exact | /foo | /foo | Yes
Exact | /foo | /bar | No
Exact | /foo | /foo/ | No
Exact | /foo/ | /foo | No
Prefix | /foo | /foo, /foo/ | Yes
Prefix | /foo/ | /foo, /foo/ | Yes
Prefix | /aaa/bb | /aaa/bbb | No
Prefix | /aaa/bbb | /aaa/bbb | Yes
Prefix | /aaa/bbb/ | /aaa/bbb | Yes, ignores trailing slash
Prefix | /aaa/bbb | /aaa/bbb/ | Yes, matches trailing slash
Prefix | /aaa/bbb | /aaa/bbb/ccc | Yes, matches subpath
Prefix | /aaa/bbb | /aaa/bbbxyz | No, does not match string prefix
Prefix | /, /aaa | /aaa/ccc | Yes, matches /aaa prefix
Prefix | /, /aaa, /aaa/bbb | /aaa/bbb | Yes, matches /aaa/bbb prefix
Prefix | /, /aaa, /aaa/bbb | /ccc | Yes, matches / prefix
Prefix | /aaa | /ccc | No, uses default backend
Mixed | /foo (Prefix), /foo (Exact) | /foo | Yes, prefers Exact
Multiple matches
In some cases, multiple paths within an Ingress will match a request. In those cases precedence will be given first to the longest matching path. If two paths are still equally matched, precedence will be given to paths with an exact path type over prefix path type.
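As a hedged sketch of that precedence (service names are hypothetical), a request for /foo/bar/baz would be routed to service-foobar because /foo/bar is the longest matching path, while /foo/qux would be routed to service-foo:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: longest-match-example
spec:
  rules:
  - http:
      paths:
      - path: /foo            # shorter Prefix match
        pathType: Prefix
        backend:
          service:
            name: service-foo
            port:
              number: 80
      - path: /foo/bar        # longer Prefix match wins for /foo/bar/*
        pathType: Prefix
        backend:
          service:
            name: service-foobar
            port:
              number: 80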
Hostname wildcards
Hosts can be precise matches (for example "foo.bar.com") or a wildcard (for example "*.foo.com"). Precise matches require that the HTTP host header matches the host field. Wildcard matches require the HTTP host header is equal to the suffix of the wildcard rule.
Host | Host header | Match?
---|---|---
*.foo.com | bar.foo.com | Matches based on shared suffix
*.foo.com | baz.bar.foo.com | No match, wildcard only covers a single DNS label
*.foo.com | foo.com | No match, wildcard only covers a single DNS label
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: ingress-wildcard-host
spec:
rules:
- host: "foo.bar.com"
http:
paths:
- pathType: Prefix
path: "/bar"
backend:
service:
name: service1
port:
number: 80
- host: "*.foo.com"
http:
paths:
- pathType: Prefix
path: "/foo"
backend:
service:
name: service2
port:
number: 80
Ingress class
Ingresses can be implemented by different controllers, often with different configuration. Each Ingress should specify a class, a reference to an IngressClass resource that contains additional configuration including the name of the controller that should implement the class.
apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
name: external-lb
spec:
controller: example.com/ingress-controller
parameters:
apiGroup: k8s.example.com
kind: IngressParameters
name: external-lb
The .spec.parameters field of an IngressClass lets you reference another resource that provides configuration related to that IngressClass. The specific type of parameters to use depends on the ingress controller that you specify in the .spec.controller field of the IngressClass.
IngressClass scope
Depending on your ingress controller, you may be able to use parameters that you set cluster-wide, or just for one namespace.
The default scope for IngressClass parameters is cluster-wide.
If you set the .spec.parameters field and don't set .spec.parameters.scope, or if you set .spec.parameters.scope to Cluster, then the IngressClass refers to a cluster-scoped resource. The kind (in combination with the apiGroup) of the parameters refers to a cluster-scoped API (possibly a custom resource), and the name of the parameters identifies a specific cluster-scoped resource for that API.
For example:
---
apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
name: external-lb-1
spec:
controller: example.com/ingress-controller
parameters:
# The parameters for this IngressClass are specified in a
# ClusterIngressParameter (API group k8s.example.net) named
# "external-config-1". This definition tells Kubernetes to
# look for a cluster-scoped parameter resource.
scope: Cluster
apiGroup: k8s.example.net
kind: ClusterIngressParameter
name: external-config-1
Kubernetes v1.23 [stable]
If you set the .spec.parameters field and set .spec.parameters.scope to Namespace, then the IngressClass refers to a namespace-scoped resource. You must also set the namespace field within .spec.parameters to the namespace that contains the parameters you want to use. The kind (in combination with the apiGroup) of the parameters refers to a namespaced API (for example: ConfigMap), and the name of the parameters identifies a specific resource in the namespace you specified in namespace.
Namespace-scoped parameters help the cluster operator delegate control over the configuration (for example: load balancer settings, API gateway definition) that is used for a workload. If you used a cluster-scoped parameter then either:
- the cluster operator team needs to approve a different team's changes every time there's a new configuration change being applied.
- the cluster operator must define specific access controls, such as RBAC roles and bindings, that let the application team make changes to the cluster-scoped parameters resource.
The IngressClass API itself is always cluster-scoped.
Here is an example of an IngressClass that refers to parameters that are namespaced:
---
apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
name: external-lb-2
spec:
controller: example.com/ingress-controller
parameters:
# The parameters for this IngressClass are specified in an
# IngressParameter (API group k8s.example.com) named "external-config",
# that's in the "external-configuration" namespace.
scope: Namespace
apiGroup: k8s.example.com
kind: IngressParameter
namespace: external-configuration
name: external-config
Deprecated annotation
Before the IngressClass resource and ingressClassName field were added in Kubernetes 1.18, Ingress classes were specified with a kubernetes.io/ingress.class annotation on the Ingress. This annotation was never formally defined, but was widely supported by Ingress controllers.
The newer ingressClassName field on Ingresses is a replacement for that annotation, but is not a direct equivalent. While the annotation was generally used to reference the name of the Ingress controller that should implement the Ingress, the field is a reference to an IngressClass resource that contains additional Ingress configuration, including the name of the Ingress controller.
Default IngressClass
You can mark a particular IngressClass as default for your cluster. Setting the ingressclass.kubernetes.io/is-default-class annotation to true on an IngressClass resource will ensure that new Ingresses without an ingressClassName field specified will be assigned this default IngressClass.
Caution: If you have more than one IngressClass marked as the default for your cluster, the admission controller prevents creating new Ingress objects that don't have an ingressClassName specified. You can resolve this by ensuring that at most 1 IngressClass is marked as default in your cluster.
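One hedged way to check which IngressClasses are currently marked as default (this relies on kubectl's JSONPath filter syntax with escaped dots in the annotation key):
kubectl get ingressclass -o jsonpath='{.items[?(@.metadata.annotations.ingressclass\.kubernetes\.io/is-default-class=="true")].metadata.name}'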
There are some ingress controllers that work without the definition of a default IngressClass. For example, the Ingress-NGINX controller can be configured with the flag --watch-ingress-without-class. It is recommended, though, to specify the default IngressClass:
apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
labels:
app.kubernetes.io/component: controller
name: nginx-example
annotations:
ingressclass.kubernetes.io/is-default-class: "true"
spec:
controller: k8s.io/ingress-nginx
Types of Ingress
Ingress backed by a single Service
There are existing Kubernetes concepts that allow you to expose a single Service (see alternatives). You can also do this with an Ingress by specifying a default backend with no rules.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: test-ingress
spec:
defaultBackend:
service:
name: test
port:
number: 80
If you create it using kubectl apply -f you should be able to view the state of the Ingress you added:
kubectl get ingress test-ingress
NAME CLASS HOSTS ADDRESS PORTS AGE
test-ingress external-lb * 203.0.113.123 80 59s
Where 203.0.113.123 is the IP allocated by the Ingress controller to satisfy this Ingress.
Note: Ingress controllers and load balancers may take a minute or two to allocate an IP address. Until that time, you often see the address listed as <pending>.
Simple fanout
A fanout configuration routes traffic from a single IP address to more than one Service, based on the HTTP URI being requested. An Ingress allows you to keep the number of load balancers down to a minimum. For example, a setup like:
would require an Ingress such as:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: simple-fanout-example
spec:
rules:
- host: foo.bar.com
http:
paths:
- path: /foo
pathType: Prefix
backend:
service:
name: service1
port:
number: 4200
- path: /bar
pathType: Prefix
backend:
service:
name: service2
port:
number: 8080
When you create the Ingress with kubectl apply -f:
kubectl describe ingress simple-fanout-example
Name: simple-fanout-example
Namespace: default
Address: 178.91.123.132
Default backend: default-http-backend:80 (10.8.2.3:8080)
Rules:
Host Path Backends
---- ---- --------
foo.bar.com
/foo service1:4200 (10.8.0.90:4200)
/bar service2:8080 (10.8.0.91:8080)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ADD 22s loadbalancer-controller default/test
The Ingress controller provisions an implementation-specific load balancer that satisfies the Ingress, as long as the Services (service1, service2) exist. When it has done so, you can see the address of the load balancer at the Address field.
Name based virtual hosting
Name-based virtual hosts support routing HTTP traffic to multiple host names at the same IP address.
The following Ingress tells the backing load balancer to route requests based on the Host header.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: name-virtual-host-ingress
spec:
rules:
- host: foo.bar.com
http:
paths:
- pathType: Prefix
path: "/"
backend:
service:
name: service1
port:
number: 80
- host: bar.foo.com
http:
paths:
- pathType: Prefix
path: "/"
backend:
service:
name: service2
port:
number: 80
If you create an Ingress resource without any hosts defined in the rules, then any web traffic to the IP address of your Ingress controller can be matched without a name based virtual host being required.
For example, the following Ingress routes traffic requested for first.bar.com to service1, second.bar.com to service2, and any traffic whose request host header doesn't match first.bar.com and second.bar.com to service3.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: name-virtual-host-ingress-no-third-host
spec:
rules:
- host: first.bar.com
http:
paths:
- pathType: Prefix
path: "/"
backend:
service:
name: service1
port:
number: 80
- host: second.bar.com
http:
paths:
- pathType: Prefix
path: "/"
backend:
service:
name: service2
port:
number: 80
- http:
paths:
- pathType: Prefix
path: "/"
backend:
service:
name: service3
port:
number: 80
TLS
You can secure an Ingress by specifying a Secret that contains a TLS private key and certificate. The Ingress resource only supports a single TLS port, 443, and assumes TLS termination at the ingress point (traffic to the Service and its Pods is in plaintext). If the TLS configuration section in an Ingress specifies different hosts, they are multiplexed on the same port according to the hostname specified through the SNI TLS extension (provided the Ingress controller supports SNI). The TLS secret must contain keys named tls.crt and tls.key that contain the certificate and private key to use for TLS. For example:
apiVersion: v1
kind: Secret
metadata:
name: testsecret-tls
namespace: default
data:
tls.crt: base64 encoded cert
tls.key: base64 encoded key
type: kubernetes.io/tls
Referencing this secret in an Ingress tells the Ingress controller to secure the channel from the client to the load balancer using TLS. You need to make sure the TLS secret you created came from a certificate that contains a Common Name (CN), also known as a Fully Qualified Domain Name (FQDN), for https-example.foo.com.
Note: The hosts in the tls section need to explicitly match the host in the rules section.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: tls-example-ingress
spec:
tls:
- hosts:
- https-example.foo.com
secretName: testsecret-tls
rules:
- host: https-example.foo.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: service1
port:
number: 80
Load balancing
An Ingress controller is bootstrapped with some load balancing policy settings that it applies to all Ingress, such as the load balancing algorithm, backend weight scheme, and others. More advanced load balancing concepts (e.g. persistent sessions, dynamic weights) are not yet exposed through the Ingress. You can instead get these features through the load balancer used for a Service.
It's also worth noting that even though health checks are not exposed directly through the Ingress, there exist parallel concepts in Kubernetes such as readiness probes that allow you to achieve the same end result. Please review the controller specific documentation to see how they handle health checks (for example: nginx, or GCE).
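For instance, a minimal sketch of an HTTP readiness probe on the nginx container from earlier (the path and timings are illustrative); pods failing the probe are removed from the Service's endpoints and stop receiving traffic:
containers:
- name: nginxhttps
  image: bprashanth/nginxhttps:1.0
  ports:
  - containerPort: 80
  readinessProbe:
    httpGet:
      path: /index.html   # any path the server is known to serve
      port: 80
    initialDelaySeconds: 5
    periodSeconds: 10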
Updating an Ingress
To update an existing Ingress to add a new Host, you can update it by editing the resource:
kubectl describe ingress test
Name: test
Namespace: default
Address: 178.91.123.132
Default backend: default-http-backend:80 (10.8.2.3:8080)
Rules:
Host Path Backends
---- ---- --------
foo.bar.com
/foo service1:80 (10.8.0.90:80)
Annotations:
nginx.ingress.kubernetes.io/rewrite-target: /
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ADD 35s loadbalancer-controller default/test
kubectl edit ingress test
This pops up an editor with the existing configuration in YAML format. Modify it to include the new Host:
spec:
rules:
- host: foo.bar.com
http:
paths:
- backend:
service:
name: service1
port:
number: 80
path: /foo
pathType: Prefix
- host: bar.baz.com
http:
paths:
- backend:
service:
name: service2
port:
number: 80
path: /foo
pathType: Prefix
..
After you save your changes, kubectl updates the resource in the API server, which tells the Ingress controller to reconfigure the load balancer.
Verify this:
kubectl describe ingress test
Name: test
Namespace: default
Address: 178.91.123.132
Default backend: default-http-backend:80 (10.8.2.3:8080)
Rules:
Host Path Backends
---- ---- --------
foo.bar.com
/foo service1:80 (10.8.0.90:80)
bar.baz.com
/foo service2:80 (10.8.0.91:80)
Annotations:
nginx.ingress.kubernetes.io/rewrite-target: /
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ADD 45s loadbalancer-controller default/test
You can achieve the same outcome by invoking kubectl replace -f on a modified Ingress YAML file.
Failing across availability zones
Techniques for spreading traffic across failure domains differ between cloud providers. Please check the documentation of the relevant Ingress controller for details.
Alternatives
You can expose a Service in multiple ways that don't directly involve the Ingress resource:
What's next
- Learn about the Ingress API
- Learn about Ingress controllers
- Set up Ingress on Minikube with the NGINX Controller
3.5.6 - Ingress Controllers
In order for the Ingress resource to work, the cluster must have an ingress controller running.
Unlike other types of controllers which run as part of the kube-controller-manager binary, Ingress controllers are not started automatically with a cluster. Use this page to choose the ingress controller implementation that best fits your cluster.
Kubernetes as a project supports and maintains AWS, GCE, and nginx ingress controllers.
Additional controllers
- AKS Application Gateway Ingress Controller is an ingress controller that configures the Azure Application Gateway.
- Ambassador API Gateway is an Envoy-based ingress controller.
- Apache APISIX ingress controller is an Apache APISIX-based ingress controller.
- Avi Kubernetes Operator provides L4-L7 load-balancing using VMware NSX Advanced Load Balancer.
- BFE Ingress Controller is a BFE-based ingress controller.
- The Citrix ingress controller works with Citrix Application Delivery Controller.
- Contour is an Envoy based ingress controller.
- EnRoute is an Envoy based API gateway that can run as an ingress controller.
- Easegress IngressController is an Easegress based API gateway that can run as an ingress controller.
- F5 BIG-IP Container Ingress Services for Kubernetes lets you use an Ingress to configure F5 BIG-IP virtual servers.
- Gloo is an open-source ingress controller based on Envoy, which offers API gateway functionality.
- HAProxy Ingress is an ingress controller for HAProxy.
- The HAProxy Ingress Controller for Kubernetes is also an ingress controller for HAProxy.
- Istio Ingress is an Istio based ingress controller.
- The Kong Ingress Controller for Kubernetes is an ingress controller driving Kong Gateway.
- The NGINX Ingress Controller for Kubernetes works with the NGINX webserver (as a proxy).
- The Pomerium Ingress Controller is based on Pomerium, which offers context-aware access policy.
- Skipper HTTP router and reverse proxy for service composition, including use cases like Kubernetes Ingress, designed as a library to build your custom proxy.
- The Traefik Kubernetes Ingress provider is an ingress controller for the Traefik proxy.
- Tyk Operator extends Ingress with Custom Resources to bring API Management capabilities to Ingress. Tyk Operator works with the Open Source Tyk Gateway & Tyk Cloud control plane.
- Voyager is an ingress controller for HAProxy.
Using multiple Ingress controllers
You may deploy any number of ingress controllers using ingress class within a cluster. Note the .metadata.name of your ingress class resource. When you create an ingress you would need that name to specify the ingressClassName field on your Ingress object (refer to the IngressSpec v1 reference). ingressClassName is a replacement of the older annotation method.
If you do not specify an IngressClass for an Ingress, and your cluster has exactly one IngressClass marked as default, then Kubernetes applies the cluster's default IngressClass to the Ingress.
You mark an IngressClass as default by setting the ingressclass.kubernetes.io/is-default-class annotation on that IngressClass, with the string value "true".
Ideally, all ingress controllers should fulfill this specification, but the various ingress controllers operate slightly differently.
What's next
- Learn more about Ingress.
- Set up Ingress on Minikube with the NGINX Controller.
3.5.7 - EndpointSlices
Kubernetes v1.21 [stable]
EndpointSlices provide a simple way to track network endpoints within a Kubernetes cluster. They offer a more scalable and extensible alternative to Endpoints.
Motivation
The Endpoints API has provided a simple and straightforward way of tracking network endpoints in Kubernetes. Unfortunately as Kubernetes clusters and Services have grown to handle and send more traffic to more backend Pods, limitations of that original API became more visible. Most notably, those included challenges with scaling to larger numbers of network endpoints.
Since all network endpoints for a Service were stored in a single Endpoints resource, those resources could get quite large. That affected the performance of Kubernetes components (notably the control plane) and resulted in significant amounts of network traffic and processing when Endpoints changed. EndpointSlices help you mitigate those issues as well as provide an extensible platform for additional features such as topological routing.
EndpointSlice resources
In Kubernetes, an EndpointSlice contains references to a set of network endpoints. The control plane automatically creates EndpointSlices for any Kubernetes Service that has a selector specified. These EndpointSlices include references to all the Pods that match the Service selector. EndpointSlices group network endpoints together by unique combinations of protocol, port number, and Service name. The name of an EndpointSlice object must be a valid DNS subdomain name.
As an example, here's a sample EndpointSlice resource for the example
Kubernetes Service.
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
name: example-abc
labels:
kubernetes.io/service-name: example
addressType: IPv4
ports:
- name: http
protocol: TCP
port: 80
endpoints:
- addresses:
- "10.1.2.3"
conditions:
ready: true
hostname: pod-1
nodeName: node-1
zone: us-west2-a
By default, the control plane creates and manages EndpointSlices to have no more than 100 endpoints each. You can configure this with the --max-endpoints-per-slice kube-controller-manager flag, up to a maximum of 1000.
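For example, a hedged sketch of lowering that limit (all the other flags your control plane needs are omitted here):
kube-controller-manager --max-endpoints-per-slice=500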
EndpointSlices can act as the source of truth for kube-proxy when it comes to how to route internal traffic. When enabled, they should provide a performance improvement for services with large numbers of endpoints.
Address types
EndpointSlices support three address types:
- IPv4
- IPv6
- FQDN (Fully Qualified Domain Name)
Conditions
The EndpointSlice API stores conditions about endpoints that may be useful for consumers.
The three conditions are ready, serving, and terminating.
Ready
ready is a condition that maps to a Pod's Ready condition. A running Pod with the Ready condition set to True should have this EndpointSlice condition also set to true. For compatibility reasons, ready is NEVER true when a Pod is terminating. Consumers should refer to the serving condition to inspect the readiness of terminating Pods. The only exception to this rule is for Services with spec.publishNotReadyAddresses set to true. Endpoints for these Services will always have the ready condition set to true.
Serving
Kubernetes v1.20 [alpha]
serving is identical to the ready condition, except it does not account for terminating states. Consumers of the EndpointSlice API should check this condition if they care about pod readiness while the pod is also terminating.
Although serving is almost identical to ready, it was added to prevent breaking the existing meaning of ready. It could be unexpected for existing clients if ready were true for terminating endpoints, since historically terminating endpoints were never included in the Endpoints or EndpointSlice API to begin with. For this reason, ready is always false for terminating endpoints, and a new condition serving was added in v1.20 so that clients can track readiness for terminating pods independent of the existing semantics for ready.
Terminating
Kubernetes v1.20 [alpha]
terminating is a condition that indicates whether an endpoint is terminating. For pods, this is any pod that has a deletion timestamp set.
Topology information
Each endpoint within an EndpointSlice can contain relevant topology information. The topology information includes the location of the endpoint and information about the corresponding Node and zone. These are available in the following per endpoint fields on EndpointSlices:
- nodeName - The name of the Node this endpoint is on.
- zone - The zone this endpoint is in.
In the v1 API, the per endpoint topology was effectively removed in favor of the dedicated fields nodeName and zone. Setting arbitrary topology fields on the endpoint field of an EndpointSlice resource has been deprecated and is not supported in the v1 API. Instead, the v1 API supports setting individual nodeName and zone fields. These fields are automatically translated between API versions. For example, the value of the "topology.kubernetes.io/zone" key in the topology field in the v1beta1 API is accessible as the zone field in the v1 API.
Management
Most often, the control plane (specifically, the endpoint slice controller) creates and manages EndpointSlice objects. There are a variety of other use cases for EndpointSlices, such as service mesh implementations, that could result in other entities or controllers managing additional sets of EndpointSlices.
To ensure that multiple entities can manage EndpointSlices without interfering with each other, Kubernetes defines the label endpointslice.kubernetes.io/managed-by, which indicates the entity managing an EndpointSlice. The endpoint slice controller sets endpointslice-controller.k8s.io as the value for this label on all EndpointSlices it manages. Other entities managing EndpointSlices should also set a unique value for this label.
Ownership
In most use cases, EndpointSlices are owned by the Service that the endpoint slice object tracks endpoints for. This ownership is indicated by an owner reference on each EndpointSlice as well as a kubernetes.io/service-name label that enables simple lookups of all EndpointSlices belonging to a Service.
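For example, you can list every EndpointSlice belonging to the example Service from the sample above by selecting on that label:
kubectl get endpointslices -l kubernetes.io/service-name=example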
EndpointSlice mirroring
In some cases, applications create custom Endpoints resources. To ensure that these applications do not need to concurrently write to both Endpoints and EndpointSlice resources, the cluster's control plane mirrors most Endpoints resources to corresponding EndpointSlices.
The control plane mirrors Endpoints resources unless:
- the Endpoints resource has an endpointslice.kubernetes.io/skip-mirror label set to true.
- the Endpoints resource has a control-plane.alpha.kubernetes.io/leader annotation.
- the corresponding Service resource does not exist.
- the corresponding Service resource has a non-nil selector.
Individual Endpoints resources may translate into multiple EndpointSlices. This will occur if an Endpoints resource has multiple subsets or includes endpoints with multiple IP families (IPv4 and IPv6). A maximum of 1000 addresses per subset will be mirrored to EndpointSlices.
Distribution of EndpointSlices
Each EndpointSlice has a set of ports that applies to all endpoints within the resource. When named ports are used for a Service, Pods may end up with different target port numbers for the same named port, requiring different EndpointSlices. This is similar to the logic behind how subsets are grouped with Endpoints.
The control plane tries to fill EndpointSlices as full as possible, but does not actively rebalance them. The logic is fairly straightforward:
- Iterate through existing EndpointSlices, remove endpoints that are no longer desired and update matching endpoints that have changed.
- Iterate through EndpointSlices that have been modified in the first step and fill them up with any new endpoints needed.
- If there's still new endpoints left to add, try to fit them into a previously unchanged slice and/or create new ones.
Importantly, the third step prioritizes limiting EndpointSlice updates over a perfectly full distribution of EndpointSlices. As an example, if there are 10 new endpoints to add and 2 EndpointSlices with room for 5 more endpoints each, this approach will create a new EndpointSlice instead of filling up the 2 existing EndpointSlices. In other words, a single EndpointSlice creation is preferable to multiple EndpointSlice updates.
With kube-proxy running on each Node and watching EndpointSlices, every change to an EndpointSlice becomes relatively expensive since it will be transmitted to every Node in the cluster. This approach is intended to limit the number of changes that need to be sent to every Node, even if it may result in multiple EndpointSlices that are not full.
In practice, this less than ideal distribution should be rare. Most changes processed by the EndpointSlice controller will be small enough to fit in an existing EndpointSlice, and if not, a new EndpointSlice is likely going to be necessary soon anyway. Rolling updates of Deployments also provide a natural repacking of EndpointSlices with all Pods and their corresponding endpoints getting replaced.
Duplicate endpoints
Due to the nature of EndpointSlice changes, endpoints may be represented in more than one EndpointSlice at the same time. This naturally occurs as changes to different EndpointSlice objects can arrive at the Kubernetes client watch/cache at different times. Implementations using EndpointSlice must be able to have the endpoint appear in more than one slice. A reference implementation of how to perform endpoint deduplication can be found in the EndpointSliceCache implementation in kube-proxy.
What's next
3.5.8 - Service Internal Traffic Policy
Kubernetes v1.23 [beta]
Service Internal Traffic Policy enables internal traffic restrictions to only route internal traffic to endpoints within the node the traffic originated from. The "internal" traffic here refers to traffic originated from Pods in the current cluster. This can help to reduce costs and improve performance.
Using Service Internal Traffic Policy
The ServiceInternalTrafficPolicy feature gate is a Beta feature and enabled by default. When the feature is enabled, you can enable the internal-only traffic policy for a Service by setting its .spec.internalTrafficPolicy to Local. This tells kube-proxy to only use node local endpoints for cluster internal traffic.
The following example shows what a Service looks like when you set .spec.internalTrafficPolicy to Local:
apiVersion: v1
kind: Service
metadata:
name: my-service
spec:
selector:
app: MyApp
ports:
- protocol: TCP
port: 80
targetPort: 9376
internalTrafficPolicy: Local
How it works
The kube-proxy filters the endpoints it routes to based on the spec.internalTrafficPolicy setting. When it's set to Local, only node local endpoints are considered. When it's Cluster or missing, all endpoints are considered.
When the feature gate ServiceInternalTrafficPolicy is enabled, spec.internalTrafficPolicy defaults to "Cluster".
Constraints
- Service Internal Traffic Policy is not used when externalTrafficPolicy is set to Local on a Service. It is possible to use both features in the same cluster on different Services, just not on the same Service.
What's next
- Read about Topology Aware Hints
- Read about Service External Traffic Policy
- Read Connecting Applications with Services
3.5.9 - Topology Aware Hints
Kubernetes v1.23 [beta]
Topology Aware Hints enable topology aware routing by including suggestions for how clients should consume endpoints. This approach adds metadata to enable consumers of EndpointSlice and / or Endpoints objects, so that traffic to those network endpoints can be routed closer to where it originated.
For example, you can route traffic within a locality to reduce costs, or to improve network performance.
Note: This functionality is controlled by the TopologyAwareHints feature gate.
Motivation
Kubernetes clusters are increasingly deployed in multi-zone environments. Topology Aware Hints provides a mechanism to help keep traffic within the zone it originated from. This concept is commonly referred to as "Topology Aware Routing". When calculating the endpoints for a Service, the EndpointSlice controller considers the topology (region and zone) of each endpoint and populates the hints field to allocate it to a zone. Cluster components such as the kube-proxy can then consume those hints, and use them to influence how the traffic is routed (favoring topologically closer endpoints).
Using Topology Aware Hints
You can activate Topology Aware Hints for a Service by setting the service.kubernetes.io/topology-aware-hints annotation to auto. This tells the EndpointSlice controller to set topology hints if it is deemed safe. Importantly, this does not guarantee that hints will always be set.
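For example, a hedged one-liner that sets the annotation on a hypothetical Service named my-service:
kubectl annotate service my-service service.kubernetes.io/topology-aware-hints=auto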
How it works
The functionality enabling this feature is split into two components: The EndpointSlice controller and the kube-proxy. This section provides a high level overview of how each component implements this feature.
EndpointSlice controller
The EndpointSlice controller is responsible for setting hints on EndpointSlices when this feature is enabled. The controller allocates a proportional amount of endpoints to each zone. This proportion is based on the allocatable CPU cores for nodes running in that zone. For example, if one zone had 2 CPU cores and another zone only had 1 CPU core, the controller would allocate twice as many endpoints to the zone with 2 CPU cores.
The following example shows what an EndpointSlice looks like when hints have been populated:
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
name: example-hints
labels:
kubernetes.io/service-name: example-svc
addressType: IPv4
ports:
- name: http
protocol: TCP
port: 80
endpoints:
- addresses:
- "10.1.2.3"
conditions:
ready: true
hostname: pod-1
zone: zone-a
hints:
forZones:
- name: "zone-a"
kube-proxy
The kube-proxy component filters the endpoints it routes to based on the hints set by the EndpointSlice controller. In most cases, this means that the kube-proxy is able to route traffic to endpoints in the same zone. Sometimes the controller allocates endpoints from a different zone to ensure more even distribution of endpoints between zones. This would result in some traffic being routed to other zones.
Safeguards
The Kubernetes control plane and the kube-proxy on each node apply some safeguard rules before using Topology Aware Hints. If these don't check out, the kube-proxy selects endpoints from anywhere in your cluster, regardless of the zone.
- Insufficient number of endpoints: If there are fewer endpoints than zones in a cluster, the controller will not assign any hints.
- Impossible to achieve balanced allocation: In some cases, it will be impossible to achieve a balanced allocation of endpoints among zones. For example, if zone-a is twice as large as zone-b, but there are only 2 endpoints, an endpoint allocated to zone-a may receive twice as much traffic as zone-b. The controller does not assign hints if it can't get this "expected overload" value below an acceptable threshold for each zone. Importantly, this is not based on real-time feedback. It is still possible for individual endpoints to become overloaded.
- One or more Nodes has insufficient information: If any node does not have a topology.kubernetes.io/zone label or is not reporting a value for allocatable CPU, the control plane does not set any topology-aware endpoint hints and so kube-proxy does not filter endpoints by zone.
- One or more endpoints does not have a zone hint: When this happens, the kube-proxy assumes that a transition from or to Topology Aware Hints is underway. Filtering endpoints for a Service in this state would be dangerous so the kube-proxy falls back to using all endpoints.
- A zone is not represented in hints: If the kube-proxy is unable to find at least one endpoint with a hint targeting the zone it is running in, it falls back to using endpoints from all zones. This is most likely to happen as you add a new zone into your existing cluster.
Constraints
- Topology Aware Hints are not used when either externalTrafficPolicy or internalTrafficPolicy is set to Local on a Service. It is possible to use both features in the same cluster on different Services, just not on the same Service.
- This approach will not work well for Services that have a large proportion of traffic originating from a subset of zones. Instead, this assumes that incoming traffic will be roughly proportional to the capacity of the Nodes in each zone.
- The EndpointSlice controller ignores unready nodes as it calculates the proportions of each zone. This could have unintended consequences if a large portion of nodes are unready.
- The EndpointSlice controller does not take tolerations into account when calculating the proportions of each zone. If the Pods backing a Service are limited to a subset of Nodes in the cluster, this will not be taken into account.
- This may not work well with autoscaling. For example, if a lot of traffic is originating from a single zone, only the endpoints allocated to that zone will be handling that traffic. That could result in the Horizontal Pod Autoscaler either not picking up on this event, or newly added pods starting in a different zone.
What's next
3.5.10 - Network Policies
If you want to control traffic flow at the IP address or port level (OSI layer 3 or 4), then you might consider using Kubernetes NetworkPolicies for particular applications in your cluster. NetworkPolicies are an application-centric construct which allow you to specify how a pod is allowed to communicate with various network "entities" (we use the word "entity" here to avoid overloading the more common terms such as "endpoints" and "services", which have specific Kubernetes connotations) over the network. NetworkPolicies apply to a connection with a pod on one or both ends, and are not relevant to other connections.
The entities that a Pod can communicate with are identified through a combination of the following 3 identifiers:
- Other pods that are allowed (exception: a pod cannot block access to itself)
- Namespaces that are allowed
- IP blocks (exception: traffic to and from the node where a Pod is running is always allowed, regardless of the IP address of the Pod or the node)
When defining a pod- or namespace-based NetworkPolicy, you use a selector to specify what traffic is allowed to and from the Pod(s) that match the selector.
Meanwhile, when IP-based NetworkPolicies are created, we define policies based on IP blocks (CIDR ranges).
Prerequisites
Network policies are implemented by the network plugin. To use network policies, you must be using a networking solution which supports NetworkPolicy. Creating a NetworkPolicy resource without a controller that implements it will have no effect.
The Two Sorts of Pod Isolation
There are two sorts of isolation for a pod: isolation for egress, and isolation for ingress. They concern what connections may be established. "Isolation" here is not absolute, rather it means "some restrictions apply". The alternative, "non-isolated for $direction", means that no restrictions apply in the stated direction. The two sorts of isolation (or not) are declared independently, and are both relevant for a connection from one pod to another.
By default, a pod is non-isolated for egress; all outbound connections are allowed. A pod is isolated for egress if there is any NetworkPolicy that both selects the pod and has "Egress" in its policyTypes; we say that such a policy applies to the pod for egress. When a pod is isolated for egress, the only allowed connections from the pod are those allowed by the egress list of some NetworkPolicy that applies to the pod for egress. The effects of those egress lists combine additively.
By default, a pod is non-isolated for ingress; all inbound connections are allowed. A pod is isolated for ingress if there is any NetworkPolicy that both selects the pod and has "Ingress" in its policyTypes; we say that such a policy applies to the pod for ingress. When a pod is isolated for ingress, the only allowed connections into the pod are those from the pod's node and those allowed by the ingress list of some NetworkPolicy that applies to the pod for ingress. The effects of those ingress lists combine additively.
Network policies do not conflict; they are additive. If any policy or policies apply to a given pod for a given direction, the connections allowed in that direction from that pod are the union of what the applicable policies allow. Thus, order of evaluation does not affect the policy result.
For a connection from a source pod to a destination pod to be allowed, both the egress policy on the source pod and the ingress policy on the destination pod need to allow the connection. If either side does not allow the connection, it will not happen.
The NetworkPolicy resource
See the NetworkPolicy reference for a full definition of the resource.
An example NetworkPolicy might look like this:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: test-network-policy
  namespace: default
spec:
  podSelector:
    matchLabels:
      role: db
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - ipBlock:
            cidr: 172.17.0.0/16
            except:
              - 172.17.1.0/24
        - namespaceSelector:
            matchLabels:
              project: myproject
        - podSelector:
            matchLabels:
              role: frontend
      ports:
        - protocol: TCP
          port: 6379
  egress:
    - to:
        - ipBlock:
            cidr: 10.0.0.0/24
      ports:
        - protocol: TCP
          port: 5978
Mandatory Fields: As with all other Kubernetes config, a NetworkPolicy needs apiVersion, kind, and metadata fields. For general information about working with config files, see Configure Containers Using a ConfigMap, and Object Management.
spec: NetworkPolicy spec has all the information needed to define a particular network policy in the given namespace.
podSelector: Each NetworkPolicy includes a podSelector which selects the grouping of pods to which the policy applies. The example policy selects pods with the label "role=db". An empty podSelector selects all pods in the namespace.
policyTypes: Each NetworkPolicy includes a policyTypes list which may include either Ingress, Egress, or both. The policyTypes field indicates whether the given policy applies to ingress traffic to the selected pods, egress traffic from the selected pods, or both. If no policyTypes are specified on a NetworkPolicy then by default Ingress will always be set and Egress will be set if the NetworkPolicy has any egress rules.
ingress: Each NetworkPolicy may include a list of allowed ingress rules. Each rule allows traffic which matches both the from and ports sections. The example policy contains a single rule, which matches traffic on a single port, from one of three sources, the first specified via an ipBlock, the second via a namespaceSelector and the third via a podSelector.
egress: Each NetworkPolicy may include a list of allowed egress rules. Each rule allows traffic which matches both the to and ports sections. The example policy contains a single rule, which matches traffic on a single port to any destination in 10.0.0.0/24.
So, the example NetworkPolicy:
- isolates "role=db" pods in the "default" namespace for both ingress and egress traffic (if they weren't already isolated)
- (Ingress rules) allows connections to all pods in the "default" namespace with the label "role=db" on TCP port 6379 from:
  - any pod in the "default" namespace with the label "role=frontend"
  - any pod in a namespace with the label "project=myproject"
  - IP addresses in the ranges 172.17.0.0–172.17.0.255 and 172.17.2.0–172.17.255.255 (i.e., all of 172.17.0.0/16 except 172.17.1.0/24)
- (Egress rules) allows connections from any pod in the "default" namespace with the label "role=db" to CIDR 10.0.0.0/24 on TCP port 5978
See the Declare Network Policy walkthrough for further examples.
Behavior of to and from selectors
There are four kinds of selectors that can be specified in an ingress from section or egress to section:
podSelector: This selects particular Pods in the same namespace as the NetworkPolicy which should be allowed as ingress sources or egress destinations.
namespaceSelector: This selects particular namespaces for which all Pods should be allowed as ingress sources or egress destinations.
namespaceSelector and podSelector: A single to/from entry that specifies both namespaceSelector and podSelector selects particular Pods within particular namespaces. Be careful to use correct YAML syntax; this policy:
...
ingress:
  - from:
      - namespaceSelector:
          matchLabels:
            user: alice
        podSelector:
          matchLabels:
            role: client
...
contains a single from element allowing connections from Pods with the label role=client in namespaces with the label user=alice. But this policy:
...
ingress:
  - from:
      - namespaceSelector:
          matchLabels:
            user: alice
      - podSelector:
          matchLabels:
            role: client
...
contains two elements in the from array, and allows connections from Pods in the local Namespace with the label role=client, or from any Pod in any namespace with the label user=alice.
When in doubt, use kubectl describe to see how Kubernetes has interpreted the policy.
ipBlock: This selects particular IP CIDR ranges to allow as ingress sources or egress destinations. These should be cluster-external IPs, since Pod IPs are ephemeral and unpredictable.
Cluster ingress and egress mechanisms often require rewriting the source or destination IP of packets. In cases where this happens, it is not defined whether this happens before or after NetworkPolicy processing, and the behavior may be different for different combinations of network plugin, cloud provider, Service implementation, etc.
In the case of ingress, this means that in some cases you may be able to filter incoming packets based on the actual original source IP, while in other cases, the "source IP" that the NetworkPolicy acts on may be the IP of a LoadBalancer or of the Pod's node, etc.
For egress, this means that connections from pods to Service IPs that get rewritten to cluster-external IPs may or may not be subject to ipBlock-based policies.
Default policies
By default, if no policies exist in a namespace, then all ingress and egress traffic is allowed to and from pods in that namespace. The following examples let you change the default behavior in that namespace.
Default deny all ingress traffic
You can create a "default" ingress isolation policy for a namespace by creating a NetworkPolicy that selects all pods but does not allow any ingress traffic to those pods.
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
spec:
  podSelector: {}
  policyTypes:
    - Ingress
This ensures that even pods that aren't selected by any other NetworkPolicy will still be isolated for ingress. This policy does not affect isolation for egress from any pod.
Allow all ingress traffic
If you want to allow all incoming connections to all pods in a namespace, you can create a policy that explicitly allows that.
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-all-ingress
spec:
  podSelector: {}
  ingress:
    - {}
  policyTypes:
    - Ingress
With this policy in place, no additional policy or policies can cause any incoming connection to those pods to be denied. This policy has no effect on isolation for egress from any pod.
Default deny all egress traffic
You can create a "default" egress isolation policy for a namespace by creating a NetworkPolicy that selects all pods but does not allow any egress traffic from those pods.
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-egress
spec:
  podSelector: {}
  policyTypes:
    - Egress
This ensures that even pods that aren't selected by any other NetworkPolicy will not be allowed egress traffic. This policy does not change the ingress isolation behavior of any pod.
Allow all egress traffic
If you want to allow all connections from all pods in a namespace, you can create a policy that explicitly allows all outgoing connections from pods in that namespace.
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-all-egress
spec:
  podSelector: {}
  egress:
    - {}
  policyTypes:
    - Egress
With this policy in place, no additional policy or policies can cause any outgoing connection from those pods to be denied. This policy has no effect on isolation for ingress to any pod.
Default deny all ingress and all egress traffic
You can create a "default" policy for a namespace which prevents all ingress AND egress traffic by creating the following NetworkPolicy in that namespace.
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
This ensures that even pods that aren't selected by any other NetworkPolicy will not be allowed ingress or egress traffic.
SCTP support
Kubernetes v1.20 [stable]
As a stable feature, this is enabled by default. To disable SCTP at a cluster level, you (or your cluster administrator) will need to disable the SCTPSupport feature gate for the API server with --feature-gates=SCTPSupport=false,….
When the feature gate is enabled, you can set the protocol field of a NetworkPolicy to SCTP.
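For illustration, a hedged sketch of a NetworkPolicy ingress rule that uses SCTP; the policy name, pod selector, and port number are arbitrary examples:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-sctp-ingress   # hypothetical name
spec:
  podSelector:
    matchLabels:
      role: db
  policyTypes:
    - Ingress
  ingress:
    - ports:
        - protocol: SCTP
          port: 7777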
Targeting a range of Ports
Kubernetes v1.22 [beta]
When writing a NetworkPolicy, you can target a range of ports instead of a single port. This is achievable with the usage of the endPort field, as in the following example:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: multi-port-egress
  namespace: default
spec:
  podSelector:
    matchLabels:
      role: db
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 10.0.0.0/24
      ports:
        - protocol: TCP
          port: 32000
          endPort: 32768
The above rule allows any Pod with label role=db in the namespace default to communicate with any IP within the range 10.0.0.0/24 over TCP, provided that the target port is between 32000 and 32768.
The following restrictions apply when using this field:
- As a beta feature, this is enabled by default. To disable the endPort field at a cluster level, you (or your cluster administrator) need to disable the NetworkPolicyEndPort feature gate for the API server with --feature-gates=NetworkPolicyEndPort=false,….
- The endPort field must be equal to or greater than the port field.
- endPort can only be defined if port is also defined.
- Both ports must be numeric.
Your cluster must be using a CNI plugin that supports the endPort field in NetworkPolicy specifications. If your network plugin does not support the endPort field and you specify a NetworkPolicy with that, the policy will be applied only for the single port field.
Targeting a Namespace by its name
Kubernetes v1.21 [beta]
The Kubernetes control plane sets an immutable label kubernetes.io/metadata.name on all namespaces, provided that the NamespaceDefaultLabelName feature gate is enabled. The value of the label is the namespace name.
While NetworkPolicy cannot target a namespace by its name with some object field, you can use the standardized label to target a specific namespace.
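For example, a hedged sketch of a policy that allows ingress only from a hypothetical namespace named my-namespace, matched via the standardized label:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-my-namespace   # hypothetical name
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: my-namespace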
What you can't do with network policies (at least, not yet)
As of Kubernetes 1.23, the following functionality does not exist in the NetworkPolicy API, but you might be able to implement workarounds using Operating System components (such as SELinux, OpenVSwitch, IPTables, and so on), Layer 7 technologies (Ingress controllers, Service Mesh implementations), or admission controllers. If you are new to network security in Kubernetes, it's worth noting that the following user stories cannot (yet) be implemented using the NetworkPolicy API.
- Forcing internal cluster traffic to go through a common gateway (this might be best served with a service mesh or other proxy).
- Anything TLS related (use a service mesh or ingress controller for this).
- Node specific policies (you can use CIDR notation for these, but you cannot target nodes by their Kubernetes identities specifically).
- Targeting of services by name (you can, however, target pods or namespaces by their labels, which is often a viable workaround).
- Creation or management of "Policy requests" that are fulfilled by a third party.
- Default policies which are applied to all namespaces or pods (there are some third party Kubernetes distributions and projects which can do this).
- Advanced policy querying and reachability tooling.
- The ability to log network security events (for example connections that are blocked or accepted).
- The ability to explicitly deny policies (currently the model for NetworkPolicies is deny by default, with only the ability to add allow rules).
- The ability to prevent loopback or incoming host traffic (Pods cannot currently block localhost access, nor do they have the ability to block access from their resident node).
What's next
- See the Declare Network Policy walkthrough for further examples.
- See more recipes for common scenarios enabled by the NetworkPolicy resource.
3.5.11 - IPv4/IPv6 dual-stack
Kubernetes v1.23 [stable]
IPv4/IPv6 dual-stack networking enables the allocation of both IPv4 and IPv6 addresses to Pods and Services.
IPv4/IPv6 dual-stack networking is enabled by default for your Kubernetes cluster starting in 1.21, allowing the simultaneous assignment of both IPv4 and IPv6 addresses.
Supported Features
IPv4/IPv6 dual-stack on your Kubernetes cluster provides the following features:
- Dual-stack Pod networking (a single IPv4 and IPv6 address assignment per Pod)
- IPv4 and IPv6 enabled Services
- Pod off-cluster egress routing (e.g. to the Internet) via both IPv4 and IPv6 interfaces
Prerequisites
The following prerequisites are needed in order to utilize IPv4/IPv6 dual-stack Kubernetes clusters:
- Kubernetes 1.20 or later. For information about using dual-stack services with earlier Kubernetes versions, refer to the documentation for that version of Kubernetes.
- Provider support for dual-stack networking (the cloud provider or otherwise must be able to provide Kubernetes nodes with routable IPv4/IPv6 network interfaces)
- A network plugin that supports dual-stack (such as Kubenet or Calico)
Configure IPv4/IPv6 dual-stack
To configure IPv4/IPv6 dual-stack, set dual-stack cluster network assignments:
- kube-apiserver:
  - --service-cluster-ip-range=<IPv4 CIDR>,<IPv6 CIDR>
- kube-controller-manager:
  - --cluster-cidr=<IPv4 CIDR>,<IPv6 CIDR>
  - --service-cluster-ip-range=<IPv4 CIDR>,<IPv6 CIDR>
  - --node-cidr-mask-size-ipv4|--node-cidr-mask-size-ipv6 defaults to /24 for IPv4 and /64 for IPv6
- kube-proxy:
  - --cluster-cidr=<IPv4 CIDR>,<IPv6 CIDR>
- kubelet:
  - when there is no --cloud-provider, the administrator can pass a comma-separated pair of IP addresses via --node-ip to manually configure dual-stack .status.addresses for that Node. If a Pod runs on that node in HostNetwork mode, the Pod reports these IP addresses in its .status.podIPs field. All podIPs in a node match the IP family preference defined by the .status.addresses field for that Node.
An example of an IPv4 CIDR: 10.244.0.0/16 (though you would supply your own address range)
An example of an IPv6 CIDR: fdXY:IJKL:MNOP:15::/64 (this shows the format but is not a valid address - see RFC 4193)
Services
You can create Services which can use IPv4, IPv6, or both.
The address family of a Service defaults to the address family of the first service cluster IP range (configured via the --service-cluster-ip-range flag to the kube-apiserver).
When you define a Service you can optionally configure it as dual stack. To specify the behavior you want, you set the .spec.ipFamilyPolicy field to one of the following values:
- SingleStack: Single-stack service. The control plane allocates a cluster IP for the Service, using the first configured service cluster IP range.
- PreferDualStack: Allocates IPv4 and IPv6 cluster IPs for the Service.
- RequireDualStack: Allocates Service .spec.ClusterIPs from both IPv4 and IPv6 address ranges, and selects the .spec.ClusterIP from the list of .spec.ClusterIPs based on the address family of the first element in the .spec.ipFamilies array.
If you would like to define which IP family to use for single stack or define the order of IP families for dual-stack, you can choose the address families by setting an optional field, .spec.ipFamilies, on the Service.
The .spec.ipFamilies field is immutable because the .spec.ClusterIP cannot be reallocated on a Service that already exists. If you want to change .spec.ipFamilies, delete and recreate the Service.
You can set .spec.ipFamilies to any of the following array values:
- ["IPv4"]
- ["IPv6"]
- ["IPv4","IPv6"] (dual stack)
- ["IPv6","IPv4"] (dual stack)
The first family you list is used for the legacy .spec.ClusterIP field.
Dual-stack Service configuration scenarios
These examples demonstrate the behavior of various dual-stack Service configuration scenarios.
Dual-stack options on new Services
- This Service specification does not explicitly define .spec.ipFamilyPolicy. When you create this Service, Kubernetes assigns a cluster IP for the Service from the first configured service-cluster-ip-range and sets the .spec.ipFamilyPolicy to SingleStack. (Services without selectors and headless Services with selectors will behave in this same way.)
apiVersion: v1
kind: Service
metadata:
  name: my-service
  labels:
    app: MyApp
spec:
  selector:
    app: MyApp
  ports:
    - protocol: TCP
      port: 80
- This Service specification explicitly defines PreferDualStack in .spec.ipFamilyPolicy. When you create this Service on a dual-stack cluster, Kubernetes assigns both IPv4 and IPv6 addresses for the service. The control plane updates the .spec for the Service to record the IP address assignments. The field .spec.ClusterIPs is the primary field, and contains both assigned IP addresses; .spec.ClusterIP is a secondary field with its value calculated from .spec.ClusterIPs.
  - For the .spec.ClusterIP field, the control plane records the IP address that is from the same address family as the first service cluster IP range.
  - On a single-stack cluster, the .spec.ClusterIPs and .spec.ClusterIP fields both only list one address.
  - On a cluster with dual-stack enabled, specifying RequireDualStack in .spec.ipFamilyPolicy behaves the same as PreferDualStack.
apiVersion: v1
kind: Service
metadata:
  name: my-service
  labels:
    app: MyApp
spec:
  ipFamilyPolicy: PreferDualStack
  selector:
    app: MyApp
  ports:
    - protocol: TCP
      port: 80
- This Service specification explicitly defines IPv6 and IPv4 in .spec.ipFamilies as well as defining PreferDualStack in .spec.ipFamilyPolicy. When Kubernetes assigns an IPv6 and IPv4 address in .spec.ClusterIPs, .spec.ClusterIP is set to the IPv6 address because that is the first element in the .spec.ClusterIPs array, overriding the default.
apiVersion: v1
kind: Service
metadata:
  name: my-service
  labels:
    app: MyApp
spec:
  ipFamilyPolicy: PreferDualStack
  ipFamilies:
    - IPv6
    - IPv4
  selector:
    app: MyApp
  ports:
    - protocol: TCP
      port: 80
Dual-stack defaults on existing Services
These examples demonstrate the default behavior when dual-stack is newly enabled on a cluster where Services already exist. (Upgrading an existing cluster to 1.21 or beyond will enable dual-stack.)
- When dual-stack is enabled on a cluster, existing Services (whether IPv4 or IPv6) are configured by the control plane to set .spec.ipFamilyPolicy to SingleStack and set .spec.ipFamilies to the address family of the existing Service. The existing Service cluster IP will be stored in .spec.ClusterIPs.
apiVersion: v1
kind: Service
metadata:
  name: my-service
  labels:
    app: MyApp
spec:
  selector:
    app: MyApp
  ports:
    - protocol: TCP
      port: 80
You can validate this behavior by using kubectl to inspect an existing service.
kubectl get svc my-service -o yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    app: MyApp
  name: my-service
spec:
  clusterIP: 10.0.197.123
  clusterIPs:
    - 10.0.197.123
  ipFamilies:
    - IPv4
  ipFamilyPolicy: SingleStack
  ports:
    - port: 80
      protocol: TCP
      targetPort: 80
  selector:
    app: MyApp
  type: ClusterIP
status:
  loadBalancer: {}
- When dual-stack is enabled on a cluster, existing headless Services with selectors are configured by the control plane to set .spec.ipFamilyPolicy to SingleStack and set .spec.ipFamilies to the address family of the first service cluster IP range (configured via the --service-cluster-ip-range flag to the kube-apiserver) even though .spec.ClusterIP is set to None.
apiVersion: v1
kind: Service
metadata:
  name: my-service
  labels:
    app: MyApp
spec:
  selector:
    app: MyApp
  ports:
    - protocol: TCP
      port: 80
You can validate this behavior by using kubectl to inspect an existing headless service with selectors.
kubectl get svc my-service -o yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    app: MyApp
  name: my-service
spec:
  clusterIP: None
  clusterIPs:
    - None
  ipFamilies:
    - IPv4
  ipFamilyPolicy: SingleStack
  ports:
    - port: 80
      protocol: TCP
      targetPort: 80
  selector:
    app: MyApp
Switching Services between single-stack and dual-stack
Services can be changed from single-stack to dual-stack and from dual-stack to single-stack.
- To change a Service from single-stack to dual-stack, change .spec.ipFamilyPolicy from SingleStack to PreferDualStack or RequireDualStack as desired. When you change this Service from single-stack to dual-stack, Kubernetes assigns the missing address family so that the Service now has IPv4 and IPv6 addresses. Edit the Service specification, updating the .spec.ipFamilyPolicy from SingleStack to PreferDualStack.
Before:
spec:
  ipFamilyPolicy: SingleStack
After:
spec:
  ipFamilyPolicy: PreferDualStack
- To change a Service from dual-stack to single-stack, change .spec.ipFamilyPolicy from PreferDualStack or RequireDualStack to SingleStack. When you change this Service from dual-stack to single-stack, Kubernetes retains only the first element in the .spec.ClusterIPs array, and sets .spec.ClusterIP to that IP address and sets .spec.ipFamilies to the address family of .spec.ClusterIPs.
Headless Services without selector
For Headless Services without selectors and without .spec.ipFamilyPolicy explicitly set, the .spec.ipFamilyPolicy field defaults to RequireDualStack.
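A minimal sketch of such a headless Service without a selector (the name is hypothetical); on a dual-stack cluster, leaving .spec.ipFamilyPolicy unset here defaults it to RequireDualStack as described above:
apiVersion: v1
kind: Service
metadata:
  name: my-headless-service   # hypothetical name
spec:
  clusterIP: None
  ports:
    - protocol: TCP
      port: 80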
Service type LoadBalancer
To provision a dual-stack load balancer for your Service:
- Set the .spec.type field to LoadBalancer
- Set the .spec.ipFamilyPolicy field to PreferDualStack or RequireDualStack
To use a dual-stack LoadBalancer type Service, your cloud provider must support IPv4 and IPv6 load balancers.
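Putting those two steps together, a hedged sketch of a dual-stack LoadBalancer Service; the name and selector are hypothetical, and the cloud provider must support this configuration:
apiVersion: v1
kind: Service
metadata:
  name: my-service   # hypothetical name
spec:
  type: LoadBalancer
  ipFamilyPolicy: PreferDualStack
  selector:
    app: MyApp
  ports:
    - protocol: TCP
      port: 80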
Egress traffic
If you want to enable egress traffic in order to reach off-cluster destinations (e.g. the public Internet) from a Pod that uses non-publicly routable IPv6 addresses, you need to enable the Pod to use a publicly routed IPv6 address via a mechanism such as transparent proxying or IP masquerading. The ip-masq-agent project supports IP masquerading on dual-stack clusters.
What's next
3.6 - Storage
3.6.1 - Volumes
On-disk files in a container are ephemeral, which presents some problems for non-trivial applications when running in containers. One problem is the loss of files when a container crashes. The kubelet restarts the container but with a clean state. A second problem occurs when sharing files between containers running together in a Pod.
The Kubernetes volume abstraction solves both of these problems.
Familiarity with Pods is suggested.
Background
Docker has a concept of volumes, though it is somewhat looser and less managed. A Docker volume is a directory on disk or in another container. Docker provides volume drivers, but the functionality is somewhat limited.
Kubernetes supports many types of volumes. A Pod can use any number of volume types simultaneously. Ephemeral volume types have a lifetime of a pod, but persistent volumes exist beyond the lifetime of a pod. When a pod ceases to exist, Kubernetes destroys ephemeral volumes; however, Kubernetes does not destroy persistent volumes. For any kind of volume in a given pod, data is preserved across container restarts.
At its core, a volume is a directory, possibly with some data in it, which is accessible to the containers in a pod. How that directory comes to be, the medium that backs it, and the contents of it are determined by the particular volume type used.
To use a volume, specify the volumes to provide for the Pod in .spec.volumes and declare where to mount those volumes into containers in .spec.containers[*].volumeMounts.
A process in a container sees a filesystem view composed from the initial contents of
the container image, plus volumes
(if defined) mounted inside the container.
The process sees a root filesystem that initially matches the contents of the container
image.
Any writes to within that filesystem hierarchy, if allowed, affect what that process views
when it performs a subsequent filesystem access.
Volumes mount at the specified paths within
the image.
For each container defined within a Pod, you must independently specify where
to mount each volume that the container uses.
Volumes cannot mount within other volumes (but see Using subPath for a related mechanism). Also, a volume cannot contain a hard link to anything in a different volume.
Types of Volumes
Kubernetes supports several types of volumes.
awsElasticBlockStore
An awsElasticBlockStore volume mounts an Amazon Web Services (AWS) EBS volume into your pod. Unlike emptyDir, which is erased when a pod is removed, the contents of an EBS volume are persisted and the volume is unmounted. This means that an EBS volume can be pre-populated with data, and that data can be shared between pods.
You must create an EBS volume by using aws ec2 create-volume or the AWS API before you can use it.
There are some restrictions when using an awsElasticBlockStore volume:
- the nodes on which pods are running must be AWS EC2 instances
- those instances need to be in the same region and availability zone as the EBS volume
- EBS only supports a single EC2 instance mounting a volume
Creating an AWS EBS volume
Before you can use an EBS volume with a pod, you need to create it.
aws ec2 create-volume --availability-zone=eu-west-1a --size=10 --volume-type=gp2
Make sure the zone matches the zone you brought up your cluster in. Check that the size and EBS volume type are suitable for your use.
AWS EBS configuration example
apiVersion: v1
kind: Pod
metadata:
  name: test-ebs
spec:
  containers:
    - image: k8s.gcr.io/test-webserver
      name: test-container
      volumeMounts:
        - mountPath: /test-ebs
          name: test-volume
  volumes:
    - name: test-volume
      # This AWS EBS volume must already exist.
      awsElasticBlockStore:
        volumeID: "<volume id>"
        fsType: ext4
If the EBS volume is partitioned, you can supply the optional field partition: "<partition number>" to specify which partition to mount on.
AWS EBS CSI migration
Kubernetes v1.17 [beta]
The CSIMigration feature for awsElasticBlockStore, when enabled, redirects all plugin operations from the existing in-tree plugin to the ebs.csi.aws.com Container Storage Interface (CSI) driver. In order to use this feature, the AWS EBS CSI driver must be installed on the cluster and the CSIMigration and CSIMigrationAWS beta features must be enabled.
AWS EBS CSI migration complete
Kubernetes v1.17 [alpha]
To disable the awsElasticBlockStore storage plugin from being loaded by the controller manager and the kubelet, set the InTreePluginAWSUnregister flag to true.
azureDisk
The azureDisk volume type mounts a Microsoft Azure Data Disk into a pod.
For more details, see the azureDisk volume plugin.
azureDisk CSI migration
Kubernetes v1.19 [beta]
The CSIMigration feature for azureDisk, when enabled, redirects all plugin operations from the existing in-tree plugin to the disk.csi.azure.com Container Storage Interface (CSI) Driver. In order to use this feature, the Azure Disk CSI Driver must be installed on the cluster and the CSIMigration and CSIMigrationAzureDisk features must be enabled.
azureFile
The azureFile volume type mounts a Microsoft Azure File volume (SMB 2.1 and 3.0) into a pod.
For more details, see the azureFile volume plugin.
azureFile CSI migration
Kubernetes v1.21 [beta]
The CSIMigration feature for azureFile, when enabled, redirects all plugin operations from the existing in-tree plugin to the file.csi.azure.com Container Storage Interface (CSI) Driver. In order to use this feature, the Azure File CSI Driver must be installed on the cluster and the CSIMigration and CSIMigrationAzureFile feature gates must be enabled.
The Azure File CSI driver does not support using the same volume with different fsgroups. If Azure File CSI migration is enabled, using the same volume with different fsgroups is not supported at all.
cephfs
A cephfs volume allows an existing CephFS volume to be mounted into your Pod. Unlike emptyDir, which is erased when a pod is removed, the contents of a cephfs volume are preserved and the volume is merely unmounted. This means that a cephfs volume can be pre-populated with data, and that data can be shared between pods. The cephfs volume can be mounted by multiple writers simultaneously.
See the CephFS example for more details.
cinder
The cinder volume type is used to mount the OpenStack Cinder volume into your pod.
Cinder volume configuration example
apiVersion: v1
kind: Pod
metadata:
  name: test-cinder
spec:
  containers:
    - image: k8s.gcr.io/test-webserver
      name: test-cinder-container
      volumeMounts:
        - mountPath: /test-cinder
          name: test-volume
  volumes:
    - name: test-volume
      # This OpenStack volume must already exist.
      cinder:
        volumeID: "<volume id>"
        fsType: ext4
OpenStack CSI migration
Kubernetes v1.21 [beta]
The CSIMigration feature for Cinder is enabled by default in Kubernetes 1.21. It redirects all plugin operations from the existing in-tree plugin to the cinder.csi.openstack.org Container Storage Interface (CSI) Driver. The OpenStack Cinder CSI Driver must be installed on the cluster.
You can disable Cinder CSI migration for your cluster by setting the CSIMigrationOpenStack feature gate to false. If you disable the CSIMigrationOpenStack feature, the in-tree Cinder volume plugin takes responsibility for all aspects of Cinder volume storage management.
configMap
A ConfigMap provides a way to inject configuration data into pods. The data stored in a ConfigMap can be referenced in a volume of type configMap and then consumed by containerized applications running in a pod.
When referencing a ConfigMap, you provide the name of the ConfigMap in the volume. You can customize the path to use for a specific entry in the ConfigMap. The following configuration shows how to mount the log-config ConfigMap onto a Pod called configmap-pod:
apiVersion: v1
kind: Pod
metadata:
  name: configmap-pod
spec:
  containers:
    - name: test
      image: busybox:1.28
      volumeMounts:
        - name: config-vol
          mountPath: /etc/config
  volumes:
    - name: config-vol
      configMap:
        name: log-config
        items:
          - key: log_level
            path: log_level
The log-config ConfigMap is mounted as a volume, and all contents stored in its log_level entry are mounted into the Pod at path /etc/config/log_level. Note that this path is derived from the volume's mountPath and the path keyed with log_level.
downwardAPI
A downwardAPI volume makes downward API data available to applications. It mounts a directory and writes the requested data in plain text files.
A container using the downward API as a subPath volume mount will not receive downward API updates.
See the downward API example for more details.
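Since the linked example is not reproduced here, a minimal sketch of a downwardAPI volume that exposes the Pod's labels as a file; the Pod name, image, and paths are illustrative:
apiVersion: v1
kind: Pod
metadata:
  name: downwardapi-pod   # hypothetical name
  labels:
    app: example
spec:
  containers:
    - name: client-container
      image: busybox:1.28
      command: ["sh", "-c", "cat /etc/podinfo/labels && sleep 3600"]
      volumeMounts:
        - name: podinfo
          mountPath: /etc/podinfo
  volumes:
    - name: podinfo
      downwardAPI:
        items:
          - path: "labels"
            fieldRef:
              fieldPath: metadata.labels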
emptyDir
An emptyDir volume is first created when a Pod is assigned to a node, and exists as long as that Pod is running on that node. As the name says, the emptyDir volume is initially empty. All containers in the Pod can read and write the same files in the emptyDir volume, though that volume can be mounted at the same or different paths in each container. When a Pod is removed from a node for any reason, the data in the emptyDir is deleted permanently.
A container crashing does not remove a Pod from a node; the data in an emptyDir volume is safe across container crashes.
Some uses for an emptyDir are:
- scratch space, such as for a disk-based merge sort
- checkpointing a long computation for recovery from crashes
- holding files that a content-manager container fetches while a webserver container serves the data
Depending on your environment, emptyDir volumes are stored on whatever medium backs the node, such as disk or SSD, or network storage. However, if you set the emptyDir.medium field to "Memory", Kubernetes mounts a tmpfs (RAM-backed filesystem) for you instead. While tmpfs is very fast, be aware that unlike disks, tmpfs is cleared on node reboot and any files you write count against your container's memory limit.
If the SizeMemoryBackedVolumes feature gate is enabled, you can specify a size for memory backed volumes. If no size is specified, memory backed volumes are sized to 50% of the memory on a Linux host.
emptyDir configuration example
apiVersion: v1
kind: Pod
metadata:
  name: test-pd
spec:
  containers:
    - image: k8s.gcr.io/test-webserver
      name: test-container
      volumeMounts:
        - mountPath: /cache
          name: cache-volume
  volumes:
    - name: cache-volume
      emptyDir: {}
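As a variation on the example above, a hedged sketch of a memory-backed emptyDir with an explicit size limit; the Pod name is hypothetical, and sizing behavior follows the SizeMemoryBackedVolumes discussion earlier:
apiVersion: v1
kind: Pod
metadata:
  name: test-pd-memory   # hypothetical name
spec:
  containers:
    - image: k8s.gcr.io/test-webserver
      name: test-container
      volumeMounts:
        - mountPath: /cache
          name: cache-volume
  volumes:
    - name: cache-volume
      emptyDir:
        medium: Memory
        sizeLimit: 500Mi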
fc (fibre channel)
An fc volume type allows an existing fibre channel block storage volume to mount in a Pod. You can specify single or multiple target world wide names (WWNs) using the parameter targetWWNs in your Volume configuration. If multiple WWNs are specified, targetWWNs expect that those WWNs are from multi-path connections.
See the fibre channel example for more details.
flocker (deprecated)
Flocker is an open-source, clustered container data volume manager. Flocker provides management and orchestration of data volumes backed by a variety of storage backends.
A flocker volume allows a Flocker dataset to be mounted into a Pod. If the dataset does not already exist in Flocker, it needs to be first created with the Flocker CLI or by using the Flocker API. If the dataset already exists it will be reattached by Flocker to the node that the pod is scheduled on. This means data can be shared between pods as required.
See the Flocker example for more details.
gcePersistentDisk
A gcePersistentDisk volume mounts a Google Compute Engine (GCE) persistent disk (PD) into your Pod. Unlike emptyDir, which is erased when a pod is removed, the contents of a PD are preserved and the volume is merely unmounted. This means that a PD can be pre-populated with data, and that data can be shared between pods.
You must create a PD using gcloud or the GCE API or UI before you can use it.
There are some restrictions when using a gcePersistentDisk:
- the nodes on which Pods are running must be GCE VMs
- those VMs need to be in the same GCE project and zone as the persistent disk
One feature of GCE persistent disk is concurrent read-only access to a persistent disk. A gcePersistentDisk volume permits multiple consumers to simultaneously mount a persistent disk as read-only. This means that you can pre-populate a PD with your dataset and then serve it in parallel from as many Pods as you need. Unfortunately, PDs can only be mounted by a single consumer in read-write mode. Simultaneous writers are not allowed.
Using a GCE persistent disk with a Pod controlled by a ReplicaSet will fail unless the PD is read-only or the replica count is 0 or 1.
Creating a GCE persistent disk
Before you can use a GCE persistent disk with a Pod, you need to create it.
gcloud compute disks create --size=500GB --zone=us-central1-a my-data-disk
GCE persistent disk configuration example
apiVersion: v1
kind: Pod
metadata:
  name: test-pd
spec:
  containers:
    - image: k8s.gcr.io/test-webserver
      name: test-container
      volumeMounts:
        - mountPath: /test-pd
          name: test-volume
  volumes:
    - name: test-volume
      # This GCE PD must already exist.
      gcePersistentDisk:
        pdName: my-data-disk
        fsType: ext4
Regional persistent disks
The Regional persistent disks feature allows the creation of persistent disks that are available in two zones within the same region. In order to use this feature, the volume must be provisioned as a PersistentVolume; referencing the volume directly from a pod is not supported.
Manually provisioning a Regional PD PersistentVolume
Dynamic provisioning is possible using a StorageClass for GCE PD. Before creating a PersistentVolume, you must create the persistent disk:
gcloud compute disks create --size=500GB my-data-disk \
  --region us-central1 \
  --replica-zones us-central1-a,us-central1-b
Regional persistent disk configuration example
apiVersion: v1
kind: PersistentVolume
metadata:
  name: test-volume
spec:
  capacity:
    storage: 400Gi
  accessModes:
    - ReadWriteOnce
  gcePersistentDisk:
    pdName: my-data-disk
    fsType: ext4
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            # failure-domain.beta.kubernetes.io/zone should be used prior to 1.21
            - key: topology.kubernetes.io/zone
              operator: In
              values:
                - us-central1-a
                - us-central1-b
GCE CSI migration
Kubernetes v1.17 [beta]
The CSIMigration feature for GCE PD, when enabled, redirects all plugin operations from the existing in-tree plugin to the pd.csi.storage.gke.io Container Storage Interface (CSI) Driver. In order to use this feature, the GCE PD CSI Driver must be installed on the cluster and the CSIMigration and CSIMigrationGCE beta features must be enabled.
GCE CSI migration complete
Kubernetes v1.21 [alpha]
To disable the gcePersistentDisk storage plugin from being loaded by the controller manager and the kubelet, set the InTreePluginGCEUnregister flag to true.
gitRepo (deprecated)
The gitRepo volume type is deprecated. To provision a container with a git repo, mount an EmptyDir into an InitContainer that clones the repo using git, then mount the EmptyDir into the Pod's container.
A gitRepo volume is an example of a volume plugin. This plugin mounts an empty directory and clones a git repository into this directory for your Pod to use.
Here is an example of a gitRepo volume:
apiVersion: v1
kind: Pod
metadata:
  name: server
spec:
  containers:
    - image: nginx
      name: nginx
      volumeMounts:
        - mountPath: /mypath
          name: git-volume
  volumes:
    - name: git-volume
      gitRepo:
        repository: "git@somewhere:me/my-git-repository.git"
        revision: "22f1d8406d464b0c0874075539c1f2e96c253775"
glusterfs
A glusterfs volume allows a Glusterfs (an open source networked filesystem) volume to be mounted into your Pod. Unlike emptyDir, which is erased when a Pod is removed, the contents of a glusterfs volume are preserved and the volume is merely unmounted. This means that a glusterfs volume can be pre-populated with data, and that data can be shared between pods. GlusterFS can be mounted by multiple writers simultaneously.
See the GlusterFS example for more details.
hostPath
HostPath volumes present many security risks, and it is a best practice to avoid the use of HostPaths when possible. When a HostPath volume must be used, it should be scoped to only the required file or directory, and mounted as ReadOnly.
If restricting HostPath access to specific directories through AdmissionPolicy, volumeMounts MUST be required to use readOnly mounts for the policy to be effective.
A hostPath volume mounts a file or directory from the host node's filesystem into your Pod. This is not something that most Pods will need, but it offers a powerful escape hatch for some applications.
For example, some uses for a hostPath are:
- running a container that needs access to Docker internals; use a hostPath of /var/lib/docker
- running cAdvisor in a container; use a hostPath of /sys
- allowing a Pod to specify whether a given hostPath should exist prior to the Pod running, whether it should be created, and what it should exist as
In addition to the required path property, you can optionally specify a type for a hostPath volume.
The supported values for field type are:
Value | Behavior
---|---
(empty string) | Empty string (default) is for backward compatibility, which means that no checks will be performed before mounting the hostPath volume.
DirectoryOrCreate | If nothing exists at the given path, an empty directory will be created there as needed with permission set to 0755, having the same group and ownership with Kubelet.
Directory | A directory must exist at the given path
FileOrCreate | If nothing exists at the given path, an empty file will be created there as needed with permission set to 0644, having the same group and ownership with Kubelet.
File | A file must exist at the given path
Socket | A UNIX socket must exist at the given path
CharDevice | A character device must exist at the given path
BlockDevice | A block device must exist at the given path
Watch out when using this type of volume, because:
- HostPaths can expose privileged system credentials (such as for the Kubelet) or privileged APIs (such as the container runtime socket), which can be used for container escape or to attack other parts of the cluster.
- Pods with identical configuration (such as created from a PodTemplate) may behave differently on different nodes due to different files on the nodes.
- The files or directories created on the underlying hosts are only writable by root. You either need to run your process as root in a privileged Container or modify the file permissions on the host to be able to write to a hostPath volume.
hostPath configuration example
apiVersion: v1
kind: Pod
metadata:
  name: test-pd
spec:
  containers:
    - image: k8s.gcr.io/test-webserver
      name: test-container
      volumeMounts:
        - mountPath: /test-pd
          name: test-volume
  volumes:
    - name: test-volume
      hostPath:
        # directory location on host
        path: /data
        # this field is optional
        type: Directory
The FileOrCreate mode does not create the parent directory of the file. If the parent directory of the mounted file does not exist, the pod fails to start. To ensure that this mode works, you can try to mount directories and files separately, as shown in the FileOrCreate configuration example below.
hostPath FileOrCreate configuration example
apiVersion: v1
kind: Pod
metadata:
  name: test-webserver
spec:
  containers:
    - name: test-webserver
      image: k8s.gcr.io/test-webserver:latest
      volumeMounts:
        - mountPath: /var/local/aaa
          name: mydir
        - mountPath: /var/local/aaa/1.txt
          name: myfile
  volumes:
    - name: mydir
      hostPath:
        # Ensure the file directory is created.
        path: /var/local/aaa
        type: DirectoryOrCreate
    - name: myfile
      hostPath:
        path: /var/local/aaa/1.txt
        type: FileOrCreate
iscsi
An iscsi volume allows an existing iSCSI (SCSI over IP) volume to be mounted into your Pod. Unlike emptyDir, which is erased when a Pod is removed, the contents of an iscsi volume are preserved and the volume is merely unmounted. This means that an iscsi volume can be pre-populated with data, and that data can be shared between pods.
A feature of iSCSI is that it can be mounted as read-only by multiple consumers simultaneously. This means that you can pre-populate a volume with your dataset and then serve it in parallel from as many Pods as you need. Unfortunately, iSCSI volumes can only be mounted by a single consumer in read-write mode. Simultaneous writers are not allowed.
See the iSCSI example for more details.
local
A local volume represents a mounted local storage device such as a disk, partition or directory.
Local volumes can only be used as a statically created PersistentVolume. Dynamic provisioning is not supported.
Compared to hostPath volumes, local volumes are used in a durable and portable manner without manually scheduling pods to nodes. The system is aware of the volume's node constraints by looking at the node affinity on the PersistentVolume.
However, local volumes are subject to the availability of the underlying node and are not suitable for all applications. If a node becomes unhealthy, then the local volume becomes inaccessible to the pod. The pod using this volume is unable to run. Applications using local volumes must be able to tolerate this reduced availability, as well as potential data loss, depending on the durability characteristics of the underlying disk.
The following example shows a PersistentVolume using a local volume and nodeAffinity:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: example-pv
spec:
  capacity:
    storage: 100Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  storageClassName: local-storage
  local:
    path: /mnt/disks/ssd1
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - example-node
You must set a PersistentVolume nodeAffinity when using local volumes. The Kubernetes scheduler uses the PersistentVolume nodeAffinity to schedule these Pods to the correct node.
PersistentVolume volumeMode can be set to "Block" (instead of the default value "Filesystem") to expose the local volume as a raw block device.
When using local volumes, it is recommended to create a StorageClass with volumeBindingMode set to WaitForFirstConsumer. For more details, see the local StorageClass example.
Delaying volume binding ensures that the PersistentVolumeClaim binding decision
will also be evaluated with any other node constraints the Pod may have,
such as node resource requirements, node selectors, Pod affinity, and Pod anti-affinity.
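A minimal sketch of such a StorageClass, matching the local-storage class name used in the PersistentVolume example above; kubernetes.io/no-provisioner indicates that volumes are statically provisioned rather than created dynamically:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer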
An external static provisioner can be run separately for improved management of the local volume lifecycle. Note that this provisioner does not support dynamic provisioning yet. For an example on how to run an external local provisioner, see the local volume provisioner user guide.
nfs
An nfs volume allows an existing NFS (Network File System) share to be mounted into a Pod. Unlike emptyDir, which is erased when a Pod is removed, the contents of an nfs volume are preserved and the volume is merely unmounted. This means that an NFS volume can be pre-populated with data, and that data can be shared between pods. NFS can be mounted by multiple writers simultaneously.
See the NFS example for more details.
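Since the linked example is not shown inline, a hedged sketch of a Pod mounting an NFS share; the Pod name, server name, and export path are hypothetical:
apiVersion: v1
kind: Pod
metadata:
  name: test-nfs   # hypothetical name
spec:
  containers:
    - image: k8s.gcr.io/test-webserver
      name: test-container
      volumeMounts:
        - mountPath: /test-nfs
          name: nfs-volume
  volumes:
    - name: nfs-volume
      nfs:
        server: nfs.example.com   # hypothetical NFS server
        path: /exported/data      # hypothetical export path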
persistentVolumeClaim
A persistentVolumeClaim volume is used to mount a PersistentVolume into a Pod. PersistentVolumeClaims are a way for users to "claim" durable storage (such as a GCE PersistentDisk or an iSCSI volume) without knowing the details of the particular cloud environment.
See the information about PersistentVolumes for more details.
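A minimal sketch of a Pod mounting storage through a claim; the Pod name is hypothetical and the claim name my-claim must reference an existing PersistentVolumeClaim:
apiVersion: v1
kind: Pod
metadata:
  name: test-pvc   # hypothetical name
spec:
  containers:
    - image: k8s.gcr.io/test-webserver
      name: test-container
      volumeMounts:
        - mountPath: /data
          name: claim-volume
  volumes:
    - name: claim-volume
      persistentVolumeClaim:
        claimName: my-claim   # hypothetical existing PVC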
portworxVolume
A portworxVolume is an elastic block storage layer that runs hyperconverged with Kubernetes. Portworx fingerprints storage in a server, tiers based on capabilities, and aggregates capacity across multiple servers. Portworx runs in-guest in virtual machines or on bare metal Linux nodes.
A portworxVolume can be dynamically created through Kubernetes or it can also be pre-provisioned and referenced inside a Pod.
Here is an example Pod referencing a pre-provisioned Portworx volume:
Here is an example Pod referencing a pre-provisioned Portworx volume:
apiVersion: v1
kind: Pod
metadata:
  name: test-portworx-volume-pod
spec:
  containers:
    - image: k8s.gcr.io/test-webserver
      name: test-container
      volumeMounts:
        - mountPath: /mnt
          name: pxvol
  volumes:
    - name: pxvol
      # This Portworx volume must already exist.
      portworxVolume:
        volumeID: "pxvol"
        fsType: "<fs-type>"
Make sure you have an existing PortworxVolume with name pxvol before using it in the Pod.
For more details, see the Portworx volume examples.
projected
A projected volume maps several existing volume sources into the same directory. For more details, see projected volumes.
quobyte (deprecated)
A quobyte volume allows an existing Quobyte volume to be mounted into your Pod.
Quobyte supports the Container Storage Interface. CSI is the recommended plugin to use Quobyte volumes inside Kubernetes. Quobyte's GitHub project has instructions for deploying Quobyte using CSI, along with examples.
rbd
An rbd volume allows a Rados Block Device (RBD) volume to mount into your Pod. Unlike emptyDir, which is erased when a pod is removed, the contents of an rbd volume are preserved and the volume is merely unmounted. This means that an RBD volume can be pre-populated with data, and that data can be shared between pods.
A feature of RBD is that it can be mounted as read-only by multiple consumers simultaneously. This means that you can pre-populate a volume with your dataset and then serve it in parallel from as many pods as you need. Unfortunately, RBD volumes can only be mounted by a single consumer in read-write mode. Simultaneous writers are not allowed.
See the RBD example for more details.
RBD CSI migration
Kubernetes v1.23 [alpha]
The CSIMigration feature for RBD, when enabled, redirects all plugin operations from the existing in-tree plugin to the rbd.csi.ceph.com CSI driver. In order to use this feature, the Ceph CSI driver must be installed on the cluster and the CSIMigration and csiMigrationRBD feature gates must be enabled.
As a Kubernetes cluster operator that administers storage, here are the prerequisites that you must complete before you attempt migration to the RBD CSI driver:
- You must install the Ceph CSI driver (rbd.csi.ceph.com), v3.5.0 or above, into your Kubernetes cluster.
- Because the clusterID field is a required parameter for the CSI driver's operations, while the in-tree StorageClass instead requires the monitors field, a Kubernetes storage admin has to create a clusterID based on the hash of the monitors string (for example: echo -n '<monitors_string>' | md5sum) in the CSI config map and keep the monitors under this clusterID configuration.
- Also, if the value of adminId in the in-tree StorageClass is different from admin, the adminSecretName mentioned in the in-tree StorageClass has to be patched with the base64 value of the adminId parameter value; otherwise, this step can be skipped.
secret
A secret volume is used to pass sensitive information, such as passwords, to Pods. You can store secrets in the Kubernetes API and mount them as files for use by pods without coupling to Kubernetes directly. secret volumes are backed by tmpfs (a RAM-backed filesystem) so they are never written to non-volatile storage.
A container using a Secret as a subPath volume mount will not receive Secret updates.
For more details, see Configuring Secrets.
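For illustration, a hedged sketch of a Pod mounting a Secret as read-only files; the Pod name and the Secret name my-secret are hypothetical, and the Secret must already exist:
apiVersion: v1
kind: Pod
metadata:
  name: secret-test-pod   # hypothetical name
spec:
  containers:
    - name: test-container
      image: busybox:1.28
      command: ["sh", "-c", "ls /etc/secret-volume && sleep 3600"]
      volumeMounts:
        - name: secret-volume
          mountPath: /etc/secret-volume
          readOnly: true
  volumes:
    - name: secret-volume
      secret:
        secretName: my-secret   # hypothetical existing Secret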
storageOS (deprecated)
A storageos volume allows an existing StorageOS volume to mount into your Pod.
StorageOS runs as a container within your Kubernetes environment, making local or attached storage accessible from any node within the Kubernetes cluster. Data can be replicated to protect against node failure. Thin provisioning and compression can improve utilization and reduce cost.
At its core, StorageOS provides block storage to containers, accessible from a file system.
The StorageOS Container requires 64-bit Linux and has no additional dependencies. A free developer license is available.
The following example is a Pod configuration with StorageOS:
apiVersion: v1
kind: Pod
metadata:
  labels:
    name: redis
    role: master
  name: test-storageos-redis
spec:
  containers:
    - name: master
      image: kubernetes/redis:v1
      env:
        - name: MASTER
          value: "true"
      ports:
        - containerPort: 6379
      volumeMounts:
        - mountPath: /redis-master-data
          name: redis-data
  volumes:
    - name: redis-data
      storageos:
        # The `redis-vol01` volume must already exist within StorageOS in the `default` namespace.
        volumeName: redis-vol01
        fsType: ext4
For more information about StorageOS, dynamic provisioning, and PersistentVolumeClaims, see the StorageOS examples.
vsphereVolume
A vsphereVolume is used to mount a vSphere VMDK volume into your Pod. The contents of a volume are preserved when it is unmounted. It supports both VMFS and VSAN datastores.
Creating a VMDK volume
Choose one of the following methods to create a VMDK.
First ssh into ESX, then use the following command to create a VMDK:
vmkfstools -c 2G /vmfs/volumes/DatastoreName/volumes/myDisk.vmdk
Use the following command to create a VMDK:
vmware-vdiskmanager -c -t 0 -s 40GB -a lsilogic myDisk.vmdk
vSphere VMDK configuration example
apiVersion: v1
kind: Pod
metadata:
  name: test-vmdk
spec:
  containers:
    - image: k8s.gcr.io/test-webserver
      name: test-container
      volumeMounts:
        - mountPath: /test-vmdk
          name: test-volume
  volumes:
    - name: test-volume
      # This VMDK volume must already exist.
      vsphereVolume:
        volumePath: "[DatastoreName] volumes/myDisk"
        fsType: ext4
For more information, see the vSphere volume examples.
vSphere CSI migration
Kubernetes v1.19 [beta]
The CSIMigration feature for vsphereVolume, when enabled, redirects all plugin operations from the existing in-tree plugin to the csi.vsphere.vmware.com CSI driver. In order to use this feature, the vSphere CSI driver must be installed on the cluster and the CSIMigration and CSIMigrationvSphere feature gates must be enabled. This also requires the minimum vSphere vCenter/ESXi version to be 7.0u1 and the minimum HW version to be VM version 15.
The following StorageClass parameters from the built-in vsphereVolume plugin are not supported by the vSphere CSI driver:
- diskformat
- hostfailurestotolerate
- forceprovisioning
- cachereservation
- diskstripes
- objectspacereservation
- iopslimit
Existing volumes created using these parameters will be migrated to the vSphere CSI driver, but new volumes created by the vSphere CSI driver will not honor these parameters.
vSphere CSI migration complete
Kubernetes v1.19 [beta]
To turn off the vsphereVolume plugin from being loaded by the controller manager and the kubelet, you need to set the InTreePluginvSphereUnregister feature flag to true. You must install a csi.vsphere.vmware.com CSI driver on all worker nodes.
Portworx CSI migration
Kubernetes v1.23 [alpha]
The CSIMigration feature for Portworx has been added but is disabled by default in Kubernetes 1.23 since it's in alpha state. It redirects all plugin operations from the existing in-tree plugin to the pxd.portworx.com Container Storage Interface (CSI) Driver. The Portworx CSI Driver must be installed on the cluster. To enable the feature, set CSIMigrationPortworx=true in kube-controller-manager and kubelet.
Using subPath
Sometimes, it is useful to share one volume for multiple uses in a single pod. The volumeMounts.subPath property specifies a sub-path inside the referenced volume instead of its root.
The following example shows how to configure a Pod with a LAMP stack (Linux Apache MySQL PHP) using a single, shared volume. This sample subPath configuration is not recommended for production use.
The PHP application's code and assets map to the volume's html folder and the MySQL database is stored in the volume's mysql folder. For example:
apiVersion: v1
kind: Pod
metadata:
  name: my-lamp-site
spec:
  containers:
    - name: mysql
      image: mysql
      env:
        - name: MYSQL_ROOT_PASSWORD
          value: "rootpasswd"
      volumeMounts:
        - mountPath: /var/lib/mysql
          name: site-data
          subPath: mysql
    - name: php
      image: php:7.0-apache
      volumeMounts:
        - mountPath: /var/www/html
          name: site-data
          subPath: html
  volumes:
    - name: site-data
      persistentVolumeClaim:
        claimName: my-lamp-site-data
Using subPath with expanded environment variables
Kubernetes v1.17 [stable]
Use the subPathExpr field to construct subPath directory names from downward API environment variables. The subPath and subPathExpr properties are mutually exclusive.
In this example, a Pod uses subPathExpr to create a directory pod1 within the hostPath volume /var/log/pods. The hostPath volume takes the Pod name from the downwardAPI. The host directory /var/log/pods/pod1 is mounted at /logs in the container.
apiVersion: v1
kind: Pod
metadata:
  name: pod1
spec:
  containers:
    - name: container1
      env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
      image: busybox:1.28
      command: [ "sh", "-c", "while [ true ]; do echo 'Hello'; sleep 10; done | tee -a /logs/hello.txt" ]
      volumeMounts:
        - name: workdir1
          mountPath: /logs
          # The variable expansion uses round brackets (not curly brackets).
          subPathExpr: $(POD_NAME)
  restartPolicy: Never
  volumes:
    - name: workdir1
      hostPath:
        path: /var/log/pods
Resources
The storage media (such as Disk or SSD) of an emptyDir volume is determined by the medium of the filesystem holding the kubelet root dir (typically /var/lib/kubelet). There is no limit on how much space an emptyDir or hostPath volume can consume, and no isolation between containers or between pods.
To learn about requesting space using a resource specification, see how to manage resources.
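For example, here is a minimal sketch (names are illustrative) of a Pod that requests and limits the local ephemeral storage its containers may consume, which covers writes to emptyDir volumes:
apiVersion: v1
kind: Pod
metadata:
  name: ephemeral-storage-demo
spec:
  containers:
  - name: app
    image: busybox:1.28
    command: [ "sleep", "1000000" ]
    resources:
      requests:
        ephemeral-storage: "1Gi"   # scheduling request for local ephemeral storage
      limits:
        ephemeral-storage: "2Gi"   # the kubelet evicts the Pod if this is exceeded
    volumeMounts:
    - name: scratch
      mountPath: /scratch
  volumes:
  - name: scratch
    emptyDir: {}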
Out-of-tree volume plugins
The out-of-tree volume plugins include Container Storage Interface (CSI), and also FlexVolume (which is deprecated). These plugins enable storage vendors to create custom storage plugins without adding their plugin source code to the Kubernetes repository.
Previously, all volume plugins were "in-tree". The "in-tree" plugins were built, linked, compiled, and shipped with the core Kubernetes binaries. This meant that adding a new storage system to Kubernetes (a volume plugin) required checking code into the core Kubernetes code repository.
Both CSI and FlexVolume allow volume plugins to be developed independently of the Kubernetes code base, and deployed (installed) on Kubernetes clusters as extensions.
For storage vendors looking to create an out-of-tree volume plugin, please refer to the volume plugin FAQ.
csi
Container Storage Interface (CSI) defines a standard interface for container orchestration systems (like Kubernetes) to expose arbitrary storage systems to their container workloads.
Please read the CSI design proposal for more information.
Once a CSI compatible volume driver is deployed on a Kubernetes cluster, users may use the csi volume type to attach or mount the volumes exposed by the CSI driver.
A csi volume can be used in a Pod in three different ways:
- through a reference to a PersistentVolumeClaim
- with a generic ephemeral volume (alpha feature)
- with a CSI ephemeral volume if the driver supports that (beta feature)
The following fields are available to storage administrators to configure a CSI persistent volume:
- driver: A string value that specifies the name of the volume driver to use. This value must correspond to the value returned in the GetPluginInfoResponse by the CSI driver as defined in the CSI spec. It is used by Kubernetes to identify which CSI driver to call out to, and by CSI driver components to identify which PV objects belong to the CSI driver.
- volumeHandle: A string value that uniquely identifies the volume. This value must correspond to the value returned in the volume.id field of the CreateVolumeResponse by the CSI driver as defined in the CSI spec. The value is passed as volume_id on all calls to the CSI volume driver when referencing the volume.
- readOnly: An optional boolean value indicating whether the volume is to be "ControllerPublished" (attached) as read only. Default is false. This value is passed to the CSI driver via the readonly field in the ControllerPublishVolumeRequest.
- fsType: If the PV's VolumeMode is Filesystem then this field may be used to specify the filesystem that should be used to mount the volume. If the volume has not been formatted and formatting is supported, this value will be used to format the volume. This value is passed to the CSI driver via the VolumeCapability field of ControllerPublishVolumeRequest, NodeStageVolumeRequest, and NodePublishVolumeRequest.
- volumeAttributes: A map of string to string that specifies static properties of a volume. This map must correspond to the map returned in the volume.attributes field of the CreateVolumeResponse by the CSI driver as defined in the CSI spec. The map is passed to the CSI driver via the volume_context field in the ControllerPublishVolumeRequest, NodeStageVolumeRequest, and NodePublishVolumeRequest.
- controllerPublishSecretRef: A reference to the secret object containing sensitive information to pass to the CSI driver to complete the CSI ControllerPublishVolume and ControllerUnpublishVolume calls. This field is optional, and may be empty if no secret is required. If the Secret contains more than one secret, all secrets are passed.
- nodeStageSecretRef: A reference to the secret object containing sensitive information to pass to the CSI driver to complete the CSI NodeStageVolume call. This field is optional, and may be empty if no secret is required. If the Secret contains more than one secret, all secrets are passed.
- nodePublishSecretRef: A reference to the secret object containing sensitive information to pass to the CSI driver to complete the CSI NodePublishVolume call. This field is optional, and may be empty if no secret is required. If the secret object contains more than one secret, all secrets are passed.
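To make these fields concrete, here is a hedged sketch of a statically created CSI PersistentVolume; the driver name and volume handle are placeholders that a real driver would dictate:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: csi-example-pv
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: csi.example.vendor.com      # hypothetical driver name (from GetPluginInfoResponse)
    volumeHandle: vol-0123456789abcdef  # placeholder; the volume.id from CreateVolumeResponse
    fsType: ext4
    readOnly: false
    volumeAttributes:
      foo: bar                          # driver-specific, passed as volume_context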
CSI raw block volume support
Kubernetes v1.18 [stable]
Vendors with external CSI drivers can implement raw block volume support in Kubernetes workloads.
You can set up your PersistentVolume/PersistentVolumeClaim with raw block volume support as usual, without any CSI specific changes.
CSI ephemeral volumes
Kubernetes v1.16 [beta]
You can directly configure CSI volumes within the Pod specification. Volumes specified in this way are ephemeral and do not persist across pod restarts. See Ephemeral Volumes for more information.
For more information on how to develop a CSI driver, refer to the kubernetes-csi documentation.
Migrating to CSI drivers from in-tree plugins
Kubernetes v1.17 [beta]
The CSIMigration feature, when enabled, directs operations against existing in-tree plugins to corresponding CSI plugins (which are expected to be installed and configured). As a result, operators do not have to make any configuration changes to existing Storage Classes, PersistentVolumes or PersistentVolumeClaims (referring to in-tree plugins) when transitioning to a CSI driver that supersedes an in-tree plugin.
The operations and features that are supported include: provisioning/deleting, attaching/detaching, mounting/unmounting, and resizing of volumes.
In-tree plugins that support CSIMigration and have a corresponding CSI driver implemented are listed in Types of Volumes.
flexVolume
Kubernetes v1.23 [deprecated]
FlexVolume is an out-of-tree plugin interface that uses an exec-based model to interface with storage drivers. The FlexVolume driver binaries must be installed in a pre-defined volume plugin path on each node and in some cases the control plane nodes as well.
Pods interact with FlexVolume drivers through the flexVolume in-tree volume plugin.
For more details, see the FlexVolume README document.
FlexVolume is deprecated. Using an out-of-tree CSI driver is the recommended way to integrate external storage with Kubernetes.
Maintainers of FlexVolume drivers should implement a CSI driver and help to migrate users of FlexVolume drivers to CSI. Users of FlexVolume should move their workloads to use the equivalent CSI driver.
Mount propagation
Mount propagation allows for sharing volumes mounted by a container to other containers in the same pod, or even to other pods on the same node.
Mount propagation of a volume is controlled by the mountPropagation field in Container.volumeMounts. Its values are:
- None - This volume mount will not receive any subsequent mounts that are mounted to this volume or any of its subdirectories by the host. In similar fashion, no mounts created by the container will be visible on the host. This is the default mode. This mode is equal to private mount propagation as described in the Linux kernel documentation.
- HostToContainer - This volume mount will receive all subsequent mounts that are mounted to this volume or any of its subdirectories. In other words, if the host mounts anything inside the volume mount, the container will see it mounted there. Similarly, if any Pod with Bidirectional mount propagation to the same volume mounts anything there, the container with HostToContainer mount propagation will see it. This mode is equal to rslave mount propagation as described in the Linux kernel documentation (see the sketch after this list).
- Bidirectional - This volume mount behaves the same as the HostToContainer mount. In addition, all volume mounts created by the container will be propagated back to the host and to all containers of all pods that use the same volume. A typical use case for this mode is a Pod with a FlexVolume or CSI driver or a Pod that needs to mount something on the host using a hostPath volume. This mode is equal to rshared mount propagation as described in the Linux kernel documentation.
Warning: Bidirectional mount propagation can be dangerous. It can damage the host operating system and therefore it is allowed only in privileged containers. Familiarity with Linux kernel behavior is strongly recommended. In addition, any volume mounts created by containers in pods must be destroyed (unmounted) by the containers on termination.
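For illustration, a minimal sketch (names are illustrative) of a volume mount that opts into HostToContainer propagation:
apiVersion: v1
kind: Pod
metadata:
  name: mount-propagation-demo
spec:
  containers:
  - name: app
    image: busybox:1.28
    command: [ "sleep", "1000000" ]
    volumeMounts:
    - name: host-mnt
      mountPath: /mnt/host
      # Mounts made by the host under /mnt become visible in the container
      mountPropagation: HostToContainer
  volumes:
  - name: host-mnt
    hostPath:
      path: /mnt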
Configuration
Before mount propagation can work properly on some deployments (CoreOS, RedHat/CentOS, Ubuntu), mount share must be configured correctly in Docker as shown below.
Edit your Docker's systemd service file. Set MountFlags as follows:
MountFlags=shared
Or, remove MountFlags=slave if present. Then restart the Docker daemon:
sudo systemctl daemon-reload
sudo systemctl restart docker
What's next
Follow an example of deploying WordPress and MySQL with Persistent Volumes.
3.6.2 - Persistent Volumes
This document describes persistent volumes in Kubernetes. Familiarity with volumes is suggested.
Introduction
Managing storage is a distinct problem from managing compute instances. The PersistentVolume subsystem provides an API for users and administrators that abstracts details of how storage is provided from how it is consumed. To do this, we introduce two new API resources: PersistentVolume and PersistentVolumeClaim.
A PersistentVolume (PV) is a piece of storage in the cluster that has been provisioned by an administrator or dynamically provisioned using Storage Classes. It is a resource in the cluster just like a node is a cluster resource. PVs are volume plugins like Volumes, but have a lifecycle independent of any individual Pod that uses the PV. This API object captures the details of the implementation of the storage, be that NFS, iSCSI, or a cloud-provider-specific storage system.
A PersistentVolumeClaim (PVC) is a request for storage by a user. It is similar to a Pod. Pods consume node resources and PVCs consume PV resources. Pods can request specific levels of resources (CPU and Memory). Claims can request specific size and access modes (e.g., they can be mounted ReadWriteOnce, ReadOnlyMany or ReadWriteMany, see AccessModes).
While PersistentVolumeClaims allow a user to consume abstract storage resources, it is common that users need PersistentVolumes with varying properties, such as performance, for different problems. Cluster administrators need to be able to offer a variety of PersistentVolumes that differ in more ways than size and access modes, without exposing users to the details of how those volumes are implemented. For these needs, there is the StorageClass resource.
See the detailed walkthrough with working examples.
Lifecycle of a volume and claim
PVs are resources in the cluster. PVCs are requests for those resources and also act as claim checks to the resource. The interaction between PVs and PVCs follows this lifecycle:
Provisioning
There are two ways PVs may be provisioned: statically or dynamically.
Static
A cluster administrator creates a number of PVs. They carry the details of the real storage, which is available for use by cluster users. They exist in the Kubernetes API and are available for consumption.
Dynamic
When none of the static PVs the administrator created match a user's PersistentVolumeClaim, the cluster may try to dynamically provision a volume specially for the PVC. This provisioning is based on StorageClasses: the PVC must request a storage class and the administrator must have created and configured that class for dynamic provisioning to occur. Claims that request the class "" effectively disable dynamic provisioning for themselves.
To enable dynamic storage provisioning based on storage class, the cluster administrator needs to enable the DefaultStorageClass admission controller on the API server. This can be done, for example, by ensuring that DefaultStorageClass is among the comma-delimited, ordered list of values for the --enable-admission-plugins flag of the API server component. For more information on API server command-line flags, check the kube-apiserver documentation.
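On a kubeadm-style cluster, for example, the API server runs as a static Pod and the flag appears in its manifest. The fragment below is a hedged sketch; the exact file path and the rest of the flag list vary by installation:
# Fragment of /etc/kubernetes/manifests/kube-apiserver.yaml (illustrative)
spec:
  containers:
  - name: kube-apiserver
    command:
    - kube-apiserver
    # DefaultStorageClass included among the enabled admission plugins
    - --enable-admission-plugins=NodeRestriction,DefaultStorageClass
    # ... other flags omitted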
Binding
A user creates, or in the case of dynamic provisioning, has already created, a PersistentVolumeClaim with a specific amount of storage requested and with certain access modes. A control loop in the master watches for new PVCs, finds a matching PV (if possible), and binds them together. If a PV was dynamically provisioned for a new PVC, the loop will always bind that PV to the PVC. Otherwise, the user will always get at least what they asked for, but the volume may be in excess of what was requested. Once bound, PersistentVolumeClaim binds are exclusive, regardless of how they were bound. A PVC to PV binding is a one-to-one mapping, using a ClaimRef which is a bi-directional binding between the PersistentVolume and the PersistentVolumeClaim.
Claims will remain unbound indefinitely if a matching volume does not exist. Claims will be bound as matching volumes become available. For example, a cluster provisioned with many 50Gi PVs would not match a PVC requesting 100Gi. The PVC can be bound when a 100Gi PV is added to the cluster.
Using
Pods use claims as volumes. The cluster inspects the claim to find the bound volume and mounts that volume for a Pod. For volumes that support multiple access modes, the user specifies which mode is desired when using their claim as a volume in a Pod.
Once a user has a claim and that claim is bound, the bound PV belongs to the user for as long as they need it. Users schedule Pods and access their claimed PVs by including a persistentVolumeClaim section in a Pod's volumes block. See Claims As Volumes for more details on this.
Storage Object in Use Protection
The purpose of the Storage Object in Use Protection feature is to ensure that PersistentVolumeClaims (PVCs) in active use by a Pod and PersistentVolume (PVs) that are bound to PVCs are not removed from the system, as this may result in data loss.
If a user deletes a PVC in active use by a Pod, the PVC is not removed immediately. PVC removal is postponed until the PVC is no longer actively used by any Pods. Also, if an admin deletes a PV that is bound to a PVC, the PV is not removed immediately. PV removal is postponed until the PV is no longer bound to a PVC.
You can see that a PVC is protected when the PVC's status is Terminating and the Finalizers list includes kubernetes.io/pvc-protection:
kubectl describe pvc hostpath
Name: hostpath
Namespace: default
StorageClass: example-hostpath
Status: Terminating
Volume:
Labels: <none>
Annotations: volume.beta.kubernetes.io/storage-class=example-hostpath
volume.beta.kubernetes.io/storage-provisioner=example.com/hostpath
Finalizers: [kubernetes.io/pvc-protection]
...
You can see that a PV is protected when the PV's status is Terminating and the Finalizers list includes kubernetes.io/pv-protection too:
kubectl describe pv task-pv-volume
Name: task-pv-volume
Labels: type=local
Annotations: <none>
Finalizers: [kubernetes.io/pv-protection]
StorageClass: standard
Status: Terminating
Claim:
Reclaim Policy: Delete
Access Modes: RWO
Capacity: 1Gi
Message:
Source:
Type: HostPath (bare host directory volume)
Path: /tmp/data
HostPathType:
Events: <none>
Reclaiming
When a user is done with their volume, they can delete the PVC objects from the API, which allows reclamation of the resource. The reclaim policy for a PersistentVolume tells the cluster what to do with the volume after it has been released of its claim. Currently, volumes can either be Retained, Recycled, or Deleted.
Retain
The Retain reclaim policy allows for manual reclamation of the resource. When the PersistentVolumeClaim is deleted, the PersistentVolume still exists and the volume is considered "released". But it is not yet available for another claim because the previous claimant's data remains on the volume. An administrator can manually reclaim the volume with the following steps.
- Delete the PersistentVolume. The associated storage asset in external infrastructure (such as an AWS EBS, GCE PD, Azure Disk, or Cinder volume) still exists after the PV is deleted.
- Manually clean up the data on the associated storage asset accordingly.
- Manually delete the associated storage asset.
If you want to reuse the same storage asset, create a new PersistentVolume with the same storage asset definition.
Delete
For volume plugins that support the Delete reclaim policy, deletion removes both the PersistentVolume object from Kubernetes, as well as the associated storage asset in the external infrastructure, such as an AWS EBS, GCE PD, Azure Disk, or Cinder volume. Volumes that were dynamically provisioned inherit the reclaim policy of their StorageClass, which defaults to Delete. The administrator should configure the StorageClass according to users' expectations; otherwise, the PV must be edited or patched after it is created. See Change the Reclaim Policy of a PersistentVolume.
Recycle
The Recycle reclaim policy is deprecated. Instead, the recommended approach is to use dynamic provisioning.
If supported by the underlying volume plugin, the Recycle reclaim policy performs a basic scrub (rm -rf /thevolume/*) on the volume and makes it available again for a new claim.
However, an administrator can configure a custom recycler Pod template using the Kubernetes controller manager command line arguments as described in the reference. The custom recycler Pod template must contain a volumes specification, as shown in the example below:
apiVersion: v1
kind: Pod
metadata:
name: pv-recycler
namespace: default
spec:
restartPolicy: Never
volumes:
- name: vol
hostPath:
path: /any/path/it/will/be/replaced
containers:
- name: pv-recycler
image: "k8s.gcr.io/busybox"
command: ["/bin/sh", "-c", "test -e /scrub && rm -rf /scrub/..?* /scrub/.[!.]* /scrub/* && test -z \"$(ls -A /scrub)\" || exit 1"]
volumeMounts:
- name: vol
mountPath: /scrub
Note that the particular path specified in the custom recycler Pod template in the volumes part is replaced with the particular path of the volume that is being recycled.
Reserving a PersistentVolume
The control plane can bind PersistentVolumeClaims to matching PersistentVolumes in the cluster. However, if you want a PVC to bind to a specific PV, you need to pre-bind them.
By specifying a PersistentVolume in a PersistentVolumeClaim, you declare a binding between that specific PV and PVC.
If the PersistentVolume exists and has not reserved PersistentVolumeClaims through its claimRef field, then the PersistentVolume and PersistentVolumeClaim will be bound. The binding happens regardless of some volume matching criteria, including node affinity. The control plane still checks that storage class, access modes, and requested storage size are valid.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: foo-pvc
namespace: foo
spec:
storageClassName: "" # Empty string must be explicitly set otherwise default StorageClass will be set
volumeName: foo-pv
...
This method does not guarantee any binding privileges to the PersistentVolume. If other PersistentVolumeClaims could use the PV that you specify, you first need to reserve that storage volume. Specify the relevant PersistentVolumeClaim in the claimRef field of the PV so that other PVCs can not bind to it.
apiVersion: v1
kind: PersistentVolume
metadata:
name: foo-pv
spec:
storageClassName: ""
claimRef:
name: foo-pvc
namespace: foo
...
This is useful if you want to consume PersistentVolumes that have their claimPolicy set to Retain, including cases where you are reusing an existing PV.
Expanding Persistent Volumes Claims
Kubernetes v1.11 [beta]
Support for expanding PersistentVolumeClaims (PVCs) is enabled by default. You can expand the following types of volumes:
- azureDisk
- azureFile
- awsElasticBlockStore
- cinder (deprecated)
- csi
- flexVolume (deprecated)
- gcePersistentDisk
- glusterfs
- rbd
- portworxVolume
You can only expand a PVC if its storage class's allowVolumeExpansion field is set to true.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: gluster-vol-default
provisioner: kubernetes.io/glusterfs
parameters:
resturl: "http://192.168.10.100:8080"
restuser: ""
secretNamespace: ""
secretName: ""
allowVolumeExpansion: true
To request a larger volume for a PVC, edit the PVC object and specify a larger size. This triggers expansion of the volume that backs the underlying PersistentVolume. A new PersistentVolume is never created to satisfy the claim. Instead, an existing volume is resized.
If you edit the capacity of a PersistentVolume, and then edit the .spec of a matching PersistentVolumeClaim to make the size of the PersistentVolumeClaim match the PersistentVolume, then no storage resize happens. The Kubernetes control plane will see that the desired state of both resources matches, and conclude that the backing volume size has been manually increased and that no resize is necessary.
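As a sketch, expanding a claim means raising the request in place; the name and sizes here are illustrative:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: resizable-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 16Gi   # was 8Gi; raising this triggers expansion of the backing volume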
CSI Volume expansion
Kubernetes v1.16 [beta]
Support for expanding CSI volumes is enabled by default but it also requires a specific CSI driver to support volume expansion. Refer to documentation of the specific CSI driver for more information.
Resizing a volume containing a file system
You can only resize volumes containing a file system if the file system is XFS, Ext3, or Ext4.
When a volume contains a file system, the file system is only resized when a new Pod is using the PersistentVolumeClaim in ReadWrite mode. File system expansion is either done when a Pod is starting up or when a Pod is running and the underlying file system supports online expansion.
FlexVolume volumes (deprecated since Kubernetes v1.23) allow resizing if the driver is configured with the RequiresFSResize capability set to true. The FlexVolume can be resized on Pod restart.
Resizing an in-use PersistentVolumeClaim
Kubernetes v1.15 [beta]
The ExpandInUsePersistentVolumes feature must be enabled, which is the case automatically for many clusters since it is a beta feature. Refer to the feature gate documentation for more information.
In this case, you don't need to delete and recreate a Pod or deployment that is using an existing PVC. Any in-use PVC automatically becomes available to its Pod as soon as its file system has been expanded. This feature has no effect on PVCs that are not in use by a Pod or deployment. You must create a Pod that uses the PVC before the expansion can complete.
Similar to other volume types, FlexVolume volumes can also be expanded when in use by a Pod.
Recovering from Failure when Expanding Volumes
If a user specifies a new size that is too big to be satisfied by the underlying storage system, expansion of the PVC will be continuously retried until the user or cluster administrator takes some action. This can be undesirable, and hence Kubernetes provides the following methods of recovering from such failures.
If expanding underlying storage fails, the cluster administrator can manually recover the Persistent Volume Claim (PVC) state and cancel the resize requests. Otherwise, the resize requests are continuously retried by the controller without administrator intervention.
- Mark the PersistentVolume (PV) that is bound to the PersistentVolumeClaim (PVC) with the Retain reclaim policy.
- Delete the PVC. Since the PV has the Retain reclaim policy, no data is lost when the PVC is recreated.
- Delete the claimRef entry from the PV spec, so that a new PVC can bind to it. This should make the PV Available.
- Re-create the PVC with a smaller size than the PV and set the volumeName field of the PVC to the name of the PV. This should bind the new PVC to the existing PV.
- Don't forget to restore the reclaim policy of the PV.
Kubernetes v1.23 [alpha]
The RecoverVolumeExpansionFailure feature must be enabled for this feature to work. Refer to the feature gate documentation for more information.
If the feature gates ExpandPersistentVolumes and RecoverVolumeExpansionFailure are both enabled in your cluster, and expansion has failed for a PVC, you can retry expansion with a smaller size than the previously requested value. To request a new expansion attempt with a smaller proposed size, edit .spec.resources for that PVC and choose a value that is less than the value you previously tried.
This is useful if expansion to a higher value did not succeed because of a capacity constraint. If that has happened, or you suspect that it might have, you can retry expansion by specifying a size that is within the capacity limits of the underlying storage provider. You can monitor the status of the resize operation by watching .status.resizeStatus and events on the PVC.
Note that, although you can specify a lower amount of storage than what was requested previously, the new value must still be higher than .status.capacity. Kubernetes does not support shrinking a PVC to less than its current size.
Types of Persistent Volumes
PersistentVolume types are implemented as plugins. Kubernetes currently supports the following plugins:
- awsElasticBlockStore - AWS Elastic Block Store (EBS)
- azureDisk - Azure Disk
- azureFile - Azure File
- cephfs - CephFS volume
- csi - Container Storage Interface (CSI)
- fc - Fibre Channel (FC) storage
- gcePersistentDisk - GCE Persistent Disk
- glusterfs - Glusterfs volume
- hostPath - HostPath volume (for single node testing only; WILL NOT WORK in a multi-node cluster; consider using local volume instead)
- iscsi - iSCSI (SCSI over IP) storage
- local - local storage devices mounted on nodes
- nfs - Network File System (NFS) storage
- portworxVolume - Portworx volume
- rbd - Rados Block Device (RBD) volume
- vsphereVolume - vSphere VMDK volume
The following types of PersistentVolume are deprecated. This means that support is still available but will be removed in a future Kubernetes release.
- cinder - Cinder (OpenStack block storage) (deprecated in v1.18)
- flexVolume - FlexVolume (deprecated in v1.23)
- flocker - Flocker storage (deprecated in v1.22)
- quobyte - Quobyte volume (deprecated in v1.22)
- storageos - StorageOS volume (deprecated in v1.22)
Older versions of Kubernetes also supported the following in-tree PersistentVolume types:
- photonPersistentDisk - Photon controller persistent disk (not available after v1.15)
- scaleIO - ScaleIO volume (not available after v1.21)
Persistent Volumes
Each PV contains a spec and status, which is the specification and status of the volume. The name of a PersistentVolume object must be a valid DNS subdomain name.
apiVersion: v1
kind: PersistentVolume
metadata:
name: pv0003
spec:
capacity:
storage: 5Gi
volumeMode: Filesystem
accessModes:
- ReadWriteOnce
persistentVolumeReclaimPolicy: Recycle
storageClassName: slow
mountOptions:
- hard
- nfsvers=4.1
nfs:
path: /tmp
server: 172.17.0.2
Capacity
Generally, a PV will have a specific storage capacity. This is set using the PV's capacity attribute. Read the glossary term Quantity to understand the units expected by capacity.
Currently, storage size is the only resource that can be set or requested. Future attributes may include IOPS, throughput, etc.
Volume Mode
Kubernetes v1.18 [stable]
Kubernetes supports two volumeModes of PersistentVolumes: Filesystem and Block.
volumeMode is an optional API parameter. Filesystem is the default mode used when the volumeMode parameter is omitted.
A volume with volumeMode: Filesystem is mounted into Pods into a directory. If the volume is backed by a block device and the device is empty, Kubernetes creates a filesystem on the device before mounting it for the first time.
You can set the value of volumeMode to Block to use a volume as a raw block device. Such a volume is presented into a Pod as a block device, without any filesystem on it. This mode is useful to provide a Pod the fastest possible way to access a volume, without any filesystem layer between the Pod and the volume. On the other hand, the application running in the Pod must know how to handle a raw block device.
See Raw Block Volume Support for an example on how to use a volume with volumeMode: Block in a Pod.
Access Modes
A PersistentVolume can be mounted on a host in any way supported by the resource provider. As shown in the table below, providers will have different capabilities and each PV's access modes are set to the specific modes supported by that particular volume. For example, NFS can support multiple read/write clients, but a specific NFS PV might be exported on the server as read-only. Each PV gets its own set of access modes describing that specific PV's capabilities.
The access modes are:
- ReadWriteOnce - the volume can be mounted as read-write by a single node. The ReadWriteOnce access mode can still allow multiple pods to access the volume when the pods are running on the same node.
- ReadOnlyMany - the volume can be mounted as read-only by many nodes.
- ReadWriteMany - the volume can be mounted as read-write by many nodes.
- ReadWriteOncePod - the volume can be mounted as read-write by a single Pod. Use the ReadWriteOncePod access mode if you want to ensure that only one pod across the whole cluster can read that PVC or write to it. This is only supported for CSI volumes and Kubernetes version 1.22+.
The blog article Introducing Single Pod Access Mode for PersistentVolumes covers this in more detail.
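A minimal sketch of a claim that uses this mode (the name is illustrative; the backing driver must be CSI):
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: single-writer-claim
spec:
  accessModes:
  - ReadWriteOncePod   # only one Pod in the whole cluster may use the volume read-write
  resources:
    requests:
      storage: 1Gi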
In the CLI, the access modes are abbreviated to:
- RWO - ReadWriteOnce
- ROX - ReadOnlyMany
- RWX - ReadWriteMany
- RWOP - ReadWriteOncePod
Important! A volume can only be mounted using one access mode at a time, even if it supports many. For example, a GCEPersistentDisk can be mounted as ReadWriteOnce by a single node or ReadOnlyMany by many nodes, but not at the same time.
Volume Plugin | ReadWriteOnce | ReadOnlyMany | ReadWriteMany | ReadWriteOncePod |
---|---|---|---|---|
AWSElasticBlockStore | ✓ | - | - | - |
AzureFile | ✓ | ✓ | ✓ | - |
AzureDisk | ✓ | - | - | - |
CephFS | ✓ | ✓ | ✓ | - |
Cinder | ✓ | - | - | - |
CSI | depends on the driver | depends on the driver | depends on the driver | depends on the driver |
FC | ✓ | ✓ | - | - |
FlexVolume | ✓ | ✓ | depends on the driver | - |
Flocker | ✓ | - | - | - |
GCEPersistentDisk | ✓ | ✓ | - | - |
Glusterfs | ✓ | ✓ | ✓ | - |
HostPath | ✓ | - | - | - |
iSCSI | ✓ | ✓ | - | - |
Quobyte | ✓ | ✓ | ✓ | - |
NFS | ✓ | ✓ | ✓ | - |
RBD | ✓ | ✓ | - | - |
VsphereVolume | ✓ | - | - (works when Pods are collocated) | - |
PortworxVolume | ✓ | - | ✓ | - |
StorageOS | ✓ | - | - | - |
Class
A PV can have a class, which is specified by setting the storageClassName attribute to the name of a StorageClass. A PV of a particular class can only be bound to PVCs requesting that class. A PV with no storageClassName has no class and can only be bound to PVCs that request no particular class.
In the past, the annotation volume.beta.kubernetes.io/storage-class was used instead of the storageClassName attribute. This annotation is still working; however, it will become fully deprecated in a future Kubernetes release.
Reclaim Policy
Current reclaim policies are:
- Retain -- manual reclamation
- Recycle -- basic scrub (rm -rf /thevolume/*)
- Delete -- associated storage asset such as AWS EBS, GCE PD, Azure Disk, or OpenStack Cinder volume is deleted
Currently, only NFS and HostPath support recycling. AWS EBS, GCE PD, Azure Disk, and Cinder volumes support deletion.
Mount Options
A Kubernetes administrator can specify additional mount options for when a Persistent Volume is mounted on a node.
The following volume types support mount options:
- awsElasticBlockStore
- azureDisk
- azureFile
- cephfs
- cinder (deprecated in v1.18)
- gcePersistentDisk
- glusterfs
- iscsi
- nfs
- quobyte (deprecated in v1.22)
- rbd
- storageos (deprecated in v1.22)
- vsphereVolume
Mount options are not validated. If a mount option is invalid, the mount fails.
In the past, the annotation volume.beta.kubernetes.io/mount-options was used instead of the mountOptions attribute. This annotation is still working; however, it will become fully deprecated in a future Kubernetes release.
Node Affinity
A PV can specify node affinity to define constraints that limit what nodes this volume can be accessed from. Pods that use a PV will only be scheduled to nodes that are selected by the node affinity. To specify node affinity, set nodeAffinity in the .spec of a PV. The PersistentVolume API reference has more details on this field.
Phase
A volume will be in one of the following phases:
- Available -- a free resource that is not yet bound to a claim
- Bound -- the volume is bound to a claim
- Released -- the claim has been deleted, but the resource is not yet reclaimed by the cluster
- Failed -- the volume has failed its automatic reclamation
The CLI will show the name of the PVC bound to the PV.
PersistentVolumeClaims
Each PVC contains a spec and status, which is the specification and status of the claim. The name of a PersistentVolumeClaim object must be a valid DNS subdomain name.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: myclaim
spec:
accessModes:
- ReadWriteOnce
volumeMode: Filesystem
resources:
requests:
storage: 8Gi
storageClassName: slow
selector:
matchLabels:
release: "stable"
matchExpressions:
- {key: environment, operator: In, values: [dev]}
Access Modes
Claims use the same conventions as volumes when requesting storage with specific access modes.
Volume Modes
Claims use the same convention as volumes to indicate the consumption of the volume as either a filesystem or block device.
Resources
Claims, like Pods, can request specific quantities of a resource. In this case, the request is for storage. The same resource model applies to both volumes and claims.
Selector
Claims can specify a label selector to further filter the set of volumes. Only the volumes whose labels match the selector can be bound to the claim. The selector can consist of two fields:
- matchLabels - the volume must have a label with this value
- matchExpressions - a list of requirements made by specifying key, list of values, and operator that relates the key and values. Valid operators include In, NotIn, Exists, and DoesNotExist.
All of the requirements, from both matchLabels and matchExpressions, are ANDed together – they must all be satisfied in order to match.
Class
A claim can request a particular class by specifying the name of a StorageClass using the attribute storageClassName. Only PVs of the requested class, ones with the same storageClassName as the PVC, can be bound to the PVC.
PVCs don't necessarily have to request a class. A PVC with its storageClassName set equal to "" is always interpreted to be requesting a PV with no class, so it can only be bound to PVs with no class (no annotation or one set equal to ""). A PVC with no storageClassName is not quite the same and is treated differently by the cluster, depending on whether the DefaultStorageClass admission plugin is turned on.
- If the admission plugin is turned on, the administrator may specify a default StorageClass. All PVCs that have no storageClassName can be bound only to PVs of that default. Specifying a default StorageClass is done by setting the annotation storageclass.kubernetes.io/is-default-class equal to true in a StorageClass object (a sketch of this follows the list). If the administrator does not specify a default, the cluster responds to PVC creation as if the admission plugin were turned off. If more than one default is specified, the admission plugin forbids the creation of all PVCs.
- If the admission plugin is turned off, there is no notion of a default StorageClass. All PVCs that have no storageClassName can be bound only to PVs that have no class. In this case, the PVCs that have no storageClassName are treated the same way as PVCs that have their storageClassName set to "".
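For example, a hedged sketch of marking a StorageClass as the cluster default (the name and provisioner are illustrative):
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard
  annotations:
    # Marks this class as the default for PVCs that set no storageClassName
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-standard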
Depending on the installation method, a default StorageClass may be deployed to a Kubernetes cluster by the addon manager during installation.
When a PVC specifies a selector in addition to requesting a StorageClass, the requirements are ANDed together: only a PV of the requested class and with the requested labels may be bound to the PVC.
Currently, a PVC with a non-empty selector can't have a PV dynamically provisioned for it.
In the past, the annotation volume.beta.kubernetes.io/storage-class was used instead of the storageClassName attribute. This annotation is still working; however, it won't be supported in a future Kubernetes release.
Claims As Volumes
Pods access storage by using the claim as a volume. Claims must exist in the same namespace as the Pod using the claim. The cluster finds the claim in the Pod's namespace and uses it to get the PersistentVolume backing the claim. The volume is then mounted to the host and into the Pod.
apiVersion: v1
kind: Pod
metadata:
name: mypod
spec:
containers:
- name: myfrontend
image: nginx
volumeMounts:
- mountPath: "/var/www/html"
name: mypd
volumes:
- name: mypd
persistentVolumeClaim:
claimName: myclaim
A Note on Namespaces
PersistentVolume binds are exclusive, and since PersistentVolumeClaims are namespaced objects, mounting claims with "Many" modes (ROX, RWX) is only possible within one namespace.
PersistentVolumes typed hostPath
A hostPath PersistentVolume uses a file or directory on the Node to emulate network-attached storage. See an example of hostPath typed volume.
Raw Block Volume Support
Kubernetes v1.18 [stable]
The following volume plugins support raw block volumes, including dynamic provisioning where applicable:
- AWSElasticBlockStore
- AzureDisk
- CSI
- FC (Fibre Channel)
- GCEPersistentDisk
- iSCSI
- Local volume
- OpenStack Cinder
- RBD (Ceph Block Device)
- VsphereVolume
PersistentVolume using a Raw Block Volume
apiVersion: v1
kind: PersistentVolume
metadata:
name: block-pv
spec:
capacity:
storage: 10Gi
accessModes:
- ReadWriteOnce
volumeMode: Block
persistentVolumeReclaimPolicy: Retain
fc:
targetWWNs: ["50060e801049cfd1"]
lun: 0
readOnly: false
PersistentVolumeClaim requesting a Raw Block Volume
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: block-pvc
spec:
accessModes:
- ReadWriteOnce
volumeMode: Block
resources:
requests:
storage: 10Gi
Pod specification adding Raw Block Device path in container
apiVersion: v1
kind: Pod
metadata:
name: pod-with-block-volume
spec:
containers:
- name: fc-container
image: fedora:26
command: ["/bin/sh", "-c"]
args: [ "tail -f /dev/null" ]
volumeDevices:
- name: data
devicePath: /dev/xvda
volumes:
- name: data
persistentVolumeClaim:
claimName: block-pvc
Binding Block Volumes
If a user requests a raw block volume by indicating this using the volumeMode field in the PersistentVolumeClaim spec, the binding rules differ slightly from previous releases that didn't consider this mode as part of the spec. Below is a table of possible combinations the user and admin might specify for requesting a raw block device. The table indicates if the volume will be bound or not given the combinations:
Volume binding matrix for statically provisioned volumes:
PV volumeMode | PVC volumeMode | Result |
---|---|---|
unspecified | unspecified | BIND |
unspecified | Block | NO BIND |
unspecified | Filesystem | BIND |
Block | unspecified | NO BIND |
Block | Block | BIND |
Block | Filesystem | NO BIND |
Filesystem | Filesystem | BIND |
Filesystem | Block | NO BIND |
Filesystem | unspecified | BIND |
Volume Snapshot and Restore Volume from Snapshot Support
Kubernetes v1.20 [stable]
Volume snapshots only support the out-of-tree CSI volume plugins. For details, see Volume Snapshots. In-tree volume plugins are deprecated. You can read about the deprecated volume plugins in the Volume Plugin FAQ.
Create a PersistentVolumeClaim from a Volume Snapshot
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: restore-pvc
spec:
storageClassName: csi-hostpath-sc
dataSource:
name: new-snapshot-test
kind: VolumeSnapshot
apiGroup: snapshot.storage.k8s.io
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 10Gi
Volume Cloning
Volume cloning is only available for CSI volume plugins.
Create PersistentVolumeClaim from an existing PVC
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: cloned-pvc
spec:
storageClassName: my-csi-plugin
dataSource:
name: existing-src-pvc-name
kind: PersistentVolumeClaim
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 10Gi
Volume populators and data sources
Kubernetes v1.22 [alpha]
Kubernetes supports custom volume populators; this alpha feature was introduced
in Kubernetes 1.18. Kubernetes 1.22 reimplemented the mechanism with a redesigned API.
Check that you are reading the version of the Kubernetes documentation that matches your cluster. To check the version, enter kubectl version.
To use custom volume populators, you must enable the AnyVolumeDataSource feature gate for the kube-apiserver and kube-controller-manager.
Volume populators take advantage of a PVC spec field called dataSourceRef. Unlike the dataSource field, which can only contain either a reference to another PersistentVolumeClaim or to a VolumeSnapshot, the dataSourceRef field can contain a reference to any object in the same namespace, except for core objects other than PVCs. For clusters that have the feature gate enabled, use of the dataSourceRef is preferred over dataSource.
Data source references
The dataSourceRef field behaves almost the same as the dataSource field. If either one is specified while the other is not, the API server will give both fields the same value. Neither field can be changed after creation, and attempting to specify different values for the two fields will result in a validation error. Therefore the two fields will always have the same contents.
There are two differences between the dataSourceRef field and the dataSource field that users should be aware of:
- The dataSource field ignores invalid values (as if the field was blank) while the dataSourceRef field never ignores values and will cause an error if an invalid value is used. Invalid values are any core object (objects with no apiGroup) except for PVCs.
- The dataSourceRef field may contain different types of objects, while the dataSource field only allows PVCs and VolumeSnapshots.
Users should always use dataSourceRef on clusters that have the feature gate enabled, and fall back to dataSource on clusters that do not. It is not necessary to look at both fields under any circumstance. The duplicated values with slightly different semantics exist only for backwards compatibility. In particular, a mixture of older and newer controllers are able to interoperate because the fields are the same.
Using volume populators
Volume populators are controllers that can create non-empty volumes, where the contents of the volume are determined by a Custom Resource. Users create a populated volume by referring to a Custom Resource using the dataSourceRef field:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: populated-pvc
spec:
dataSourceRef:
name: example-name
kind: ExampleDataSource
apiGroup: example.storage.k8s.io
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 10Gi
Because volume populators are external components, attempts to create a PVC that uses one can fail if not all the correct components are installed. External controllers should generate events on the PVC to provide feedback on the status of the creation, including warnings if the PVC cannot be created due to some missing component.
You can install the alpha volume data source validator controller into your cluster. That controller generates warning Events on a PVC in the case that no populator is registered to handle that kind of data source. When a suitable populator is installed for a PVC, it's the responsibility of that populator controller to report Events that relate to volume creation and issues during the process.
Writing Portable Configuration
If you're writing configuration templates or examples that run on a wide range of clusters and need persistent storage, it is recommended that you use the following pattern:
- Include PersistentVolumeClaim objects in your bundle of config (alongside Deployments, ConfigMaps, etc).
- Do not include PersistentVolume objects in the config, since the user instantiating the config may not have permission to create PersistentVolumes.
- Give the user the option of providing a storage class name when instantiating the template (a sketch of such a claim follows this list).
  - If the user provides a storage class name, put that value into the persistentVolumeClaim.storageClassName field. This will cause the PVC to match the right storage class if the cluster has StorageClasses enabled by the admin.
  - If the user does not provide a storage class name, leave the persistentVolumeClaim.storageClassName field as nil. This will cause a PV to be automatically provisioned for the user with the default StorageClass in the cluster. Many cluster environments have a default StorageClass installed, or administrators can create their own default StorageClass.
- In your tooling, watch for PVCs that are not getting bound after some time and surface this to the user, as this may indicate that the cluster has no dynamic storage support (in which case the user should create a matching PV) or the cluster has no storage system (in which case the user cannot deploy config requiring PVCs).
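A minimal sketch of such a portable claim (names are illustrative):
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 8Gi
  # storageClassName is intentionally omitted: the cluster's default
  # StorageClass (if any) is used. Set it only when the user supplies
  # a class name when instantiating the template.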
What's next
- Learn more about Creating a PersistentVolume.
- Learn more about Creating a PersistentVolumeClaim.
- Read the Persistent Storage design document.
API references
Read about the APIs described in this page:
- PersistentVolume
- PersistentVolumeClaim
3.6.3 - Projected Volumes
This document describes projected volumes in Kubernetes. Familiarity with volumes is suggested.
Introduction
A projected volume maps several existing volume sources into the same directory.
Currently, the following types of volume sources can be projected:
- secret
- downwardAPI
- configMap
- serviceAccountToken
All sources are required to be in the same namespace as the Pod. For more details, see the all-in-one volume design document.
Example configuration with a secret, a downwardAPI, and a configMap
apiVersion: v1
kind: Pod
metadata:
name: volume-test
spec:
containers:
- name: container-test
image: busybox:1.28
volumeMounts:
- name: all-in-one
mountPath: "/projected-volume"
readOnly: true
volumes:
- name: all-in-one
projected:
sources:
- secret:
name: mysecret
items:
- key: username
path: my-group/my-username
- downwardAPI:
items:
- path: "labels"
fieldRef:
fieldPath: metadata.labels
- path: "cpu_limit"
resourceFieldRef:
containerName: container-test
resource: limits.cpu
- configMap:
name: myconfigmap
items:
- key: config
path: my-group/my-config
Example configuration: secrets with a non-default permission mode set
apiVersion: v1
kind: Pod
metadata:
name: volume-test
spec:
containers:
- name: container-test
image: busybox:1.28
volumeMounts:
- name: all-in-one
mountPath: "/projected-volume"
readOnly: true
volumes:
- name: all-in-one
projected:
sources:
- secret:
name: mysecret
items:
- key: username
path: my-group/my-username
- secret:
name: mysecret2
items:
- key: password
path: my-group/my-password
mode: 511
Each projected volume source is listed in the spec under sources. The parameters are nearly the same with two exceptions:
- For secrets, the secretName field has been changed to name to be consistent with ConfigMap naming.
- The defaultMode can only be specified at the projected level and not for each volume source. However, as illustrated above, you can explicitly set the mode for each individual projection.
serviceAccountToken projected volumes
When the TokenRequestProjection feature is enabled, you can inject the token for the current service account into a Pod at a specified path. For example:
apiVersion: v1
kind: Pod
metadata:
name: sa-token-test
spec:
containers:
- name: container-test
image: busybox:1.28
volumeMounts:
- name: token-vol
mountPath: "/service-account"
readOnly: true
serviceAccountName: default
volumes:
- name: token-vol
projected:
sources:
- serviceAccountToken:
audience: api
expirationSeconds: 3600
path: token
The example Pod has a projected volume containing the injected service account
token. Containers in this Pod can use that token to access the Kubernetes API
server, authenticating with the identity of the pod's ServiceAccount.
The audience field contains the intended audience of the token. A recipient of the token must identify itself with an identifier specified in the audience of the token, and otherwise should reject the token. This field is optional and it defaults to the identifier of the API server.
The expirationSeconds is the expected duration of validity of the service account token. It defaults to 1 hour and must be at least 10 minutes (600 seconds). An administrator can also limit its maximum value by specifying the --service-account-max-token-expiration option for the API server. The path field specifies a relative path to the mount point of the projected volume.
A container using a projected volume source as a subPath volume mount will not receive updates for those volume sources.
SecurityContext interactions
The proposal for file permission handling in projected service account volume enhancement introduced the projected files having the correct owner permissions set.
Linux
In Linux pods that have a projected volume and RunAsUser set in the Pod SecurityContext, the projected files have the correct ownership set, including container user ownership.
Windows
In Windows pods that have a projected volume and RunAsUsername set in the Pod SecurityContext, the ownership is not enforced due to the way user accounts are managed in Windows. Windows stores and manages local user and group accounts in a database file called Security Account Manager (SAM). Each container maintains its own instance of the SAM database, into which the host has no visibility while the container is running. Windows containers are designed to run the user mode portion of the OS in isolation from the host, hence the maintenance of a virtual SAM database. As a result, the kubelet running on the host does not have the ability to dynamically configure host file ownership for virtualized container accounts. It is recommended that if files on the host machine are to be shared with the container then they should be placed into their own volume mount outside of C:\.
By default, the projected files will have the following ownership as shown for an example projected volume file:
PS C:\> Get-Acl C:\var\run\secrets\kubernetes.io\serviceaccount\..2021_08_31_22_22_18.318230061\ca.crt | Format-List
Path : Microsoft.PowerShell.Core\FileSystem::C:\var\run\secrets\kubernetes.io\serviceaccount\..2021_08_31_22_22_18.318230061\ca.crt
Owner : BUILTIN\Administrators
Group : NT AUTHORITY\SYSTEM
Access : NT AUTHORITY\SYSTEM Allow FullControl
BUILTIN\Administrators Allow FullControl
BUILTIN\Users Allow ReadAndExecute, Synchronize
Audit :
Sddl : O:BAG:SYD:AI(A;ID;FA;;;SY)(A;ID;FA;;;BA)(A;ID;0x1200a9;;;BU)
This implies all administrator users like ContainerAdministrator will have read, write and execute access, while non-administrator users will have read and execute access.
In general, granting the container access to the host is discouraged as it can open the door for potential security exploits.
Creating a Windows Pod with RunAsUser in its SecurityContext will result in the Pod being stuck at ContainerCreating forever. So it is advised to not use the Linux-only RunAsUser option with Windows Pods.
3.6.4 - Ephemeral Volumes
This document describes ephemeral volumes in Kubernetes. Familiarity with volumes is suggested, in particular PersistentVolumeClaim and PersistentVolume.
Some applications need additional storage but don't care whether that data is stored persistently across restarts. For example, caching services are often limited by memory size and can move infrequently used data into storage that is slower than memory, with little impact on overall performance.
Other applications expect some read-only input data to be present in files, like configuration data or secret keys.
Ephemeral volumes are designed for these use cases. Because volumes follow the Pod's lifetime and get created and deleted along with the Pod, Pods can be stopped and restarted without being limited to where some persistent volume is available.
Ephemeral volumes are specified inline in the Pod spec, which simplifies application deployment and management.
Types of ephemeral volumes
Kubernetes supports several different kinds of ephemeral volumes for different purposes:
- emptyDir: empty at Pod startup, with storage coming locally from the kubelet base directory (usually the root disk) or RAM
- configMap, downwardAPI, secret: inject different kinds of Kubernetes data into a Pod
- CSI ephemeral volumes: similar to the previous volume kinds, but provided by special CSI drivers which specifically support this feature
- generic ephemeral volumes, which can be provided by all storage drivers that also support persistent volumes
emptyDir, configMap, downwardAPI, and secret are provided as local ephemeral storage. They are managed by the kubelet on each node.
CSI ephemeral volumes must be provided by third-party CSI storage drivers.
Generic ephemeral volumes can be provided by third-party CSI storage drivers, but also by any other storage driver that supports dynamic provisioning. Some CSI drivers are written specifically for CSI ephemeral volumes and do not support dynamic provisioning: those then cannot be used for generic ephemeral volumes.
The advantage of using third-party drivers is that they can offer functionality that Kubernetes itself does not support, for example storage with different performance characteristics than the disk that is managed by kubelet, or injecting different data.
CSI ephemeral volumes
Kubernetes v1.16 [beta]
This feature requires the CSIInlineVolume
feature gate to be enabled. It
is enabled by default starting with Kubernetes 1.16.
Conceptually, CSI ephemeral volumes are similar to configMap, downwardAPI and secret volume types: the storage is managed locally on each node and is created together with other local resources after a Pod has been scheduled onto a node. Kubernetes has no concept of rescheduling Pods anymore at this stage. Volume creation has to be unlikely to fail, otherwise Pod startup gets stuck. In particular, storage capacity aware Pod scheduling is not supported for these volumes. They are currently also not covered by the storage resource usage limits of a Pod, because that is something that kubelet can only enforce for storage that it manages itself.
Here's an example manifest for a Pod that uses CSI ephemeral storage:
kind: Pod
apiVersion: v1
metadata:
name: my-csi-app
spec:
containers:
- name: my-frontend
image: busybox:1.28
volumeMounts:
- mountPath: "/data"
name: my-csi-inline-vol
command: [ "sleep", "1000000" ]
volumes:
- name: my-csi-inline-vol
csi:
driver: inline.storage.kubernetes.io
volumeAttributes:
foo: bar
The volumeAttributes determine what volume is prepared by the driver. These attributes are specific to each driver and not standardized. See the documentation of each CSI driver for further instructions.
CSI driver restrictions
Kubernetes v1.21 [deprecated]
As a cluster administrator, you can use a PodSecurityPolicy to control which CSI drivers can be used in a Pod, specified with the allowedCSIDrivers field.
Generic ephemeral volumes
Kubernetes v1.23 [stable]
Generic ephemeral volumes are similar to emptyDir volumes in the sense that they provide a per-pod directory for scratch data that is usually empty after provisioning. But they may also have additional features:
- Storage can be local or network-attached.
- Volumes can have a fixed size that Pods are not able to exceed.
- Volumes may have some initial data, depending on the driver and parameters.
- Typical operations on volumes are supported assuming that the driver supports them, including snapshotting, cloning, resizing, and storage capacity tracking.
Example:
kind: Pod
apiVersion: v1
metadata:
name: my-app
spec:
containers:
- name: my-frontend
image: busybox:1.28
volumeMounts:
- mountPath: "/scratch"
name: scratch-volume
command: [ "sleep", "1000000" ]
volumes:
- name: scratch-volume
ephemeral:
volumeClaimTemplate:
metadata:
labels:
type: my-frontend-volume
spec:
accessModes: [ "ReadWriteOnce" ]
storageClassName: "scratch-storage-class"
resources:
requests:
storage: 1Gi
Lifecycle and PersistentVolumeClaim
The key design idea is that the parameters for a volume claim are allowed inside a volume source of the Pod. Labels, annotations and the whole set of fields for a PersistentVolumeClaim are supported. When such a Pod gets created, the ephemeral volume controller then creates an actual PersistentVolumeClaim object in the same namespace as the Pod and ensures that the PersistentVolumeClaim gets deleted when the Pod gets deleted.
That triggers volume binding and/or provisioning, either immediately if the StorageClass uses immediate volume binding or when the Pod is tentatively scheduled onto a node (WaitForFirstConsumer volume binding mode). The latter is recommended for generic ephemeral volumes because then the scheduler is free to choose a suitable node for the Pod. With immediate binding, the scheduler is forced to select a node that has access to the volume once it is available.
In terms of resource ownership, a Pod that has generic ephemeral storage is the owner of the PersistentVolumeClaim(s) that provide that ephemeral storage. When the Pod is deleted, the Kubernetes garbage collector deletes the PVC, which then usually triggers deletion of the volume because the default reclaim policy of storage classes is to delete volumes. You can create quasi-ephemeral local storage using a StorageClass with a reclaim policy of retain: the storage outlives the Pod, and in this case you need to ensure that volume cleanup happens separately.
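As a minimal sketch of that approach (the class name and the CSI driver are hypothetical, not a specific product):
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: scratch-retained             # hypothetical name
provisioner: example.csi.vendor.com  # hypothetical CSI driver
reclaimPolicy: Retain                # the volume outlives the Pod; clean it up separately
volumeBindingMode: WaitForFirstConsumer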
While these PVCs exist, they can be used like any other PVC. In particular, they can be referenced as data source in volume cloning or snapshotting. The PVC object also holds the current status of the volume.
PersistentVolumeClaim naming
Naming of the automatically created PVCs is deterministic: the name is a combination of Pod name and volume name, with a hyphen (-) in the middle. In the example above, the PVC name will be my-app-scratch-volume. This deterministic naming makes it easier to interact with the PVC because one does not have to search for it once the Pod name and volume name are known.
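As a rough sketch (not the controller's exact output), the automatically created PVC for the example above would look approximately like this:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-app-scratch-volume   # <Pod name>-<volume name>
  labels:
    type: my-frontend-volume    # copied from the volumeClaimTemplate
  # ownerReferences: set to the Pod "my-app", so the garbage
  # collector deletes this PVC together with the Pod
spec:
  accessModes: [ "ReadWriteOnce" ]
  storageClassName: "scratch-storage-class"
  resources:
    requests:
      storage: 1Gi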
The deterministic naming also introduces a potential conflict between different Pods (a Pod "pod-a" with volume "scratch" and another Pod with name "pod" and volume "a-scratch" both end up with the same PVC name "pod-a-scratch") and between Pods and manually created PVCs.
Such conflicts are detected: a PVC is only used for an ephemeral volume if it was created for the Pod. This check is based on the ownership relationship. An existing PVC is not overwritten or modified. But this does not resolve the conflict because without the right PVC, the Pod cannot start.
Security
Enabling the GenericEphemeralVolume feature allows users to create PVCs indirectly if they can create Pods, even if they do not have permission to create PVCs directly. Cluster administrators must be aware of this. If this does not fit their security model, they have two choices:
- Use a Pod Security Policy where the volumes list does not contain the ephemeral volume type (deprecated in Kubernetes 1.21).
- Use an admission webhook which rejects objects like Pods that have a generic ephemeral volume.
The normal namespace quota for PVCs still applies, so even if users are allowed to use this new mechanism, they cannot use it to circumvent other policies.
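For example, a ResourceQuota along these lines (the name and namespace are hypothetical) caps PVCs in a namespace whether they are created directly or through generic ephemeral volumes:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: pvc-quota        # hypothetical name
  namespace: team-a      # hypothetical namespace
spec:
  hard:
    persistentvolumeclaims: "10"   # total number of PVCs in the namespace
    requests.storage: 100Gi        # total requested storage in the namespace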
What's next
Ephemeral volumes managed by kubelet
CSI ephemeral volumes
- For more information on the design, see the Ephemeral Inline CSI volumes KEP.
- For more information on further development of this feature, see the enhancement tracking issue #596.
Generic ephemeral volumes
- For more information on the design, see the Generic ephemeral inline volumes KEP.
3.6.5 - Storage Classes
This document describes the concept of a StorageClass in Kubernetes. Familiarity with volumes and persistent volumes is suggested.
Introduction
A StorageClass provides a way for administrators to describe the "classes" of storage they offer. Different classes might map to quality-of-service levels, or to backup policies, or to arbitrary policies determined by the cluster administrators. Kubernetes itself is unopinionated about what classes represent. This concept is sometimes called "profiles" in other storage systems.
The StorageClass Resource
Each StorageClass contains the fields provisioner, parameters, and reclaimPolicy, which are used when a PersistentVolume belonging to the class needs to be dynamically provisioned.
The name of a StorageClass object is significant, and is how users can request a particular class. Administrators set the name and other parameters of a class when first creating StorageClass objects, and the objects cannot be updated once they are created.
Administrators can specify a default StorageClass only for PVCs that don't request any particular class to bind to: see the PersistentVolumeClaim section for details.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: standard
provisioner: kubernetes.io/aws-ebs
parameters:
type: gp3
reclaimPolicy: Retain
allowVolumeExpansion: true
mountOptions:
- debug
volumeBindingMode: Immediate
Provisioner
Each StorageClass has a provisioner that determines what volume plugin is used for provisioning PVs. This field must be specified.
Volume Plugin | Internal Provisioner | Config Example |
---|---|---|
AWSElasticBlockStore | ✓ | AWS EBS |
AzureFile | ✓ | Azure File |
AzureDisk | ✓ | Azure Disk |
CephFS | - | - |
Cinder | ✓ | OpenStack Cinder |
FC | - | - |
FlexVolume | - | - |
Flocker | ✓ | - |
GCEPersistentDisk | ✓ | GCE PD |
Glusterfs | ✓ | Glusterfs |
iSCSI | - | - |
Quobyte | ✓ | Quobyte |
NFS | - | NFS |
RBD | ✓ | Ceph RBD |
VsphereVolume | ✓ | vSphere |
PortworxVolume | ✓ | Portworx Volume |
ScaleIO | ✓ | ScaleIO |
StorageOS | ✓ | StorageOS |
Local | - | Local |
You are not restricted to specifying the "internal" provisioners listed here (whose names are prefixed with "kubernetes.io" and shipped alongside Kubernetes). You can also run and specify external provisioners, which are independent programs that follow a specification defined by Kubernetes. Authors of external provisioners have full discretion over where their code lives, how the provisioner is shipped, how it needs to be run, what volume plugin it uses (including Flex), etc. The repository kubernetes-sigs/sig-storage-lib-external-provisioner houses a library for writing external provisioners that implements the bulk of the specification. Some external provisioners are listed under the repository kubernetes-sigs/sig-storage-lib-external-provisioner.
For example, NFS doesn't provide an internal provisioner, but an external provisioner can be used. There are also cases where third-party storage vendors provide their own external provisioner.
Reclaim Policy
PersistentVolumes that are dynamically created by a StorageClass will have the reclaim policy specified in the reclaimPolicy field of the class, which can be either Delete or Retain. If no reclaimPolicy is specified when a StorageClass object is created, it will default to Delete.
PersistentVolumes that are created manually and managed via a StorageClass will have whatever reclaim policy they were assigned at creation.
Allow Volume Expansion
Kubernetes v1.11 [beta]
PersistentVolumes can be configured to be expandable. When this feature is set to true, it allows users to resize the volume by editing the corresponding PVC object.
The following types of volumes support volume expansion when the underlying StorageClass has the field allowVolumeExpansion set to true.
Volume type | Required Kubernetes version |
---|---|
gcePersistentDisk | 1.11 |
awsElasticBlockStore | 1.11 |
Cinder | 1.11 |
glusterfs | 1.11 |
rbd | 1.11 |
Azure File | 1.11 |
Azure Disk | 1.11 |
Portworx | 1.11 |
FlexVolume | 1.13 |
CSI | 1.14 (alpha), 1.16 (beta) |
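For example, assuming a claim named data-volume (an illustrative name) was created with 8Gi against a class that sets allowVolumeExpansion: true, a user could later raise the request; the volume is then expanded in place rather than recreated. Note that shrinking is not supported: the new value must be larger than the old one.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-volume             # illustrative name
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: standard    # a class with allowVolumeExpansion: true
  resources:
    requests:
      storage: 16Gi             # raised from the original 8Gi to trigger expansion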
Mount Options
PersistentVolumes that are dynamically created by a StorageClass will have the
mount options specified in the mountOptions
field of the class.
If the volume plugin does not support mount options but mount options are specified, provisioning will fail. Mount options are not validated on either the class or PV. If a mount option is invalid, the PV mount fails.
Volume Binding Mode
The volumeBindingMode field controls when volume binding and dynamic provisioning should occur. When unset, "Immediate" mode is used by default.
The Immediate mode indicates that volume binding and dynamic provisioning occur once the PersistentVolumeClaim is created. For storage backends that are topology-constrained and not globally accessible from all Nodes in the cluster, PersistentVolumes will be bound or provisioned without knowledge of the Pod's scheduling requirements. This may result in unschedulable Pods.
A cluster administrator can address this issue by specifying the WaitForFirstConsumer mode, which will delay the binding and provisioning of a PersistentVolume until a Pod using the PersistentVolumeClaim is created.
PersistentVolumes will be selected or provisioned conforming to the topology that is specified by the Pod's scheduling constraints. These include, but are not limited to, resource requirements, node selectors, pod affinity and anti-affinity, and taints and tolerations.
The following plugins support WaitForFirstConsumer with dynamic provisioning:
- AWSElasticBlockStore
- GCEPersistentDisk
- AzureDisk
The following plugins support WaitForFirstConsumer with pre-created PersistentVolume binding:
- All of the above
- Local
Kubernetes v1.17 [stable]
If you choose to use WaitForFirstConsumer, do not use nodeName in the Pod spec to specify node affinity. If nodeName is used in this case, the scheduler will be bypassed and the PVC will remain in a pending state.
Instead, you can use a node selector for hostname, as shown below.
apiVersion: v1
kind: Pod
metadata:
name: task-pv-pod
spec:
nodeSelector:
kubernetes.io/hostname: kube-01
volumes:
- name: task-pv-storage
persistentVolumeClaim:
claimName: task-pv-claim
containers:
- name: task-pv-container
image: nginx
ports:
- containerPort: 80
name: "http-server"
volumeMounts:
- mountPath: "/usr/share/nginx/html"
name: task-pv-storage
Allowed Topologies
When a cluster operator specifies the WaitForFirstConsumer
volume binding mode, it is no longer necessary
to restrict provisioning to specific topologies in most situations. However,
if still required, allowedTopologies
can be specified.
This example demonstrates how to restrict the topology of provisioned volumes to specific
zones and should be used as a replacement for the zone
and zones
parameters for the
supported plugins.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: standard
provisioner: kubernetes.io/gce-pd
parameters:
type: pd-standard
volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
- matchLabelExpressions:
- key: failure-domain.beta.kubernetes.io/zone
values:
- us-central-1a
- us-central-1b
Parameters
Storage Classes have parameters that describe volumes belonging to the storage class. Different parameters may be accepted depending on the provisioner. For example, the value io1, for the parameter type, and the parameter iopsPerGB are specific to EBS. When a parameter is omitted, some default is used.
There can be at most 512 parameters defined for a StorageClass. The total length of the parameters object including its keys and values cannot exceed 256 KiB.
AWS EBS
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: slow
provisioner: kubernetes.io/aws-ebs
parameters:
type: io1
iopsPerGB: "10"
fsType: ext4
- type: io1, gp2, gp3, sc1, st1. See AWS docs for details. Default: gp3.
- zone (Deprecated): AWS zone. If neither zone nor zones is specified, volumes are generally round-robin-ed across all active zones where the Kubernetes cluster has a node. The zone and zones parameters must not be used at the same time.
- zones (Deprecated): A comma separated list of AWS zone(s). If neither zone nor zones is specified, volumes are generally round-robin-ed across all active zones where the Kubernetes cluster has a node. The zone and zones parameters must not be used at the same time.
- iopsPerGB: only for io1 volumes. I/O operations per second per GiB. The AWS volume plugin multiplies this with the size of the requested volume to compute IOPS of the volume and caps it at 20,000 IOPS (the maximum supported by AWS, see AWS docs). A string is expected here, i.e. "10", not 10.
- fsType: fsType that is supported by kubernetes. Default: "ext4".
- encrypted: denotes whether the EBS volume should be encrypted or not. Valid values are "true" or "false". A string is expected here, i.e. "true", not true.
- kmsKeyId: optional. The full Amazon Resource Name of the key to use when encrypting the volume. If none is supplied but encrypted is true, a key is generated by AWS. See AWS docs for a valid ARN value.
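Putting several of these parameters together, a hedged sketch of an encrypted class (the class name is hypothetical) could look like this:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: encrypted-gp3           # hypothetical name
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp3
  encrypted: "true"             # a string, not a boolean
  # kmsKeyId: <full key ARN>    # optional; AWS generates a key when omitted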
GCE PD
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: slow
provisioner: kubernetes.io/gce-pd
parameters:
type: pd-standard
fstype: ext4
replication-type: none
- type: pd-standard or pd-ssd. Default: pd-standard.
- zone (Deprecated): GCE zone. If neither zone nor zones is specified, volumes are generally round-robin-ed across all active zones where the Kubernetes cluster has a node. The zone and zones parameters must not be used at the same time.
- zones (Deprecated): A comma separated list of GCE zone(s). If neither zone nor zones is specified, volumes are generally round-robin-ed across all active zones where the Kubernetes cluster has a node. The zone and zones parameters must not be used at the same time.
- fstype: ext4 or xfs. Default: ext4. The defined filesystem type must be supported by the host operating system.
- replication-type: none or regional-pd. Default: none.
If replication-type is set to none, a regular (zonal) PD will be provisioned.
If replication-type is set to regional-pd, a Regional Persistent Disk will be provisioned. It's highly recommended to have volumeBindingMode: WaitForFirstConsumer set, in which case when you create a Pod that consumes a PersistentVolumeClaim which uses this StorageClass, a Regional Persistent Disk is provisioned with two zones. One zone is the same as the zone that the Pod is scheduled in. The other zone is randomly picked from the zones available to the cluster. Disk zones can be further constrained using allowedTopologies.
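A sketch of such a class, assuming a hypothetical name and pd-ssd disks, might be:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: regionalpd-storageclass   # hypothetical name
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd
  replication-type: regional-pd
volumeBindingMode: WaitForFirstConsumer   # recommended, as described above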
Glusterfs
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: slow
provisioner: kubernetes.io/glusterfs
parameters:
resturl: "http://127.0.0.1:8081"
clusterid: "630372ccdc720a92c681fb928f27b53f"
restauthenabled: "true"
restuser: "admin"
secretNamespace: "default"
secretName: "heketi-secret"
gidMin: "40000"
gidMax: "50000"
volumetype: "replicate:3"
- resturl: Gluster REST service/Heketi service URL which provisions gluster volumes on demand. The general format should be IPaddress:Port and this is a mandatory parameter for the GlusterFS dynamic provisioner. If the Heketi service is exposed as a routable service in an openshift/kubernetes setup, this can have a format similar to http://heketi-storage-project.cloudapps.mystorage.com, where the fqdn is a resolvable Heketi service URL.
- restauthenabled: Gluster REST service authentication boolean that enables authentication to the REST server. If this value is "true", restuser and restuserkey or secretNamespace + secretName have to be filled. This option is deprecated; authentication is enabled when any of restuser, restuserkey, secretName or secretNamespace is specified.
- restuser: Gluster REST service/Heketi user who has access to create volumes in the Gluster Trusted Pool.
- restuserkey: Gluster REST service/Heketi user's password which will be used for authentication to the REST server. This parameter is deprecated in favor of secretNamespace + secretName.
- secretNamespace, secretName: Identification of the Secret instance that contains the user password to use when talking to the Gluster REST service. These parameters are optional; an empty password will be used when both secretNamespace and secretName are omitted. The provided secret must have type "kubernetes.io/glusterfs", for example created in this way:
  kubectl create secret generic heketi-secret \
    --type="kubernetes.io/glusterfs" --from-literal=key='opensesame' \
    --namespace=default
  An example of a secret can be found in glusterfs-provisioning-secret.yaml.
- clusterid: 630372ccdc720a92c681fb928f27b53f is the ID of the cluster which will be used by Heketi when provisioning the volume. It can also be a list of clusterids, for example: "8452344e2becec931ece4e33c4674e4e,42982310de6c63381718ccfa6d8cf397". This is an optional parameter.
- gidMin, gidMax: The minimum and maximum value of the GID range for the StorageClass. A unique value (GID) in this range (gidMin-gidMax) will be used for dynamically provisioned volumes. These are optional values. If not specified, the volume will be provisioned with a value between 2000 and 2147483647, which are the defaults for gidMin and gidMax respectively.
- volumetype: The volume type and its parameters can be configured with this optional value. If the volume type is not mentioned, it's up to the provisioner to decide the volume type. For example:
  - Replica volume: volumetype: replicate:3 where '3' is the replica count.
  - Disperse/EC volume: volumetype: disperse:4:2 where '4' is the data count and '2' is the redundancy count.
  - Distribute volume: volumetype: none
  For available volume types and administration options, refer to the Administration Guide.
  For further reference information, see How to configure Heketi.
  When persistent volumes are dynamically provisioned, the Gluster plugin automatically creates an endpoint and a headless service with the name gluster-dynamic-<claimname>. The dynamic endpoint and service are automatically deleted when the persistent volume claim is deleted.
NFS
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: example-nfs
provisioner: example.com/external-nfs
parameters:
server: nfs-server.example.com
path: /share
readOnly: "false"
- server: Server is the hostname or IP address of the NFS server.
- path: Path that is exported by the NFS server.
- readOnly: A flag indicating whether the storage will be mounted as read only (default false).
Kubernetes doesn't include an internal NFS provisioner. You need to use an external provisioner to create a StorageClass for NFS. Here are some examples: the NFS Ganesha server and external provisioner, and the NFS subdir external provisioner.
OpenStack Cinder
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: gold
provisioner: kubernetes.io/cinder
parameters:
availability: nova
- availability: Availability Zone. If not specified, volumes are generally round-robin-ed across all active zones where the Kubernetes cluster has a node.
Kubernetes v1.11 [deprecated]
This internal provisioner of OpenStack is deprecated. Please use the external cloud provider for OpenStack.
vSphere
There are two types of provisioners for vSphere storage classes:
- CSI provisioner: csi.vsphere.vmware.com
- vCP provisioner: kubernetes.io/vsphere-volume
In-tree provisioners are deprecated. For more information on the CSI provisioner, see Kubernetes vSphere CSI Driver and vSphereVolume CSI migration.
CSI Provisioner
The vSphere CSI StorageClass provisioner works with Tanzu Kubernetes clusters. For an example, refer to the vSphere CSI repository.
vCP Provisioner
The following examples use the VMware Cloud Provider (vCP) StorageClass provisioner.
- Create a StorageClass with a user specified disk format.
  apiVersion: storage.k8s.io/v1
  kind: StorageClass
  metadata:
    name: fast
  provisioner: kubernetes.io/vsphere-volume
  parameters:
    diskformat: zeroedthick
  diskformat: thin, zeroedthick and eagerzeroedthick. Default: "thin".
- Create a StorageClass with a disk format on a user specified datastore.
  apiVersion: storage.k8s.io/v1
  kind: StorageClass
  metadata:
    name: fast
  provisioner: kubernetes.io/vsphere-volume
  parameters:
    diskformat: zeroedthick
    datastore: VSANDatastore
  datastore: The user can also specify the datastore in the StorageClass. The volume will be created on the datastore specified in the StorageClass, which in this case is VSANDatastore. This field is optional. If the datastore is not specified, then the volume will be created on the datastore specified in the vSphere config file used to initialize the vSphere Cloud Provider.
- Storage Policy Management inside kubernetes
  - Using existing vCenter SPBM policy
    One of the most important features of vSphere for Storage Management is policy based management. Storage Policy Based Management (SPBM) is a storage policy framework that provides a single unified control plane across a broad range of data services and storage solutions. SPBM enables vSphere administrators to overcome upfront storage provisioning challenges, such as capacity planning, differentiated service levels and managing capacity headroom.
    The SPBM policies can be specified in the StorageClass using the storagePolicyName parameter.
  - Virtual SAN policy support inside Kubernetes
    vSphere Infrastructure (VI) admins will have the ability to specify custom Virtual SAN storage capabilities during dynamic volume provisioning. You can now define storage requirements, such as performance and availability, in the form of storage capabilities during dynamic volume provisioning. The storage capability requirements are converted into a Virtual SAN policy which is then pushed down to the Virtual SAN layer when a persistent volume (virtual disk) is being created. The virtual disk is distributed across the Virtual SAN datastore to meet the requirements.
    See Storage Policy Based Management for dynamic provisioning of volumes for more details on how to use storage policies for persistent volume management.
- There are a few vSphere examples which you can try out for persistent volume management inside Kubernetes for vSphere.
Ceph RBD
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: fast
provisioner: kubernetes.io/rbd
parameters:
monitors: 10.16.153.105:6789
adminId: kube
adminSecretName: ceph-secret
adminSecretNamespace: kube-system
pool: kube
userId: kube
userSecretName: ceph-secret-user
userSecretNamespace: default
fsType: ext4
imageFormat: "2"
imageFeatures: "layering"
- monitors: Ceph monitors, comma delimited. This parameter is required.
- adminId: Ceph client ID that is capable of creating images in the pool. Default is "admin".
- adminSecretName: Secret Name for adminId. This parameter is required. The provided secret must have type "kubernetes.io/rbd".
- adminSecretNamespace: The namespace for adminSecretName. Default is "default".
- pool: Ceph RBD pool. Default is "rbd".
- userId: Ceph client ID that is used to map the RBD image. Default is the same as adminId.
- userSecretName: The name of the Ceph Secret for userId to map the RBD image. It must exist in the same namespace as PVCs. This parameter is required. The provided secret must have type "kubernetes.io/rbd", for example created in this way:
  kubectl create secret generic ceph-secret --type="kubernetes.io/rbd" \
    --from-literal=key='QVFEQ1pMdFhPUnQrSmhBQUFYaERWNHJsZ3BsMmNjcDR6RFZST0E9PQ==' \
    --namespace=kube-system
- userSecretNamespace: The namespace for userSecretName.
- fsType: fsType that is supported by kubernetes. Default: "ext4".
- imageFormat: Ceph RBD image format, "1" or "2". Default is "2".
- imageFeatures: This parameter is optional and should only be used if you set imageFormat to "2". Currently supported features are layering only. Default is "", and no features are turned on.
Quobyte
Kubernetes v1.22 [deprecated]
The Quobyte in-tree storage plugin is deprecated. An example StorageClass for the out-of-tree Quobyte plugin can be found at the Quobyte CSI repository.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: slow
provisioner: kubernetes.io/quobyte
parameters:
quobyteAPIServer: "http://138.68.74.142:7860"
registry: "138.68.74.142:7861"
adminSecretName: "quobyte-admin-secret"
adminSecretNamespace: "kube-system"
user: "root"
group: "root"
quobyteConfig: "BASE"
quobyteTenant: "DEFAULT"
- quobyteAPIServer: API Server of Quobyte in the format "http(s)://api-server:7860".
- registry: Quobyte registry to use to mount the volume. You can specify the registry as a <host>:<port> pair or, if you want to specify multiple registries, put a comma between them: <host1>:<port>,<host2>:<port>,<host3>:<port>. The host can be an IP address or, if you have a working DNS, you can also provide the DNS names.
- adminSecretNamespace: The namespace for adminSecretName. Default is "default".
- adminSecretName: secret that holds information about the Quobyte user and the password to authenticate against the API server. The provided secret must have type "kubernetes.io/quobyte" and the keys user and password, for example:
  kubectl create secret generic quobyte-admin-secret \
    --type="kubernetes.io/quobyte" --from-literal=user='admin' --from-literal=password='opensesame' \
    --namespace=kube-system
- user: maps all access to this user. Default is "root".
- group: maps all access to this group. Default is "nfsnobody".
- quobyteConfig: use the specified configuration to create the volume. You can create a new configuration or modify an existing one with the Web console or the quobyte CLI. Default is "BASE".
- quobyteTenant: use the specified tenant ID to create/delete the volume. This Quobyte tenant has to be already present in Quobyte. Default is "DEFAULT".
Azure Disk
Azure Unmanaged Disk storage class
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: slow
provisioner: kubernetes.io/azure-disk
parameters:
skuName: Standard_LRS
location: eastus
storageAccount: azure_storage_account_name
- skuName: Azure storage account Sku tier. Default is empty.
- location: Azure storage account location. Default is empty.
- storageAccount: Azure storage account name. If a storage account is provided, it must reside in the same resource group as the cluster, and location is ignored. If a storage account is not provided, a new storage account will be created in the same resource group as the cluster.
Azure Disk storage class (starting from v1.7.2)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: slow
provisioner: kubernetes.io/azure-disk
parameters:
storageaccounttype: Standard_LRS
kind: managed
- storageaccounttype: Azure storage account Sku tier. Default is empty.
- kind: Possible values are shared, dedicated, and managed (default). When kind is shared, all unmanaged disks are created in a few shared storage accounts in the same resource group as the cluster. When kind is dedicated, a new dedicated storage account will be created for the new unmanaged disk in the same resource group as the cluster. When kind is managed, all managed disks are created in the same resource group as the cluster.
- resourceGroup: Specify the resource group in which the Azure disk will be created. It must be an existing resource group name. If it is unspecified, the disk will be placed in the same resource group as the current Kubernetes cluster.
- Premium VM can attach both Standard_LRS and Premium_LRS disks, while Standard VM can only attach Standard_LRS disks.
- Managed VM can only attach managed disks and unmanaged VM can only attach unmanaged disks.
Azure File
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: azurefile
provisioner: kubernetes.io/azure-file
parameters:
skuName: Standard_LRS
location: eastus
storageAccount: azure_storage_account_name
- skuName: Azure storage account Sku tier. Default is empty.
- location: Azure storage account location. Default is empty.
- storageAccount: Azure storage account name. Default is empty. If a storage account is not provided, all storage accounts associated with the resource group are searched to find one that matches skuName and location. If a storage account is provided, it must reside in the same resource group as the cluster, and skuName and location are ignored.
- secretNamespace: the namespace of the secret that contains the Azure Storage Account Name and Key. Default is the same as the Pod.
- secretName: the name of the secret that contains the Azure Storage Account Name and Key. Default is azure-storage-account-<accountName>-secret.
- readOnly: a flag indicating whether the storage will be mounted as read only. Defaults to false, which means a read/write mount. This setting will impact the ReadOnly setting in VolumeMounts as well.
During storage provisioning, a secret named by secretName is created for the mounting credentials. If the cluster has enabled both RBAC and Controller Roles, add the create permission of resource secret for clusterrole system:controller:persistent-volume-binder.
In a multi-tenancy context, it is strongly recommended to set the value for secretNamespace explicitly, otherwise the storage account credentials may be read by other users.
Portworx Volume
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: portworx-io-priority-high
provisioner: kubernetes.io/portworx-volume
parameters:
repl: "1"
snap_interval: "70"
priority_io: "high"
- fs: filesystem to be laid out: none/xfs/ext4 (default: ext4).
- block_size: block size in Kbytes (default: 32).
- repl: number of synchronous replicas to be provided in the form of a replication factor 1..3 (default: 1). A string is expected here, i.e. "1" and not 1.
- priority_io: determines whether the volume will be created from higher performance or lower priority storage: high/medium/low (default: low).
- snap_interval: clock/time interval in minutes for when to trigger snapshots. Snapshots are incremental based on the difference with the prior snapshot; 0 disables snaps (default: 0). A string is expected here, i.e. "70" and not 70.
- aggregation_level: specifies the number of chunks the volume would be distributed into; 0 indicates a non-aggregated volume (default: 0). A string is expected here, i.e. "0" and not 0.
- ephemeral: specifies whether the volume should be cleaned up after unmount or should be persistent. The emptyDir use case can set this value to true, and the persistent volumes use case, such as for databases like Cassandra, should set it to false: true/false (default false). A string is expected here, i.e. "true" and not true.
ScaleIO
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: slow
provisioner: kubernetes.io/scaleio
parameters:
gateway: https://192.168.99.200:443/api
system: scaleio
protectionDomain: pd0
storagePool: sp1
storageMode: ThinProvisioned
secretRef: sio-secret
readOnly: "false"
fsType: xfs
- provisioner: attribute is set to kubernetes.io/scaleio
- gateway: address to a ScaleIO API gateway (required)
- system: the name of the ScaleIO system (required)
- protectionDomain: the name of the ScaleIO protection domain (required)
- storagePool: the name of the volume storage pool (required)
- storageMode: the storage provision mode: ThinProvisioned (default) or ThickProvisioned
- secretRef: reference to a configured Secret object (required)
- readOnly: specifies the access mode to the mounted volume (default false)
- fsType: the file system to use for the volume (default ext4)
The ScaleIO Kubernetes volume plugin requires a configured Secret object.
The secret must be created with type kubernetes.io/scaleio
and use the same
namespace value as that of the PVC where it is referenced
as shown in the following command:
kubectl create secret generic sio-secret --type="kubernetes.io/scaleio" \
--from-literal=username=sioadmin --from-literal=password=d2NABDNjMA== \
--namespace=default
StorageOS
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: fast
provisioner: kubernetes.io/storageos
parameters:
pool: default
description: Kubernetes volume
fsType: ext4
adminSecretNamespace: default
adminSecretName: storageos-secret
- pool: The name of the StorageOS distributed capacity pool to provision the volume from. Uses the default pool, which is normally present, if not specified.
- description: The description to assign to volumes that were created dynamically. All volume descriptions will be the same for the storage class, but different storage classes can be used to allow descriptions for different use cases. Defaults to Kubernetes volume.
- fsType: The default filesystem type to request. Note that user-defined rules within StorageOS may override this value. Defaults to ext4.
- adminSecretNamespace: The namespace where the API configuration secret is located. Required if adminSecretName is set.
- adminSecretName: The name of the secret to use for obtaining the StorageOS API credentials. If not specified, default values will be attempted.
The StorageOS Kubernetes volume plugin can use a Secret object to specify an
endpoint and credentials to access the StorageOS API. This is only required when
the defaults have been changed.
The secret must be created with type kubernetes.io/storageos
as shown in the
following command:
kubectl create secret generic storageos-secret \
--type="kubernetes.io/storageos" \
--from-literal=apiAddress=tcp://localhost:5705 \
--from-literal=apiUsername=storageos \
--from-literal=apiPassword=storageos \
--namespace=default
Secrets used for dynamically provisioned volumes may be created in any namespace
and referenced with the adminSecretNamespace
parameter. Secrets used by
pre-provisioned volumes must be created in the same namespace as the PVC that
references it.
Local
Kubernetes v1.14 [stable]
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
Local volumes do not currently support dynamic provisioning; however, a StorageClass should still be created to delay volume binding until Pod scheduling. This is specified by the WaitForFirstConsumer volume binding mode.
Delaying volume binding allows the scheduler to consider all of a Pod's scheduling constraints when choosing an appropriate PersistentVolume for a PersistentVolumeClaim.
3.6.6 - Dynamic Volume Provisioning
Dynamic volume provisioning allows storage volumes to be created on-demand.
Without dynamic provisioning, cluster administrators have to manually make
calls to their cloud or storage provider to create new storage volumes, and
then create PersistentVolume
objects
to represent them in Kubernetes. The dynamic provisioning feature eliminates
the need for cluster administrators to pre-provision storage. Instead, it
automatically provisions storage when it is requested by users.
Background
The implementation of dynamic volume provisioning is based on the API object StorageClass from the API group storage.k8s.io. A cluster administrator can define as many StorageClass objects as needed, each specifying a volume plugin (aka provisioner) that provisions a volume and the set of parameters to pass to that provisioner when provisioning.
A cluster administrator can define and expose multiple flavors of storage (from
the same or different storage systems) within a cluster, each with a custom set
of parameters. This design also ensures that end users don't have to worry
about the complexity and nuances of how storage is provisioned, but still
have the ability to select from multiple storage options.
More information on storage classes can be found in the Storage Classes section.
Enabling Dynamic Provisioning
To enable dynamic provisioning, a cluster administrator needs to pre-create one or more StorageClass objects for users. StorageClass objects define which provisioner should be used and what parameters should be passed to that provisioner when dynamic provisioning is invoked. The name of a StorageClass object must be a valid DNS subdomain name.
The following manifest creates a storage class "slow" which provisions standard disk-like persistent disks.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: slow
provisioner: kubernetes.io/gce-pd
parameters:
type: pd-standard
The following manifest creates a storage class "fast" which provisions SSD-like persistent disks.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: fast
provisioner: kubernetes.io/gce-pd
parameters:
type: pd-ssd
Using Dynamic Provisioning
Users request dynamically provisioned storage by including a storage class in their PersistentVolumeClaim. Before Kubernetes v1.6, this was done via the volume.beta.kubernetes.io/storage-class annotation. However, this annotation has been deprecated since v1.9. Users now can and should instead use the storageClassName field of the PersistentVolumeClaim object. The value of this field must match the name of a StorageClass configured by the administrator (see below).
To select the "fast" storage class, for example, a user would create the following PersistentVolumeClaim:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: claim1
spec:
accessModes:
- ReadWriteOnce
storageClassName: fast
resources:
requests:
storage: 30Gi
This claim results in an SSD-like Persistent Disk being automatically provisioned. When the claim is deleted, the volume is destroyed.
Defaulting Behavior
Dynamic provisioning can be enabled on a cluster such that all claims are dynamically provisioned if no storage class is specified. A cluster administrator can enable this behavior by:
- Marking one StorageClass object as default;
- Making sure that the DefaultStorageClass admission controller is enabled on the API server.
An administrator can mark a specific StorageClass
as default by adding the
storageclass.kubernetes.io/is-default-class
annotation to it.
When a default StorageClass
exists in a cluster and a user creates a
PersistentVolumeClaim
with storageClassName
unspecified, the
DefaultStorageClass
admission controller automatically adds the
storageClassName
field pointing to the default storage class.
Note that there can be at most one default storage class on a cluster, or
a PersistentVolumeClaim
without storageClassName
explicitly specified cannot
be created.
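As a minimal sketch, marking a class as the default is only a matter of the annotation (the class shown reuses the gce-pd example; any existing class works the same way):
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-standard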
Topology Awareness
In Multi-Zone clusters, Pods can be spread across Zones in a Region. Single-Zone storage backends should be provisioned in the Zones where Pods are scheduled. This can be accomplished by setting the Volume Binding Mode.
3.6.7 - Volume Snapshots
In Kubernetes, a VolumeSnapshot represents a snapshot of a volume on a storage system. This document assumes that you are already familiar with Kubernetes persistent volumes.
Introduction
Similar to how the API resources PersistentVolume and PersistentVolumeClaim are used to provision volumes for users and administrators, the VolumeSnapshotContent and VolumeSnapshot API resources are provided to create volume snapshots for users and administrators.
A VolumeSnapshotContent is a snapshot taken from a volume in the cluster that has been provisioned by an administrator. It is a resource in the cluster just like a PersistentVolume is a cluster resource.
A VolumeSnapshot is a request for a snapshot of a volume by a user. It is similar to a PersistentVolumeClaim.
VolumeSnapshotClass allows you to specify different attributes belonging to a VolumeSnapshot. These attributes may differ among snapshots taken from the same volume on the storage system and therefore cannot be expressed by using the same StorageClass of a PersistentVolumeClaim.
Volume snapshots provide Kubernetes users with a standardized way to copy a volume's contents at a particular point in time without creating an entirely new volume. This functionality enables, for example, database administrators to backup databases before performing edit or delete modifications.
Users need to be aware of the following when using this feature:
- API objects VolumeSnapshot, VolumeSnapshotContent, and VolumeSnapshotClass are CRDs, not part of the core API.
- VolumeSnapshot support is only available for CSI drivers.
- As part of the deployment process of VolumeSnapshot, the Kubernetes team provides a snapshot controller to be deployed into the control plane, and a sidecar helper container called csi-snapshotter to be deployed together with the CSI driver. The snapshot controller watches VolumeSnapshot and VolumeSnapshotContent objects and is responsible for the creation and deletion of VolumeSnapshotContent objects. The sidecar csi-snapshotter watches VolumeSnapshotContent objects and triggers CreateSnapshot and DeleteSnapshot operations against a CSI endpoint.
- There is also a validating webhook server which provides tightened validation on snapshot objects. This should be installed by the Kubernetes distros along with the snapshot controller and CRDs, not CSI drivers. It should be installed in all Kubernetes clusters that have the snapshot feature enabled.
- CSI drivers may or may not have implemented the volume snapshot functionality. CSI drivers that provide support for volume snapshots will likely use the csi-snapshotter. See the CSI driver documentation for details.
- The CRDs and snapshot controller installations are the responsibility of the Kubernetes distribution.
Lifecycle of a volume snapshot and volume snapshot content
VolumeSnapshotContents are resources in the cluster. VolumeSnapshots are requests for those resources. The interaction between VolumeSnapshotContents and VolumeSnapshots follows this lifecycle:
Provisioning Volume Snapshot
There are two ways snapshots may be provisioned: pre-provisioned or dynamically provisioned.
Pre-provisioned
A cluster administrator creates a number of VolumeSnapshotContents. They carry the details of the real volume snapshot on the storage system which is available for use by cluster users. They exist in the Kubernetes API and are available for consumption.
Dynamic
Instead of using a pre-existing snapshot, you can request that a snapshot be dynamically taken from a PersistentVolumeClaim. The VolumeSnapshotClass specifies storage provider-specific parameters to use when taking a snapshot.
Binding
The snapshot controller handles the binding of a VolumeSnapshot
object with an appropriate VolumeSnapshotContent
object, in both pre-provisioned and dynamically provisioned scenarios. The binding is a one-to-one mapping.
In the case of pre-provisioned binding, the VolumeSnapshot will remain unbound until the requested VolumeSnapshotContent object is created.
Persistent Volume Claim as Snapshot Source Protection
The purpose of this protection is to ensure that in-use PersistentVolumeClaim API objects are not removed from the system while a snapshot is being taken from them (as this may result in data loss).
While a snapshot is being taken of a PersistentVolumeClaim, that PersistentVolumeClaim is in-use. If you delete a PersistentVolumeClaim API object in active use as a snapshot source, the PersistentVolumeClaim object is not removed immediately. Instead, removal of the PersistentVolumeClaim object is postponed until the snapshot is readyToUse or aborted.
Delete
Deletion is triggered by deleting the VolumeSnapshot object, and the DeletionPolicy will be followed. If the DeletionPolicy is Delete, then the underlying storage snapshot will be deleted along with the VolumeSnapshotContent object. If the DeletionPolicy is Retain, then both the underlying snapshot and VolumeSnapshotContent remain.
VolumeSnapshots
Each VolumeSnapshot contains a spec and a status.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
name: new-snapshot-test
spec:
volumeSnapshotClassName: csi-hostpath-snapclass
source:
persistentVolumeClaimName: pvc-test
persistentVolumeClaimName is the name of the PersistentVolumeClaim data source for the snapshot. This field is required for dynamically provisioning a snapshot.
A volume snapshot can request a particular class by specifying the name of a VolumeSnapshotClass using the attribute volumeSnapshotClassName. If nothing is set, then the default class is used if available.
For pre-provisioned snapshots, you need to specify a volumeSnapshotContentName as the source for the snapshot as shown in the following example. The volumeSnapshotContentName source field is required for pre-provisioned snapshots.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
name: test-snapshot
spec:
source:
volumeSnapshotContentName: test-content
Volume Snapshot Contents
Each VolumeSnapshotContent contains a spec and status. In dynamic provisioning, the snapshot common controller creates VolumeSnapshotContent
objects. Here is an example:
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotContent
metadata:
name: snapcontent-72d9a349-aacd-42d2-a240-d775650d2455
spec:
deletionPolicy: Delete
driver: hostpath.csi.k8s.io
source:
volumeHandle: ee0cfb94-f8d4-11e9-b2d8-0242ac110002
volumeSnapshotClassName: csi-hostpath-snapclass
volumeSnapshotRef:
name: new-snapshot-test
namespace: default
uid: 72d9a349-aacd-42d2-a240-d775650d2455
volumeHandle
is the unique identifier of the volume created on the storage backend and returned by the CSI driver during the volume creation. This field is required for dynamically provisioning a snapshot. It specifies the volume source of the snapshot.
For pre-provisioned snapshots, you (as cluster administrator) are responsible for creating the VolumeSnapshotContent
object as follows.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotContent
metadata:
name: new-snapshot-content-test
spec:
deletionPolicy: Delete
driver: hostpath.csi.k8s.io
source:
snapshotHandle: 7bdd0de3-aaeb-11e8-9aae-0242ac110002
volumeSnapshotRef:
name: new-snapshot-test
namespace: default
snapshotHandle is the unique identifier of the volume snapshot created on the storage backend. This field is required for pre-provisioned snapshots. It specifies the CSI snapshot id on the storage system that this VolumeSnapshotContent represents.
Provisioning Volumes from Snapshots
You can provision a new volume, pre-populated with data from a snapshot, by using
the dataSource field in the PersistentVolumeClaim
object.
For more details, see Volume Snapshot and Restore Volume from Snapshot.
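As a hedged sketch, restoring the snapshot shown earlier into a new volume could look like this; the PVC name and storage class are hypothetical, and the storage class must be backed by the same CSI driver that took the snapshot:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restored-pvc                  # hypothetical name
spec:
  storageClassName: csi-hostpath-sc   # hypothetical class backed by the same CSI driver
  dataSource:
    name: new-snapshot-test           # the VolumeSnapshot from the example above
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi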
3.6.8 - Volume Snapshot Classes
This document describes the concept of VolumeSnapshotClass in Kubernetes. Familiarity with volume snapshots and storage classes is suggested.
Introduction
Just like StorageClass provides a way for administrators to describe the "classes" of storage they offer when provisioning a volume, VolumeSnapshotClass provides a way to describe the "classes" of storage when provisioning a volume snapshot.
The VolumeSnapshotClass Resource
Each VolumeSnapshotClass contains the fields driver, deletionPolicy, and parameters, which are used when a VolumeSnapshot belonging to the class needs to be dynamically provisioned.
The name of a VolumeSnapshotClass object is significant, and is how users can request a particular class. Administrators set the name and other parameters of a class when first creating VolumeSnapshotClass objects, and the objects cannot be updated once they are created.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
name: csi-hostpath-snapclass
driver: hostpath.csi.k8s.io
deletionPolicy: Delete
parameters:
Administrators can specify a default VolumeSnapshotClass for VolumeSnapshots
that don't request any particular class to bind to by adding the
snapshot.storage.kubernetes.io/is-default-class: "true"
annotation:
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
name: csi-hostpath-snapclass
annotations:
snapshot.storage.kubernetes.io/is-default-class: "true"
driver: hostpath.csi.k8s.io
deletionPolicy: Delete
parameters:
Driver
Volume snapshot classes have a driver that determines what CSI volume plugin is used for provisioning VolumeSnapshots. This field must be specified.
DeletionPolicy
Volume snapshot classes have a deletionPolicy. It enables you to configure what happens to a VolumeSnapshotContent when the VolumeSnapshot object it is bound to is to be deleted. The deletionPolicy of a volume snapshot class can either be Retain or Delete. This field must be specified.
If the deletionPolicy is Delete, then the underlying storage snapshot will be deleted along with the VolumeSnapshotContent object. If the deletionPolicy is Retain, then both the underlying snapshot and VolumeSnapshotContent remain.
Parameters
Volume snapshot classes have parameters that describe volume snapshots belonging to the volume snapshot class. Different parameters may be accepted depending on the driver.
3.6.9 - CSI Volume Cloning
This document describes the concept of cloning existing CSI Volumes in Kubernetes. Familiarity with Volumes is suggested.
Introduction
The CSI Volume Cloning feature adds support for specifying existing PVCs in the dataSource
field to indicate a user would like to clone a Volume.
A Clone is defined as a duplicate of an existing Kubernetes Volume that can be consumed as any standard Volume would be. The only difference is that upon provisioning, rather than creating a "new" empty Volume, the back end device creates an exact duplicate of the specified Volume.
The implementation of cloning, from the perspective of the Kubernetes API, adds the ability to specify an existing PVC as a dataSource during new PVC creation. The source PVC must be bound and available (not in use).
Users need to be aware of the following when using this feature:
- Cloning support (VolumePVCDataSource) is only available for CSI drivers.
- Cloning support is only available for dynamic provisioners.
- CSI drivers may or may not have implemented the volume cloning functionality.
- You can only clone a PVC when it exists in the same namespace as the destination PVC (source and destination must be in the same namespace).
- Cloning is only supported within the same Storage Class.
  - Destination volume must be the same storage class as the source
  - Default storage class can be used and storageClassName omitted in the spec
- Cloning can only be performed between two volumes that use the same VolumeMode setting (if you request a block mode volume, the source MUST also be block mode)
Provisioning
Clones are provisioned like any other PVC with the exception of adding a dataSource that references an existing PVC in the same namespace.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: clone-of-pvc-1
namespace: myns
spec:
accessModes:
- ReadWriteOnce
storageClassName: cloning
resources:
requests:
storage: 5Gi
dataSource:
kind: PersistentVolumeClaim
name: pvc-1
You must specify a capacity value for spec.resources.requests.storage, and the value you specify must be the same or larger than the capacity of the source volume.
The result is a new PVC with the name clone-of-pvc-1 that has the exact same content as the specified source pvc-1.
Usage
Upon availability of the new PVC, the cloned PVC is consumed the same as any other PVC. It's also expected at this point that the newly created PVC is an independent object. It can be consumed, cloned, snapshotted, or deleted independently and without consideration for its original dataSource PVC. This also implies that the source is not linked in any way to the newly created clone; it may also be modified or deleted without affecting the newly created clone.
3.6.10 - Storage Capacity
Storage capacity is limited and may vary depending on the node on which a pod runs: network-attached storage might not be accessible by all nodes, or storage is local to a node to begin with.
Kubernetes v1.21 [beta]
This page describes how Kubernetes keeps track of storage capacity and how the scheduler uses that information to schedule Pods onto nodes that have access to enough storage capacity for the remaining missing volumes. Without storage capacity tracking, the scheduler may choose a node that doesn't have enough capacity to provision a volume and multiple scheduling retries will be needed.
Tracking storage capacity is supported for Container Storage Interface (CSI) drivers and needs to be enabled when installing a CSI driver.
API
There are two API extensions for this feature:
- CSIStorageCapacity objects: these get produced by a CSI driver in the namespace where the driver is installed. Each object contains capacity information for one storage class and defines which nodes have access to that storage.
- The CSIDriverSpec.StorageCapacity field: when set to true, the Kubernetes scheduler will consider storage capacity for volumes that use the CSI driver.
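To make the first extension concrete, a CSIStorageCapacity object published by a driver might look roughly like this (the names and the topology key are illustrative; the API group/version may differ by Kubernetes release):
apiVersion: storage.k8s.io/v1beta1
kind: CSIStorageCapacity
metadata:
  name: example-capacity          # typically a generated name
  namespace: example-csi-driver   # the namespace where the driver is installed
storageClassName: fast
capacity: 100Gi                   # storage still available in this segment
nodeTopology:                     # which nodes have access to this storage
  matchLabels:
    topology.example.com/zone: zone-a   # illustrative topology label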
Scheduling
Storage capacity information is used by the Kubernetes scheduler if:
- the CSIStorageCapacity feature gate is true,
- a Pod uses a volume that has not been created yet,
- that volume uses a StorageClass which references a CSI driver and uses WaitForFirstConsumer volume binding mode, and
- the CSIDriver object for the driver has StorageCapacity set to true.
In that case, the scheduler only considers nodes for the Pod which
have enough storage available to them. This check is very
simplistic and only compares the size of the volume against the
capacity listed in CSIStorageCapacity
objects with a topology that
includes the node.
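The last of the conditions above is configured in the driver's CSIDriver object; a minimal sketch (the driver name is reused from the snapshot examples purely for illustration) would be:
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: hostpath.csi.k8s.io   # must match the CSI driver's name
spec:
  storageCapacity: true       # tells the scheduler to consult CSIStorageCapacity objects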
For volumes with Immediate
volume binding mode, the storage driver
decides where to create the volume, independently of Pods that will
use the volume. The scheduler then schedules Pods onto nodes where the
volume is available after the volume has been created.
For CSI ephemeral volumes, scheduling always happens without considering storage capacity. This is based on the assumption that this volume type is only used by special CSI drivers which are local to a node and do not need significant resources there.
Rescheduling
When a node has been selected for a Pod with WaitForFirstConsumer
volumes, that decision is still tentative. The next step is that the
CSI storage driver gets asked to create the volume with a hint that the
volume is supposed to be available on the selected node.
Because Kubernetes might have chosen a node based on out-dated capacity information, it is possible that the volume cannot really be created. The node selection is then reset and the Kubernetes scheduler tries again to find a node for the Pod.
Limitations
Storage capacity tracking increases the chance that scheduling works on the first try, but cannot guarantee this because the scheduler has to decide based on potentially out-dated information. Usually, the same retry mechanism as for scheduling without any storage capacity information handles scheduling failures.
One situation where scheduling can fail permanently is when a Pod uses multiple volumes: one volume might have been created already in a topology segment which then does not have enough capacity left for another volume. Manual intervention is necessary to recover from this, for example by increasing capacity or deleting the volume that was already created. Further work is needed to handle this automatically.
Enabling storage capacity tracking
Storage capacity tracking is a beta feature and enabled by default in a Kubernetes cluster since Kubernetes 1.21. In addition to having the feature enabled in the cluster, a CSI driver also has to support it. Please refer to the driver's documentation for details.
What's next
- For more information on the design, see the Storage Capacity Constraints for Pod Scheduling KEP.
- For more information on further development of this feature, see the enhancement tracking issue #1472.
- Learn about Kubernetes Scheduler
3.6.11 - Node-specific Volume Limits
This page describes the maximum number of volumes that can be attached to a Node for various cloud providers.
Cloud providers like Google, Amazon, and Microsoft typically have a limit on how many volumes can be attached to a Node. It is important for Kubernetes to respect those limits. Otherwise, Pods scheduled on a Node could get stuck waiting for volumes to attach.
Kubernetes default limits
The Kubernetes scheduler has default limits on the number of volumes that can be attached to a Node:
Cloud service | Maximum volumes per Node |
---|---|
Amazon Elastic Block Store (EBS) | 39 |
Google Persistent Disk | 16 |
Microsoft Azure Disk Storage | 16 |
Custom limits
You can change these limits by setting the value of the KUBE_MAX_PD_VOLS environment variable, and then starting the scheduler. CSI drivers might have a different procedure; see their documentation on how to customize their limits.
Use caution if you set a limit that is higher than the default limit. Consult the cloud provider's documentation to make sure that Nodes can actually support the limit you set.
The limit applies to the entire cluster, so it affects all Nodes.
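For illustration only, a kube-scheduler static Pod manifest could set the variable like this (the image tag and the chosen limit are placeholders, not recommendations):
apiVersion: v1
kind: Pod
metadata:
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
  - name: kube-scheduler
    image: k8s.gcr.io/kube-scheduler:v1.23.0   # placeholder tag
    command:
    - kube-scheduler
    env:
    - name: KUBE_MAX_PD_VOLS   # raise or lower the per-node volume limit
      value: "40"              # placeholder value; check provider limits first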
Dynamic volume limits
Kubernetes v1.17 [stable]
Dynamic volume limits are supported for the following volume types.
- Amazon EBS
- Google Persistent Disk
- Azure Disk
- CSI
For volumes managed by in-tree volume plugins, Kubernetes automatically determines the Node type and enforces the appropriate maximum number of volumes for the node. For example:
- On Google Compute Engine, up to 127 volumes can be attached to a node, depending on the node type.
- For Amazon EBS disks on M5, C5, R5, T3 and Z1D instance types, Kubernetes allows only 25 volumes to be attached to a Node. For other instance types on Amazon Elastic Compute Cloud (EC2), Kubernetes allows 39 volumes to be attached to a Node.
- On Azure, up to 64 disks can be attached to a node, depending on the node type. For more details, refer to Sizes for virtual machines in Azure.
- If a CSI storage driver advertises a maximum number of volumes for a Node (using NodeGetInfo), the kube-scheduler honors that limit. Refer to the CSI specifications for details.
- For volumes managed by in-tree plugins that have been migrated to a CSI driver, the maximum number of volumes will be the one reported by the CSI driver.
3.6.12 - Volume Health Monitoring
Kubernetes v1.21 [alpha]
CSI volume health monitoring allows CSI Drivers to detect abnormal volume conditions from the underlying storage systems and report them as events on PVCs or Pods.
Volume health monitoring
Kubernetes volume health monitoring is part of how Kubernetes implements the Container Storage Interface (CSI). The volume health monitoring feature is implemented in two components: an External Health Monitor controller, and the kubelet.
If a CSI Driver supports the Volume Health Monitoring feature from the controller side, an event will be reported on the related PersistentVolumeClaim (PVC) when an abnormal volume condition is detected on a CSI volume.
The External Health Monitor controller also watches for node failure events. You can enable node failure monitoring by setting the enable-node-watcher flag to true. When the external health monitor detects a node failure event, the controller reports an Event on the PVC to indicate that pods using this PVC are on a failed node.
If a CSI Driver supports the Volume Health Monitoring feature from the node side, an Event will be reported on every Pod using the PVC when an abnormal volume condition is detected on a CSI volume.
You need to enable the CSIVolumeHealth feature gate to use this feature from the node side.
What's next
See the CSI driver documentation to find out which CSI drivers have implemented this feature.
3.7 - Configuration
3.7.1 - Configuration Best Practices
This document highlights and consolidates configuration best practices that are introduced throughout the user guide, Getting Started documentation, and examples.
This is a living document. If you think of something that is not on this list but might be useful to others, please don't hesitate to file an issue or submit a PR.
General Configuration Tips
- When defining configurations, specify the latest stable API version.
- Configuration files should be stored in version control before being pushed to the cluster. This allows you to quickly roll back a configuration change if necessary. It also aids cluster re-creation and restoration.
- Write your configuration files using YAML rather than JSON. Though these formats can be used interchangeably in almost all scenarios, YAML tends to be more user-friendly.
- Group related objects into a single file whenever it makes sense. One file is often easier to manage than several. See the guestbook-all-in-one.yaml file as an example of this syntax.
- Note also that many kubectl commands can be called on a directory. For example, you can call kubectl apply on a directory of config files.
- Don't specify default values unnecessarily: simple, minimal configuration will make errors less likely.
- Put object descriptions in annotations, to allow better introspection.
"Naked" Pods versus ReplicaSets, Deployments, and Jobs
- Don't use naked Pods (that is, Pods not bound to a ReplicaSet or Deployment) if you can avoid it. Naked Pods will not be rescheduled in the event of a node failure.
A Deployment, which both creates a ReplicaSet to ensure that the desired number of Pods is always available, and specifies a strategy to replace Pods (such as RollingUpdate), is almost always preferable to creating Pods directly, except for some explicit restartPolicy: Never scenarios; see the sketch below. A Job may also be appropriate.
Services
- Create a Service before its corresponding backend workloads (Deployments or ReplicaSets), and before any workloads that need to access it. When Kubernetes starts a container, it provides environment variables pointing to all the Services which were running when the container was started. For example, if a Service named foo exists, all containers will get the following variables in their initial environment:
  FOO_SERVICE_HOST=<the host the Service is running on>
  FOO_SERVICE_PORT=<the port the Service is running on>
This does imply an ordering requirement - any Service that a Pod wants to access must be created before the Pod itself, or else the environment variables will not be populated. DNS does not have this restriction.
An optional (though strongly recommended) cluster add-on is a DNS server. The DNS server watches the Kubernetes API for new
Services
and creates a set of DNS records for each. If DNS has been enabled throughout the cluster then allPods
should be able to do name resolution ofServices
automatically. -
Don't specify a
hostPort
for a Pod unless it is absolutely necessary. When you bind a Pod to ahostPort
, it limits the number of places the Pod can be scheduled, because each <hostIP
,hostPort
,protocol
> combination must be unique. If you don't specify thehostIP
andprotocol
explicitly, Kubernetes will use0.0.0.0
as the defaulthostIP
andTCP
as the defaultprotocol
.If you only need access to the port for debugging purposes, you can use the apiserver proxy or
kubectl port-forward
.If you explicitly need to expose a Pod's port on the node, consider using a NodePort Service before resorting to
hostPort
. -
Avoid using
hostNetwork
, for the same reasons ashostPort
. -
Use headless Services (which have a
ClusterIP
ofNone
) for service discovery when you don't needkube-proxy
load balancing.
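For instance, a minimal headless Service might look like the following sketch (the name, selector, and port are hypothetical):
apiVersion: v1
kind: Service
metadata:
  name: my-headless-service
spec:
  clusterIP: None   # "headless": no virtual IP, no kube-proxy load balancing
  selector:
    app: myapp
  ports:
  - port: 6379
Clients that resolve this Service's DNS name receive the individual Pod IPs rather than a single cluster IP.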
Using Labels
- Define and use labels that identify semantic attributes of your application or Deployment, such as { app: myapp, tier: frontend, phase: test, deployment: v3 }. You can use these labels to select the appropriate Pods for other resources; for example, a Service that selects all tier: frontend Pods, or all phase: test components of app: myapp (see the Service sketch after this list). See the guestbook app for examples of this approach.
A Service can be made to span multiple Deployments by omitting release-specific labels from its selector. When you need to update a running service without downtime, use a Deployment.
A Deployment describes the desired state of an object; when changes to that spec are applied, the Deployment controller changes the actual state to the desired state at a controlled rate.
- Use the Kubernetes common labels for common use cases. These standardized labels enrich the metadata in a way that allows tools, including kubectl and dashboard, to work in an interoperable way.
- You can manipulate labels for debugging. Because Kubernetes controllers (such as ReplicaSet) and Services match to Pods using selector labels, removing the relevant labels from a Pod will stop it from being considered by a controller or from being served traffic by a Service. If you remove the labels of an existing Pod, its controller will create a new Pod to take its place. This is a useful way to debug a previously "live" Pod in a "quarantine" environment. To interactively remove or add labels, use kubectl label.
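As a sketch of selecting Pods by semantic labels (all names here are hypothetical), a Service that routes traffic to every tier: frontend Pod of app: myapp:
apiVersion: v1
kind: Service
metadata:
  name: frontend
spec:
  selector:
    app: myapp
    tier: frontend   # deliberately omits release-specific labels such as deployment: v3
  ports:
  - port: 80
Because the selector omits release-specific labels, the same Service spans old and new Deployments during a rollout.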
Using kubectl
- Use kubectl apply -f <directory>. This looks for Kubernetes configuration in all .yaml, .yml, and .json files in <directory> and passes it to apply.
- Use label selectors for get and delete operations instead of specific object names. See the sections on label selectors and using labels effectively.
- Use kubectl create deployment and kubectl expose to quickly create single-container Deployments and Services. See Use a Service to Access an Application in a Cluster for an example.
3.7.2 - ConfigMaps
A ConfigMap is an API object used to store non-confidential data in key-value pairs. Pods can consume ConfigMaps as environment variables, command-line arguments, or as configuration files in a volume.
A ConfigMap allows you to decouple environment-specific configuration from your container images, so that your applications are easily portable.
Motivation
Use a ConfigMap for setting configuration data separately from application code.
For example, imagine that you are developing an application that you can run on your own computer (for development) and in the cloud (to handle real traffic). You write the code to look in an environment variable named DATABASE_HOST. Locally, you set that variable to localhost. In the cloud, you set it to refer to a Kubernetes Service that exposes the database component to your cluster. This lets you fetch a container image running in the cloud and debug the exact same code locally if needed.
A ConfigMap is not designed to hold large chunks of data. The data stored in a ConfigMap cannot exceed 1 MiB. If you need to store settings that are larger than this limit, you may want to consider mounting a volume or using a separate database or file service.
ConfigMap object
A ConfigMap is an API object that lets you store configuration for other objects to use. Unlike most Kubernetes objects that have a spec, a ConfigMap has data and binaryData fields. These fields accept key-value pairs as their values. Both the data and the binaryData fields are optional. The data field is designed to contain UTF-8 strings while the binaryData field is designed to contain binary data as base64-encoded strings.
The name of a ConfigMap must be a valid DNS subdomain name.
Each key under the data or the binaryData field must consist of alphanumeric characters, -, _ or .. The keys stored in data must not overlap with the keys in the binaryData field.
Starting from v1.19, you can add an immutable field to a ConfigMap definition to create an immutable ConfigMap.
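To illustrate the two fields (the name, keys, and payload below are hypothetical), a ConfigMap can mix UTF-8 and binary entries as long as the keys don't overlap:
apiVersion: v1
kind: ConfigMap
metadata:
  name: demo-mixed
data:
  greeting: "hello"        # plain UTF-8 string
binaryData:
  blob: aGVsbG8=           # arbitrary bytes, base64-encoded ("hello" here, for illustration)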
ConfigMaps and Pods
You can write a Pod spec
that refers to a ConfigMap and configures the container(s)
in that Pod based on the data in the ConfigMap. The Pod and the ConfigMap must be in
the same namespace.
The spec of a static Pod cannot refer to a ConfigMap or any other API objects.
Here's an example ConfigMap that has some keys with single values, and other keys where the value looks like a fragment of a configuration format.
apiVersion: v1
kind: ConfigMap
metadata:
name: game-demo
data:
# property-like keys; each key maps to a simple value
player_initial_lives: "3"
ui_properties_file_name: "user-interface.properties"
# file-like keys
game.properties: |
enemy.types=aliens,monsters
player.maximum-lives=5
user-interface.properties: |
color.good=purple
color.bad=yellow
allow.textmode=true
There are four different ways that you can use a ConfigMap to configure a container inside a Pod:
- Inside a container command and args
- Environment variables for a container
- Add a file in a read-only volume, for the application to read
- Write code to run inside the Pod that uses the Kubernetes API to read a ConfigMap
These different methods lend themselves to different ways of modeling the data being consumed. For the first three methods, the kubelet uses the data from the ConfigMap when it launches container(s) for a Pod.
The fourth method means you have to write code to read the ConfigMap and its data. However, because you're using the Kubernetes API directly, your application can subscribe to get updates whenever the ConfigMap changes, and react when that happens. By accessing the Kubernetes API directly, this technique also lets you access a ConfigMap in a different namespace.
Here's an example Pod that uses values from game-demo to configure its containers:
apiVersion: v1
kind: Pod
metadata:
name: configmap-demo-pod
spec:
containers:
- name: demo
image: alpine
command: ["sleep", "3600"]
env:
# Define the environment variable
- name: PLAYER_INITIAL_LIVES # Notice that the case is different here
# from the key name in the ConfigMap.
valueFrom:
configMapKeyRef:
name: game-demo # The ConfigMap this value comes from.
key: player_initial_lives # The key to fetch.
- name: UI_PROPERTIES_FILE_NAME
valueFrom:
configMapKeyRef:
name: game-demo
key: ui_properties_file_name
volumeMounts:
- name: config
mountPath: "/config"
readOnly: true
volumes:
# You set volumes at the Pod level, then mount them into containers inside that Pod
- name: config
configMap:
# Provide the name of the ConfigMap you want to mount.
name: game-demo
# An array of keys from the ConfigMap to create as files
items:
- key: "game.properties"
path: "game.properties"
- key: "user-interface.properties"
path: "user-interface.properties"
A ConfigMap doesn't differentiate between single line property values and multi-line file-like values. What matters is how Pods and other objects consume those values.
For this example, defining a volume and mounting it inside the demo container as /config creates two files, /config/game.properties and /config/user-interface.properties, even though there are four keys in the ConfigMap. This is because the Pod definition specifies an items array in the volumes section. If you omit the items array entirely, every key in the ConfigMap becomes a file with the same name as the key, and you get 4 files.
Using ConfigMaps
ConfigMaps can be mounted as data volumes. ConfigMaps can also be used by other parts of the system, without being directly exposed to the Pod. For example, ConfigMaps can hold data that other parts of the system should use for configuration.
The most common way to use ConfigMaps is to configure settings for containers running in a Pod in the same namespace. You can also use a ConfigMap separately.
For example, you might encounter addons or operators that adjust their behavior based on a ConfigMap.
Using ConfigMaps as files from a Pod
To consume a ConfigMap in a volume in a Pod:
- Create a ConfigMap or use an existing one. Multiple Pods can reference the same ConfigMap.
- Modify your Pod definition to add a volume under .spec.volumes[]. Name the volume anything, and have a .spec.volumes[].configMap.name field set to reference your ConfigMap object.
- Add a .spec.containers[].volumeMounts[] to each container that needs the ConfigMap. Specify .spec.containers[].volumeMounts[].readOnly = true and .spec.containers[].volumeMounts[].mountPath to an unused directory name where you would like the ConfigMap to appear.
- Modify your image or command line so that the program looks for files in that directory. Each key in the ConfigMap data map becomes the filename under mountPath.
This is an example of a Pod that mounts a ConfigMap in a volume:
apiVersion: v1
kind: Pod
metadata:
name: mypod
spec:
containers:
- name: mypod
image: redis
volumeMounts:
- name: foo
mountPath: "/etc/foo"
readOnly: true
volumes:
- name: foo
configMap:
name: myconfigmap
Each ConfigMap you want to use needs to be referred to in .spec.volumes.
If there are multiple containers in the Pod, then each container needs its own volumeMounts block, but only one .spec.volumes entry is needed per ConfigMap.
Mounted ConfigMaps are updated automatically
When a ConfigMap currently consumed in a volume is updated, projected keys are eventually updated as well.
The kubelet checks whether the mounted ConfigMap is fresh on every periodic sync. However, the kubelet uses its local cache for getting the current value of the ConfigMap. The type of the cache is configurable using the ConfigMapAndSecretChangeDetectionStrategy field in the KubeletConfiguration struct. A ConfigMap can be either propagated by watch (default), ttl-based, or by redirecting all requests directly to the API server. As a result, the total delay from the moment when the ConfigMap is updated to the moment when new keys are projected to the Pod can be as long as the kubelet sync period + cache propagation delay, where the cache propagation delay depends on the chosen cache type (it equals the watch propagation delay, the TTL of the cache, or zero, respectively).
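For reference, the cache type is chosen in the kubelet's configuration file. This is a minimal sketch showing only the relevant field; everything else keeps its defaults:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Watch (default), Cache (TTL-based), or Get (query the API server directly)
configMapAndSecretChangeDetectionStrategy: Watch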
ConfigMaps consumed as environment variables are not updated automatically and require a pod restart.
Immutable ConfigMaps
Kubernetes v1.21 [stable]
The Kubernetes feature Immutable Secrets and ConfigMaps provides an option to set individual Secrets and ConfigMaps as immutable. For clusters that extensively use ConfigMaps (at least tens of thousands of unique ConfigMap to Pod mounts), preventing changes to their data has the following advantages:
- protects you from accidental (or unwanted) updates that could cause applications outages
- improves performance of your cluster by significantly reducing load on kube-apiserver, by closing watches for ConfigMaps marked as immutable.
This feature is controlled by the ImmutableEphemeralVolumes feature gate. You can create an immutable ConfigMap by setting the immutable field to true. For example:
apiVersion: v1
kind: ConfigMap
metadata:
...
data:
...
immutable: true
Once a ConfigMap is marked as immutable, it is not possible to revert this change nor to mutate the contents of the data or the binaryData field. You can only delete and recreate the ConfigMap. Because existing Pods maintain a mount point to the deleted ConfigMap, it is recommended to recreate these Pods.
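Putting it together, a complete minimal immutable ConfigMap might look like this sketch (the name and data are hypothetical):
apiVersion: v1
kind: ConfigMap
metadata:
  name: demo-immutable
data:
  log_level: "info"
immutable: true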
What's next
- Read about Secrets.
- Read Configure a Pod to Use a ConfigMap.
- Read The Twelve-Factor App to understand the motivation for separating code from configuration.
3.7.3 - Secrets
A Secret is an object that contains a small amount of sensitive data such as a password, a token, or a key. Such information might otherwise be put in a Pod specification or in a container image. Using a Secret means that you don't need to include confidential data in your application code.
Because Secrets can be created independently of the Pods that use them, there is less risk of the Secret (and its data) being exposed during the workflow of creating, viewing, and editing Pods. Kubernetes, and applications that run in your cluster, can also take additional precautions with Secrets, such as avoiding writing secret data to nonvolatile storage.
Secrets are similar to ConfigMaps but are specifically intended to hold confidential data.
Kubernetes Secrets are, by default, stored unencrypted in the API server's underlying data store (etcd). Anyone with API access can retrieve or modify a Secret, and so can anyone with access to etcd. Additionally, anyone who is authorized to create a Pod in a namespace can use that access to read any Secret in that namespace; this includes indirect access such as the ability to create a Deployment.
In order to safely use Secrets, take at least the following steps:
- Enable Encryption at Rest for Secrets.
- Enable or configure RBAC rules that restrict reading and writing the Secret. Be aware that secrets can be obtained implicitly by anyone with the permission to create a Pod.
- Where appropriate, also use mechanisms such as RBAC to limit which principals are allowed to create new Secrets or replace existing ones.
See Information security for Secrets for more details.
Uses for Secrets
There are three main ways for a Pod to use a Secret:
- As files in a volume mounted on one or more of its containers.
- As container environment variables.
- By the kubelet when pulling images for the Pod.
The Kubernetes control plane also uses Secrets; for example, bootstrap token Secrets are a mechanism to help automate node registration.
Alternatives to Secrets
Rather than using a Secret to protect confidential data, you can pick from alternatives.
Here are some of your options:
- if your cloud-native component needs to authenticate to another application that you know is running within the same Kubernetes cluster, you can use a ServiceAccount and its tokens to identify your client.
- there are third-party tools that you can run, either within or outside your cluster, that provide secrets management. For example, a service that Pods access over HTTPS, that reveals a secret if the client correctly authenticates (for example, with a ServiceAccount token).
- for authentication, you can implement a custom signer for X.509 certificates, and use CertificateSigningRequests to let that custom signer issue certificates to Pods that need them.
- you can use a device plugin to expose node-local encryption hardware to a specific Pod. For example, you can schedule trusted Pods onto nodes that provide a Trusted Platform Module, configured out-of-band.
You can also combine two or more of those options, including the option to use Secret objects themselves.
For example: implement (or deploy) an operator that fetches short-lived session tokens from an external service, and then creates Secrets based on those short-lived session tokens. Pods running in your cluster can make use of the session tokens, and the operator ensures they are valid. This separation means that you can run Pods that are unaware of the exact mechanisms for issuing and refreshing those session tokens.
Working with Secrets
Creating a Secret
There are several options to create a Secret:
- Use kubectl
- Use a configuration file
- Use the Kustomize tool
Constraints on Secret names and data
The name of a Secret object must be a valid DNS subdomain name.
You can specify the data and/or the stringData field when creating a configuration file for a Secret. The data and the stringData fields are optional. The values for all keys in the data field have to be base64-encoded strings. If the conversion to base64 string is not desirable, you can choose to specify the stringData field instead, which accepts arbitrary strings as values.
The keys of data and stringData must consist of alphanumeric characters, -, _ or .. All key-value pairs in the stringData field are internally merged into the data field. If a key appears in both the data and the stringData field, the value specified in the stringData field takes precedence.
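The following sketch (with a hypothetical name and key) shows both fields on one Secret; because username appears in both, the stringData value wins when the API server merges them:
apiVersion: v1
kind: Secret
metadata:
  name: demo-secret
type: Opaque
data:
  username: YWRtaW4=        # base64 for "admin"
stringData:
  username: administrator   # plain text; takes precedence over the data entry above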
Size limit
Individual secrets are limited to 1 MiB in size. This is to discourage creation of very large secrets that could exhaust the API server and kubelet memory. However, creation of many smaller secrets could also exhaust memory. You can use a resource quota to limit the number of Secrets (or other resources) in a namespace.
Editing a Secret
You can edit an existing Secret using kubectl:
kubectl edit secrets mysecret
This opens your default editor and allows you to update the base64 encoded Secret values in the data field; for example:
# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file, it will be
# reopened with the relevant failures.
#
apiVersion: v1
data:
username: YWRtaW4=
password: MWYyZDFlMmU2N2Rm
kind: Secret
metadata:
annotations:
kubectl.kubernetes.io/last-applied-configuration: { ... }
creationTimestamp: 2020-01-22T18:41:56Z
name: mysecret
namespace: default
resourceVersion: "164619"
uid: cfee02d6-c137-11e5-8d73-42010af00002
type: Opaque
That example manifest defines a Secret with two keys in the data field: username and password. The values are Base64 strings in the manifest; however, when you use the Secret with a Pod then the kubelet provides the decoded data to the Pod and its containers.
You can package many keys and values into one Secret, or use many Secrets, whichever is convenient.
Using a Secret
Secrets can be mounted as data volumes or exposed as environment variables to be used by a container in a Pod. Secrets can also be used by other parts of the system, without being directly exposed to the Pod. For example, Secrets can hold credentials that other parts of the system should use to interact with external systems on your behalf.
Secret volume sources are validated to ensure that the specified object reference actually points to an object of type Secret. Therefore, a Secret needs to be created before any Pods that depend on it.
If the Secret cannot be fetched (perhaps because it does not exist, or due to a temporary lack of connection to the API server) the kubelet periodically retries running that Pod. The kubelet also reports an Event for that Pod, including details of the problem fetching the Secret.
Optional Secrets
When you define a container environment variable based on a Secret, you can mark it as optional. The default is for the Secret to be required.
None of a Pod's containers will start until all non-optional Secrets are available.
If a Pod references a specific key in a Secret and that Secret does exist, but is missing the named key, the Pod fails during startup.
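For example, this hypothetical Pod starts even if the referenced Secret or key is missing, because the reference is marked optional:
apiVersion: v1
kind: Pod
metadata:
  name: optional-secret-demo
spec:
  containers:
  - name: app
    image: redis
    env:
    - name: OPTIONAL_TOKEN
      valueFrom:
        secretKeyRef:
          name: maybe-secret   # hypothetical Secret name
          key: token
          optional: true       # Pod starts even if the Secret or key is absent
  restartPolicy: Never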
Using Secrets as files from a Pod
If you want to access data from a Secret in a Pod, one way to do that is to have Kubernetes make the value of that Secret be available as a file inside the filesystem of one or more of the Pod's containers.
To configure that, you:
- Create a secret or use an existing one. Multiple Pods can reference the same secret.
- Modify your Pod definition to add a volume under .spec.volumes[]. Name the volume anything, and have a .spec.volumes[].secret.secretName field equal to the name of the Secret object.
- Add a .spec.containers[].volumeMounts[] to each container that needs the secret. Specify .spec.containers[].volumeMounts[].readOnly = true and .spec.containers[].volumeMounts[].mountPath to an unused directory name where you would like the secrets to appear.
- Modify your image or command line so that the program looks for files in that directory. Each key in the secret data map becomes the filename under mountPath.
This is an example of a Pod that mounts a Secret named mysecret in a volume:
apiVersion: v1
kind: Pod
metadata:
name: mypod
spec:
containers:
- name: mypod
image: redis
volumeMounts:
- name: foo
mountPath: "/etc/foo"
readOnly: true
volumes:
- name: foo
secret:
secretName: mysecret
optional: false # default setting; "mysecret" must exist
Each Secret you want to use needs to be referred to in .spec.volumes.
If there are multiple containers in the Pod, then each container needs its own volumeMounts block, but only one .spec.volumes entry is needed per Secret.
Versions of Kubernetes before v1.22 automatically created credentials for accessing the Kubernetes API. This older mechanism was based on creating token Secrets that could then be mounted into running Pods. In more recent versions, including Kubernetes v1.23, API credentials are obtained directly by using the TokenRequest API, and are mounted into Pods using a projected volume. The tokens obtained using this method have bounded lifetimes, and are automatically invalidated when the Pod they are mounted into is deleted.
You can still manually create a service account token Secret; for example, if you need a token that never expires. However, using the TokenRequest subresource to obtain a token to access the API is recommended instead.
Projection of Secret keys to specific paths
You can also control the paths within the volume where Secret keys are projected. You can use the .spec.volumes[].secret.items field to change the target path of each key:
apiVersion: v1
kind: Pod
metadata:
name: mypod
spec:
containers:
- name: mypod
image: redis
volumeMounts:
- name: foo
mountPath: "/etc/foo"
readOnly: true
volumes:
- name: foo
secret:
secretName: mysecret
items:
- key: username
path: my-group/my-username
What will happen:
- the username key from mysecret is available to the container at the path /etc/foo/my-group/my-username instead of at /etc/foo/username.
- the password key from that Secret object is not projected.
If .spec.volumes[].secret.items is used, only keys specified in items are projected. To consume all keys from the Secret, all of them must be listed in the items field.
If you list keys explicitly, then all listed keys must exist in the corresponding Secret. Otherwise, the volume is not created.
Secret files permissions
You can set the POSIX file access permission bits for a single Secret key.
If you don't specify any permissions, 0644 is used by default.
You can also set a default mode for the entire Secret volume and override per key if needed.
For example, you can specify a default mode like this:
apiVersion: v1
kind: Pod
metadata:
name: mypod
spec:
containers:
- name: mypod
image: redis
volumeMounts:
- name: foo
mountPath: "/etc/foo"
volumes:
- name: foo
secret:
secretName: mysecret
defaultMode: 0400
The secret is mounted on /etc/foo; all the files created by the secret volume mount have permission 0400.
If you're defining a Pod or a Pod template using JSON, note that JSON doesn't support octal notation, so use the decimal value for defaultMode instead (for example, 0400 in octal is 256 in decimal). If you're writing YAML, you can write the defaultMode in octal.
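To override the mode for a single key while keeping a stricter default for the rest, you can combine defaultMode with a per-item mode, as in this sketch:
apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  containers:
  - name: mypod
    image: redis
    volumeMounts:
    - name: foo
      mountPath: "/etc/foo"
      readOnly: true
  volumes:
  - name: foo
    secret:
      secretName: mysecret
      defaultMode: 0400    # applies to every projected key...
      items:
      - key: username
        path: username
        mode: 0444         # ...except this one, which overrides the default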
Consuming Secret values from volumes
Inside the container that mounts a secret volume, the secret keys appear as files. The secret values are base64 decoded and stored inside these files.
This is the result of commands executed inside the container from the example above:
ls /etc/foo/
The output is similar to:
username
password
cat /etc/foo/username
The output is similar to:
admin
cat /etc/foo/password
The output is similar to:
1f2d1e2e67df
The program in a container is responsible for reading the secret data from these files, as needed.
Mounted Secrets are updated automatically
When a volume contains data from a Secret, and that Secret is updated, Kubernetes tracks this and updates the data in the volume, using an eventually-consistent approach.
The kubelet keeps a cache of the current keys and values for the Secrets that are used in
volumes for pods on that node.
You can configure the way that the kubelet detects changes from the cached values. The configMapAndSecretChangeDetectionStrategy field in the kubelet configuration controls which strategy the kubelet uses. The default strategy is Watch.
Updates to Secrets can be either propagated by an API watch mechanism (the default), based on a cache with a defined time-to-live, or polled from the cluster API server on each kubelet synchronisation loop.
As a result, the total delay from the moment when the Secret is updated to the moment when new keys are projected to the Pod can be as long as the kubelet sync period + cache propagation delay, where the cache propagation delay depends on the chosen cache type (following the same order listed in the previous paragraph, these are: watch propagation delay, the configured cache TTL, or zero for direct polling).
Using Secrets as environment variables
To use a Secret in an environment variable in a Pod:
- Create a Secret (or use an existing one). Multiple Pods can reference the same Secret.
- Modify your Pod definition, in each container where you want to consume a secret key, to add an environment variable for that key. The environment variable that consumes the secret key should populate the secret's name and key in env[].valueFrom.secretKeyRef.
- Modify your image and/or command line so that the program looks for values in the specified environment variables.
This is an example of a Pod that uses a Secret via environment variables:
apiVersion: v1
kind: Pod
metadata:
name: secret-env-pod
spec:
containers:
- name: mycontainer
image: redis
env:
- name: SECRET_USERNAME
valueFrom:
secretKeyRef:
name: mysecret
key: username
optional: false # same as default; "mysecret" must exist
# and include a key named "username"
- name: SECRET_PASSWORD
valueFrom:
secretKeyRef:
name: mysecret
key: password
optional: false # same as default; "mysecret" must exist
# and include a key named "password"
restartPolicy: Never
Invalid environment variables
Secrets used to populate environment variables via the envFrom field that have keys considered invalid environment variable names will have those keys skipped; the Pod is still allowed to start. When this happens, Kubernetes records an event with the reason InvalidEnvironmentVariableNames and a message that lists the skipped invalid keys. The following example shows a Pod that refers to a Secret named mysecret, where mysecret contains 2 invalid keys: 1badkey and 2alsobad.
kubectl get events
The output is similar to:
LASTSEEN FIRSTSEEN COUNT NAME KIND SUBOBJECT TYPE REASON
0s 0s 1 dapi-test-pod Pod Warning InvalidEnvironmentVariableNames kubelet, 127.0.0.1 Keys [1badkey, 2alsobad] from the EnvFrom secret default/mysecret were skipped since they are considered invalid environment variable names.
Consuming Secret values from environment variables
Inside a container that consumes a Secret using environment variables, the secret keys appear as normal environment variables. The values of those variables are the base64 decoded values of the secret data.
This is the result of commands executed inside the container from the example above:
echo "$SECRET_USERNAME"
The output is similar to:
admin
echo "$SECRET_PASSWORD"
The output is similar to:
1f2d1e2e67df
Container image pull secrets
If you want to fetch container images from a private repository, you need a way for the kubelet on each node to authenticate to that repository. You can configure image pull secrets to make this possible. These secrets are configured at the Pod level.
The imagePullSecrets field for a Pod is a list of references to Secrets in the same namespace as the Pod. You can use imagePullSecrets to pass image registry access credentials to the kubelet. The kubelet uses this information to pull a private image on behalf of your Pod. See PodSpec in the Pod API reference for more information about the imagePullSecrets field.
Using imagePullSecrets
You can use imagePullSecrets to pass a Secret that contains a Docker (or other) image registry password to the kubelet, which uses it to pull a private image on behalf of your Pod.
Manually specifying an imagePullSecret
You can learn how to specify imagePullSecrets from the container images documentation.
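As a brief sketch (the image is hypothetical; the Secret name reuses secret-tiger-docker from the example later in this page), a Pod references the registry Secret like this:
apiVersion: v1
kind: Pod
metadata:
  name: private-image-pod
spec:
  containers:
  - name: app
    image: my-registry.example:5000/app:v1   # image in the private registry
  imagePullSecrets:
  - name: secret-tiger-docker                # Secret of type kubernetes.io/dockerconfigjson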
Arranging for imagePullSecrets to be automatically attached
You can manually create imagePullSecrets, and reference these from a ServiceAccount. Any Pods created with that ServiceAccount, or that get that ServiceAccount by default, will get their imagePullSecrets field set to that of the ServiceAccount. See Add ImagePullSecrets to a service account for a detailed explanation of that process.
Using Secrets with static Pods
You cannot use ConfigMaps or Secrets with static Pods.
Use cases
Use case: As container environment variables
Define a Secret manifest (mysecret.yaml):
apiVersion: v1
kind: Secret
metadata:
name: mysecret
type: Opaque
data:
USER_NAME: YWRtaW4=
PASSWORD: MWYyZDFlMmU2N2Rm
Create the Secret:
kubectl apply -f mysecret.yaml
Use envFrom to define all of the Secret's data as container environment variables. The key from the Secret becomes the environment variable name in the Pod.
apiVersion: v1
kind: Pod
metadata:
name: secret-test-pod
spec:
containers:
- name: test-container
image: k8s.gcr.io/busybox
command: [ "/bin/sh", "-c", "env" ]
envFrom:
- secretRef:
name: mysecret
restartPolicy: Never
Use case: Pod with SSH keys
Create a Secret containing some SSH keys:
kubectl create secret generic ssh-key-secret --from-file=ssh-privatekey=/path/to/.ssh/id_rsa --from-file=ssh-publickey=/path/to/.ssh/id_rsa.pub
The output is similar to:
secret "ssh-key-secret" created
You can also create a kustomization.yaml with a secretGenerator field containing SSH keys.
Think carefully before sending your own SSH keys: other users of the cluster may have access to the Secret.
You could instead create an SSH private key representing a service identity that you want to be accessible to all the users with whom you share the Kubernetes cluster, and that you can revoke if the credentials are compromised.
Now you can create a Pod which references the secret with the SSH key and consumes it in a volume:
apiVersion: v1
kind: Pod
metadata:
name: secret-test-pod
labels:
name: secret-test
spec:
volumes:
- name: secret-volume
secret:
secretName: ssh-key-secret
containers:
- name: ssh-test-container
image: mySshImage
volumeMounts:
- name: secret-volume
readOnly: true
mountPath: "/etc/secret-volume"
When the container's command runs, the pieces of the key will be available in:
/etc/secret-volume/ssh-publickey
/etc/secret-volume/ssh-privatekey
The container is then free to use the secret data to establish an SSH connection.
Use case: Pods with prod / test credentials
This example illustrates a Pod which consumes a secret containing production credentials and another Pod which consumes a secret with test environment credentials.
You can create a kustomization.yaml with a secretGenerator field or run kubectl create secret.
kubectl create secret generic prod-db-secret --from-literal=username=produser --from-literal=password=Y4nys7f11
The output is similar to:
secret "prod-db-secret" created
You can also create a secret for test environment credentials.
kubectl create secret generic test-db-secret --from-literal=username=testuser --from-literal=password=iluvtests
The output is similar to:
secret "test-db-secret" created
Special characters such as $, \, *, =, and ! will be interpreted by your shell and require escaping. In most shells, the easiest way to escape the password is to surround it with single quotes ('). For example, if your actual password is S!B\*d$zDsb=, you should execute the command this way:
kubectl create secret generic dev-db-secret --from-literal=username=devuser --from-literal=password='S!B\*d$zDsb='
You do not need to escape special characters in passwords from files (--from-file).
Now make the Pods:
cat <<EOF > pod.yaml
apiVersion: v1
kind: List
items:
- kind: Pod
apiVersion: v1
metadata:
name: prod-db-client-pod
labels:
name: prod-db-client
spec:
volumes:
- name: secret-volume
secret:
secretName: prod-db-secret
containers:
- name: db-client-container
image: myClientImage
volumeMounts:
- name: secret-volume
readOnly: true
mountPath: "/etc/secret-volume"
- kind: Pod
apiVersion: v1
metadata:
name: test-db-client-pod
labels:
name: test-db-client
spec:
volumes:
- name: secret-volume
secret:
secretName: test-db-secret
containers:
- name: db-client-container
image: myClientImage
volumeMounts:
- name: secret-volume
readOnly: true
mountPath: "/etc/secret-volume"
EOF
Add the pods to the same kustomization.yaml:
cat <<EOF >> kustomization.yaml
resources:
- pod.yaml
EOF
Apply all those objects on the API server by running:
kubectl apply -k .
Both containers will have the following files present on their filesystems with the values for each container's environment:
/etc/secret-volume/username
/etc/secret-volume/password
Note how the specs for the two Pods differ only in one field; this facilitates creating Pods with different capabilities from a common Pod template.
You could further simplify the base Pod specification by using two service accounts:
- prod-user with the prod-db-secret
- test-user with the test-db-secret
The Pod specification is shortened to:
apiVersion: v1
kind: Pod
metadata:
name: prod-db-client-pod
labels:
name: prod-db-client
spec:
serviceAccount: prod-user
containers:
- name: db-client-container
image: myClientImage
Use case: dotfiles in a secret volume
You can make your data "hidden" by defining a key that begins with a dot. This key represents a dotfile or "hidden" file. For example, when the following secret is mounted into a volume, secret-volume:
apiVersion: v1
kind: Secret
metadata:
name: dotfile-secret
data:
.secret-file: dmFsdWUtMg0KDQo=
---
apiVersion: v1
kind: Pod
metadata:
name: secret-dotfiles-pod
spec:
volumes:
- name: secret-volume
secret:
secretName: dotfile-secret
containers:
- name: dotfile-test-container
image: k8s.gcr.io/busybox
command:
- ls
- "-l"
- "/etc/secret-volume"
volumeMounts:
- name: secret-volume
readOnly: true
mountPath: "/etc/secret-volume"
The volume will contain a single file, called .secret-file, and the dotfile-test-container will have this file present at the path /etc/secret-volume/.secret-file.
Files beginning with dot characters are hidden from the output of ls -l; you must use ls -la to see them when listing directory contents.
Use case: Secret visible to one container in a Pod
Consider a program that needs to handle HTTP requests, do some complex business logic, and then sign some messages with an HMAC. Because it has complex application logic, there might be an unnoticed remote file reading exploit in the server, which could expose the private key to an attacker.
This could be divided into two processes in two containers: a frontend container which handles user interaction and business logic, but which cannot see the private key; and a signer container that can see the private key, and responds to simple signing requests from the frontend (for example, over localhost networking).
With this partitioned approach, an attacker now has to trick the application server into doing something rather arbitrary, which may be harder than getting it to read a file.
Types of Secret
When creating a Secret, you can specify its type using the type field of the Secret resource, or certain equivalent kubectl command line flags (if available). The Secret type is used to facilitate programmatic handling of the Secret data.
The Secret type is used to facilitate programmatic handling of the Secret data.
Kubernetes provides several built-in types for some common usage scenarios. These types vary in terms of the validations performed and the constraints Kubernetes imposes on them.
Built-in Type | Usage |
---|---|
Opaque | arbitrary user-defined data |
kubernetes.io/service-account-token | ServiceAccount token |
kubernetes.io/dockercfg | serialized ~/.dockercfg file |
kubernetes.io/dockerconfigjson | serialized ~/.docker/config.json file |
kubernetes.io/basic-auth | credentials for basic authentication |
kubernetes.io/ssh-auth | credentials for SSH authentication |
kubernetes.io/tls | data for a TLS client or server |
bootstrap.kubernetes.io/token | bootstrap token data |
You can define and use your own Secret type by assigning a non-empty string as the type value for a Secret object (an empty string is treated as an Opaque type).
Kubernetes doesn't impose any constraints on the type name. However, if you are using one of the built-in types, you must meet all the requirements defined for that type.
If you are defining a type of secret that's for public use, follow the convention and structure the secret type to have your domain name before the name, separated by a /. For example: cloud-hosting.example.net/cloud-api-credentials.
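Following that convention, a custom-typed Secret might look like this sketch (the key and value are hypothetical; Kubernetes applies no validation to custom types):
apiVersion: v1
kind: Secret
metadata:
  name: cloud-api-credentials
type: cloud-hosting.example.net/cloud-api-credentials
stringData:
  api-key: not-a-real-key   # illustrative only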
Opaque secrets
Opaque is the default Secret type if omitted from a Secret configuration file. When you create a Secret using kubectl, you will use the generic subcommand to indicate an Opaque Secret type. For example, the following command creates an empty Secret of type Opaque:
kubectl create secret generic empty-secret
kubectl get secret empty-secret
The output looks like:
NAME TYPE DATA AGE
empty-secret Opaque 0 2m6s
The DATA column shows the number of data items stored in the Secret. In this case, 0 means you have created an empty Secret.
Service account token Secrets
A kubernetes.io/service-account-token type of Secret is used to store a token that identifies a service account. When using this Secret type, you need to ensure that the kubernetes.io/service-account.name annotation is set to an existing service account name. A Kubernetes controller fills in some other fields such as the kubernetes.io/service-account.uid annotation, and the token key in the data field, which is set to contain an authentication token.
The following example configuration declares a service account token Secret:
apiVersion: v1
kind: Secret
metadata:
name: secret-sa-sample
annotations:
kubernetes.io/service-account.name: "sa-name"
type: kubernetes.io/service-account-token
data:
# You can include additional key value pairs as you do with Opaque Secrets
extra: YmFyCg==
When creating a Pod, Kubernetes automatically finds or creates a service account Secret and then automatically modifies your Pod to use this Secret. The service account token Secret contains credentials for accessing the Kubernetes API.
The automatic creation and use of API credentials can be disabled or overridden if desired. However, if all you need to do is securely access the API server, this is the recommended workflow.
See the ServiceAccount documentation for more information on how service accounts work. You can also check the automountServiceAccountToken field and the serviceAccountName field of the Pod for information on referencing service accounts from Pods.
Docker config Secrets
You can use one of the following type values to create a Secret to store the credentials for accessing a container image registry:
- kubernetes.io/dockercfg
- kubernetes.io/dockerconfigjson
The kubernetes.io/dockercfg type is reserved to store a serialized ~/.dockercfg, which is the legacy format for configuring the Docker command line. When using this Secret type, you have to ensure the Secret data field contains a .dockercfg key whose value is the content of a ~/.dockercfg file encoded in the base64 format.
The kubernetes.io/dockerconfigjson type is designed for storing serialized JSON that follows the same format rules as the ~/.docker/config.json file, which is the newer format replacing ~/.dockercfg. When using this Secret type, the data field of the Secret object must contain a .dockerconfigjson key, in which the content for the ~/.docker/config.json file is provided as a base64 encoded string.
Below is an example for a kubernetes.io/dockercfg type of Secret:
apiVersion: v1
kind: Secret
metadata:
name: secret-dockercfg
type: kubernetes.io/dockercfg
data:
.dockercfg: |
"<base64 encoded ~/.dockercfg file>"
If you do not want to perform the base64 encoding, you can choose to use the stringData field instead.
When you create these types of Secrets using a manifest, the API server checks whether the expected key exists in the data field, and it verifies if the value provided can be parsed as valid JSON. The API server doesn't validate whether the JSON actually is a Docker config file.
When you do not have a Docker config file, or you want to use kubectl to create a Secret for accessing a container registry, you can do:
kubectl create secret docker-registry secret-tiger-docker \
--docker-email=tiger@acme.example \
--docker-username=tiger \
--docker-password=pass1234 \
--docker-server=my-registry.example:5000
That command creates a Secret of type kubernetes.io/dockerconfigjson. If you dump the .data.dockerconfigjson field from that new Secret and then decode it from base64:
kubectl get secret secret-tiger-docker -o jsonpath='{.data.*}' | base64 -d
then the output is equivalent to this JSON document (which is also a valid Docker configuration file):
{
"auths": {
"my-registry.example:5000": {
"username": "tiger",
"password": "pass1234",
"email": "tiger@acme.example",
"auth": "dGlnZXI6cGFzczEyMzQ="
}
}
}
The auth value there is base64 encoded; it is obscured but not secret. Anyone who can read that Secret can learn the registry access bearer token.
Basic authentication Secret
The kubernetes.io/basic-auth type is provided for storing credentials needed for basic authentication. When using this Secret type, the data field of the Secret must contain one of the following two keys:
- username: the user name for authentication
- password: the password or token for authentication
Both values for the above two keys are base64 encoded strings. You can, of course, provide the clear text content using the stringData for Secret creation.
The following manifest is an example of a basic authentication Secret:
apiVersion: v1
kind: Secret
metadata:
name: secret-basic-auth
type: kubernetes.io/basic-auth
stringData:
username: admin # required field for kubernetes.io/basic-auth
password: t0p-Secret # required field for kubernetes.io/basic-auth
The basic authentication Secret type is provided only for convenience. You can create an Opaque type for credentials used for basic authentication. However, using the defined and public Secret type (kubernetes.io/basic-auth) helps other people to understand the purpose of your Secret, and sets a convention for what key names to expect. The Kubernetes API verifies that the required keys are set for a Secret of this type.
SSH authentication secrets
The builtin type kubernetes.io/ssh-auth is provided for storing data used in SSH authentication. When using this Secret type, you will have to specify a ssh-privatekey key-value pair in the data (or stringData) field as the SSH credential to use.
The following manifest is an example of a Secret used for SSH public/private key authentication:
apiVersion: v1
kind: Secret
metadata:
name: secret-ssh-auth
type: kubernetes.io/ssh-auth
data:
# the data is abbreviated in this example
ssh-privatekey: |
MIIEpQIBAAKCAQEAulqb/Y ...
The SSH authentication Secret type is provided only for convenience. You could instead create an Opaque type Secret for credentials used for SSH authentication. However, using the defined and public Secret type (kubernetes.io/ssh-auth) helps other people to understand the purpose of your Secret, and sets a convention for what key names to expect; the API server does verify if the required keys are provided in a Secret configuration.
SSH private keys do not establish trusted communication between an SSH client and host server on their own. A secondary means of establishing trust is needed to mitigate "man in the middle" attacks, such as a known_hosts file added to a ConfigMap.
TLS secrets
Kubernetes provides a builtin Secret type kubernetes.io/tls for storing a certificate and its associated key that are typically used for TLS. One common use for TLS secrets is to configure encryption in transit for an Ingress, but you can also use it with other resources or directly in your workload. When using this type of Secret, the tls.key and the tls.crt key must be provided in the data (or stringData) field of the Secret configuration, although the API server doesn't actually validate the values for each key.
The following YAML contains an example config for a TLS Secret:
apiVersion: v1
kind: Secret
metadata:
name: secret-tls
type: kubernetes.io/tls
data:
# the data is abbreviated in this example
tls.crt: |
MIIC2DCCAcCgAwIBAgIBATANBgkqh ...
tls.key: |
MIIEpgIBAAKCAQEA7yn3bRHQ5FHMQ ...
The TLS Secret type is provided for convenience. You can create an Opaque type for credentials used for a TLS server and/or client. However, using the builtin Secret type helps ensure the consistency of Secret format in your project; the API server does verify if the required keys are provided in a Secret configuration.
When creating a TLS Secret using kubectl, you can use the tls subcommand as shown in the following example:
kubectl create secret tls my-tls-secret \
--cert=path/to/cert/file \
--key=path/to/key/file
The public/private key pair must exist beforehand. The public key certificate for --cert must be DER format as per Section 5.1 of RFC 7468, and must match the given private key for --key (PKCS #8 in DER format; Section 11 of RFC 7468).
A kubernetes.io/tls Secret stores the Base64-encoded DER data for keys and certificates. If you're familiar with PEM format for private keys and for certificates, the base64 data are the same as that format except that you omit the initial and the last lines that are used in PEM.
For example, for a certificate, you do not include -----BEGIN CERTIFICATE----- and -----END CERTIFICATE-----.
Bootstrap token Secrets
A bootstrap token Secret can be created by explicitly specifying the Secret type to bootstrap.kubernetes.io/token. This type of Secret is designed for tokens used during the node bootstrap process. It stores tokens used to sign well-known ConfigMaps.
A bootstrap token Secret is usually created in the kube-system namespace and named in the form bootstrap-token-<token-id> where <token-id> is a 6 character string of the token ID.
As a Kubernetes manifest, a bootstrap token Secret might look like the following:
apiVersion: v1
kind: Secret
metadata:
name: bootstrap-token-5emitj
namespace: kube-system
type: bootstrap.kubernetes.io/token
data:
auth-extra-groups: c3lzdGVtOmJvb3RzdHJhcHBlcnM6a3ViZWFkbTpkZWZhdWx0LW5vZGUtdG9rZW4=
expiration: MjAyMC0wOS0xM1QwNDozOToxMFo=
token-id: NWVtaXRq
token-secret: a3E0Z2lodnN6emduMXAwcg==
usage-bootstrap-authentication: dHJ1ZQ==
usage-bootstrap-signing: dHJ1ZQ==
A bootstrap type Secret has the following keys specified under data:
- token-id: A random 6 character string as the token identifier. Required.
- token-secret: A random 16 character string as the actual token secret. Required.
- description: A human-readable string that describes what the token is used for. Optional.
- expiration: An absolute UTC time using RFC3339 specifying when the token should be expired. Optional.
- usage-bootstrap-<usage>: A boolean flag indicating additional usage for the bootstrap token.
- auth-extra-groups: A comma-separated list of group names that will be authenticated as in addition to the system:bootstrappers group.
The above YAML may look confusing because the values are all in base64 encoded strings. In fact, you can create an identical Secret using the following YAML:
apiVersion: v1
kind: Secret
metadata:
# Note how the Secret is named
name: bootstrap-token-5emitj
# A bootstrap token Secret usually resides in the kube-system namespace
namespace: kube-system
type: bootstrap.kubernetes.io/token
stringData:
auth-extra-groups: "system:bootstrappers:kubeadm:default-node-token"
expiration: "2020-09-13T04:39:10Z"
# This token ID is used in the name
token-id: "5emitj"
token-secret: "kq4gihvszzgn1p0r"
# This token can be used for authentication
usage-bootstrap-authentication: "true"
# and it can be used for signing
usage-bootstrap-signing: "true"
Immutable Secrets
Kubernetes v1.21 [stable]
Kubernetes lets you mark specific Secrets (and ConfigMaps) as immutable. Preventing changes to the data of an existing Secret has the following benefits:
- protects you from accidental (or unwanted) updates that could cause applications outages
- (for clusters that extensively use Secrets - at least tens of thousands of unique Secret to Pod mounts), switching to immutable Secrets improves the performance of your cluster by significantly reducing load on kube-apiserver. The kubelet does not need to maintain a watch on any Secrets that are marked as immutable.
Marking a Secret as immutable
You can create an immutable Secret by setting the immutable field to true. For example:
apiVersion: v1
kind: Secret
metadata:
...
data:
...
immutable: true
You can also update any existing mutable Secret to make it immutable.
Once a Secret is marked as immutable, it is not possible to revert this change nor to mutate the contents of the data field. You can only delete and recreate the Secret. Existing Pods maintain a mount point to the deleted Secret - it is recommended to recreate these Pods.
Information security for Secrets
Although ConfigMap and Secret work similarly, Kubernetes applies some additional protection for Secret objects.
Secrets often hold values that span a spectrum of importance, many of which can cause escalations within Kubernetes (e.g. service account tokens) and to external systems. Even if an individual app can reason about the power of the Secrets it expects to interact with, other apps within the same namespace can render those assumptions invalid.
A Secret is only sent to a node if a Pod on that node requires it.
For mounting secrets into Pods, the kubelet stores a copy of the data into a tmpfs
so that the confidential data is not written to durable storage.
Once the Pod that depends on the Secret is deleted, the kubelet deletes its local copy
of the confidential data from the Secret.
There may be several containers in a Pod. By default, containers you define only have access to the default ServiceAccount and its related Secret. You must explicitly define environment variables or map a volume into a container in order to provide access to any other Secret.
There may be Secrets for several Pods on the same node. However, only the Secrets that a Pod requests are potentially visible within its containers. Therefore, one Pod does not have access to the Secrets of another Pod.
Security recommendations for developers
- Applications still need to protect the value of confidential information after reading it from an environment variable or volume. For example, your application must avoid logging the secret data in the clear or transmitting it to an untrusted party.
- If you are defining multiple containers in a Pod, and only one of those containers needs access to a Secret, define the volume mount or environment variable configuration so that the other containers do not have access to that Secret.
- If you configure a Secret through a manifest, with the secret data encoded as base64, sharing this file or checking it in to a source repository means the secret is available to everyone who can read the manifest. Base64 encoding is not an encryption method, it provides no additional confidentiality over plain text.
- When deploying applications that interact with the Secret API, you should limit access using authorization policies such as RBAC.
- In the Kubernetes API, watch and list requests for Secrets within a namespace are extremely powerful capabilities. Avoid granting this access where feasible, since listing Secrets allows the clients to inspect the values of every Secret in that namespace.
Security recommendations for cluster administrators
- Reserve the ability to watch or list all secrets in a cluster (using the Kubernetes API), so that only the most privileged, system-level components can perform this action.
- When deploying applications that interact with the Secret API, you should limit access using authorization policies such as RBAC.
- In the API server, objects (including Secrets) are persisted into etcd; therefore:
  - only allow cluster administrators to access etcd (this includes read-only access);
  - enable encryption at rest for Secret objects, so that the data of these Secrets are not stored in the clear into etcd;
  - consider wiping / shredding the durable storage used by etcd once it is no longer in use;
  - if there are multiple etcd instances, make sure that etcd is using SSL/TLS for communication between etcd peers.
What's next
- Learn how to manage Secrets using kubectl
- Learn how to manage Secrets using config file
- Learn how to manage Secrets using kustomize
- Read the API reference for Secret
3.7.4 - Resource Management for Pods and Containers
When you specify a Pod, you can optionally specify how much of each resource a container needs. The most common resources to specify are CPU and memory (RAM); there are others.
When you specify the resource request for containers in a Pod, the kube-scheduler uses this information to decide which node to place the Pod on. When you specify a resource limit for a container, the kubelet enforces those limits so that the running container is not allowed to use more of that resource than the limit you set. The kubelet also reserves at least the request amount of that system resource specifically for that container to use.
Requests and limits
If the node where a Pod is running has enough of a resource available, it's possible (and allowed) for a container to use more of that resource than its request for that resource specifies. However, a container is not allowed to use more than its resource limit.
For example, if you set a memory request of 256 MiB for a container, and that container is in a Pod scheduled to a Node with 8GiB of memory and no other Pods, then the container can try to use more RAM. If you set a memory limit of 4GiB for that container, the kubelet (and container runtime) enforce the limit.
The runtime prevents the container from using more than the configured resource limit. For example:
when a process in the container tries to consume more than the allowed amount of memory,
the system kernel terminates the process that attempted the allocation, with an out of memory
(OOM) error.
Limits can be implemented either reactively (the system intervenes once it sees a violation) or by enforcement (the system prevents the container from ever exceeding the limit). Different runtimes can have different ways to implement the same restrictions.
Resource types
CPU and memory are each a resource type. A resource type has a base unit. CPU represents compute processing and is specified in units of Kubernetes CPUs. Memory is specified in units of bytes. For Linux workloads, you can specify huge page resources. Huge pages are a Linux-specific feature where the node kernel allocates blocks of memory that are much larger than the default page size.
For example, on a system where the default page size is 4KiB, you could specify a limit, hugepages-2Mi: 80Mi. If the container tries allocating over 40 2MiB huge pages (a total of 80 MiB), that allocation fails.

Note: You cannot overcommit hugepages-* resources. This is different from the memory and cpu resources.
CPU and memory are collectively referred to as compute resources, or resources. Compute resources are measurable quantities that can be requested, allocated, and consumed. They are distinct from API resources. API resources, such as Pods and Services, are objects that can be read and modified through the Kubernetes API server.
Resource requests and limits of Pod and container
For each container, you can specify resource limits and requests, including the following:
spec.containers[].resources.limits.cpu
spec.containers[].resources.limits.memory
spec.containers[].resources.limits.hugepages-<size>
spec.containers[].resources.requests.cpu
spec.containers[].resources.requests.memory
spec.containers[].resources.requests.hugepages-<size>
Although you can only specify requests and limits for individual containers, it is also useful to think about the overall resource requests and limits for a Pod. For a particular resource, a Pod resource request/limit is the sum of the resource requests/limits of that type for each container in the Pod.
Resource units in Kubernetes
CPU resource units
Limits and requests for CPU resources are measured in cpu units. In Kubernetes, 1 CPU unit is equivalent to 1 physical CPU core, or 1 virtual core, depending on whether the node is a physical host or a virtual machine running inside a physical machine.
Fractional requests are allowed. When you define a container with spec.containers[].resources.requests.cpu set to 0.5, you are requesting half as much CPU time compared to if you asked for 1.0 CPU.

For CPU resource units, the quantity expression 0.1 is equivalent to the expression 100m, which can be read as "one hundred millicpu". Some people say "one hundred millicores", and this is understood to mean the same thing.
CPU resource is always specified as an absolute amount of resource, never as a relative amount. For example, 500m CPU represents roughly the same amount of computing power whether that container runs on a single-core, dual-core, or 48-core machine.

Note: Kubernetes doesn't allow you to specify CPU resources with a precision finer than 1m. Because of this, it's useful to specify CPU units less than 1.0 or 1000m using the milliCPU form; for example, 5m rather than 0.005.
Memory resource units
Limits and requests for memory are measured in bytes. You can express memory as a plain integer or as a fixed-point number using one of these quantity suffixes: E, P, T, G, M, k. You can also use the power-of-two equivalents: Ei, Pi, Ti, Gi, Mi, Ki. For example, the following represent roughly the same value:

128974848, 129e6, 129M, 128974848000m, 123Mi
Pay attention to the case of the suffixes. If you request 400m of memory, this is a request for 0.4 bytes. Someone who types that probably meant to ask for 400 mebibytes (400Mi) or 400 megabytes (400M).
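A minimal sketch of the intended form inside a container's resources stanza:

resources:
  requests:
    memory: "400Mi"    # 400 mebibytes
    # memory: "400m"   # would request 0.4 bytes: almost certainly a typo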
Container resources example
The following Pod has two containers. Both containers are defined with a request for 0.25 CPU and 64MiB (2^26 bytes) of memory. Each container has a limit of 0.5 CPU and 128MiB of memory. You can say the Pod has a request of 0.5 CPU and 128 MiB of memory, and a limit of 1 CPU and 256MiB of memory.
---
apiVersion: v1
kind: Pod
metadata:
name: frontend
spec:
containers:
- name: app
image: images.my-company.example/app:v4
resources:
requests:
memory: "64Mi"
cpu: "250m"
limits:
memory: "128Mi"
cpu: "500m"
- name: log-aggregator
image: images.my-company.example/log-aggregator:v6
resources:
requests:
memory: "64Mi"
cpu: "250m"
limits:
memory: "128Mi"
cpu: "500m"
How Pods with resource requests are scheduled
When you create a Pod, the Kubernetes scheduler selects a node for the Pod to run on. Each node has a maximum capacity for each of the resource types: the amount of CPU and memory it can provide for Pods. The scheduler ensures that, for each resource type, the sum of the resource requests of the scheduled containers is less than the capacity of the node. Note that although actual memory or CPU resource usage on nodes is very low, the scheduler still refuses to place a Pod on a node if the capacity check fails. This protects against a resource shortage on a node when resource usage later increases, for example, during a daily peak in request rate.
How Kubernetes applies resource requests and limits
When the kubelet starts a container as part of a Pod, the kubelet passes that container's requests and limits for memory and CPU to the container runtime.
On Linux, the container runtime typically configures kernel cgroups that apply and enforce the limits you defined.
- The CPU limit defines a hard ceiling on how much CPU time the container can use. During each scheduling interval (time slice), the Linux kernel checks to see if this limit is exceeded; if so, the kernel waits before allowing that cgroup to resume execution.
- The CPU request typically defines a weighting. If several different containers (cgroups) want to run on a contended system, workloads with larger CPU requests are allocated more CPU time than workloads with small requests.
- The memory request is mainly used during (Kubernetes) Pod scheduling. On a node that uses cgroups v2, the container runtime might use the memory request as a hint to set memory.min and memory.low.
- The memory limit defines a memory limit for that cgroup. If the container tries to allocate more memory than this limit, the Linux kernel out-of-memory subsystem activates and, typically, intervenes by stopping one of the processes in the container that tried to allocate memory. If that process is the container's PID 1, and the container is marked as restartable, Kubernetes restarts the container. (A sketch of inspecting these cgroup values follows this list.)
- The memory limit for the Pod or container can also apply to pages in memory backed volumes, such as an emptyDir. The kubelet tracks tmpfs emptyDir volumes as container memory use, rather than as local ephemeral storage.
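As a rough sketch of how those limits surface at the kernel level, assuming the node uses cgroup v2 and the runtime gives containers their own cgroup namespace (paths and values can vary by runtime and cgroup driver), you can read the cgroup files from inside a container of the frontend Pod defined earlier:

kubectl exec frontend -c app -- cat /sys/fs/cgroup/memory.max
# 134217728  (the 128Mi limit, in bytes)
kubectl exec frontend -c app -- cat /sys/fs/cgroup/cpu.max
# 50000 100000  (500m: 50ms of CPU time per 100ms period)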
If a container exceeds its memory request and the node that it runs on becomes short of memory overall, it is likely that the Pod the container belongs to will be evicted.
A container might or might not be allowed to exceed its CPU limit for extended periods of time. However, container runtimes don't terminate Pods or containers for excessive CPU usage.
To determine whether a container cannot be scheduled or is being killed due to resource limits, see the Troubleshooting section.
Monitoring compute & memory resource usage
The kubelet reports the resource usage of a Pod as part of the Pod status.
If optional tools for monitoring are available in your cluster, then Pod resource usage can be retrieved either from the Metrics API directly or from your monitoring tools.
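For example, if the Metrics Server add-on is installed, kubectl top can show per-Pod usage (the Pod name is illustrative):

kubectl top pod frontend
# prints NAME, CPU(cores) and MEMORY(bytes) columns for the Pod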
Local ephemeral storage
Kubernetes v1.10 [beta]
Nodes have local ephemeral storage, backed by locally-attached writeable devices or, sometimes, by RAM. "Ephemeral" means that there is no long-term guarantee about durability.
Pods use ephemeral local storage for scratch space, caching, and for logs.
The kubelet can provide scratch space to Pods using local ephemeral storage to mount emptyDir volumes into containers.
The kubelet also uses this kind of storage to hold node-level container logs, container images, and the writable layers of running containers.
As a beta feature, Kubernetes lets you track, reserve and limit the amount of ephemeral local storage a Pod can consume.
Configurations for local ephemeral storage
Kubernetes supports two ways to configure local ephemeral storage on a node:

Single filesystem: In this configuration, you place all different kinds of ephemeral local data (emptyDir volumes, writeable layers, container images, logs) into one filesystem. The most effective way to configure the kubelet is to dedicate this filesystem to Kubernetes (kubelet) data.
The kubelet also writes node-level container logs and treats these similarly to ephemeral local storage.
The kubelet writes logs to files inside its configured log directory (/var/log by default), and has a base directory for other locally stored data (/var/lib/kubelet by default). Typically, both /var/lib/kubelet and /var/log are on the system root filesystem, and the kubelet is designed with that layout in mind.
Your node can have as many other filesystems, not used for Kubernetes, as you like.
Two filesystems: You have a filesystem on the node that you're using for ephemeral data that comes from running Pods: logs, and emptyDir volumes. You can use this filesystem for other data (for example: system logs not related to Kubernetes); it can even be the root filesystem.
The kubelet also writes node-level container logs into the first filesystem, and treats these similarly to ephemeral local storage.
You also use a separate filesystem, backed by a different logical storage device. In this configuration, the directory where you tell the kubelet to place container image layers and writeable layers is on this second filesystem.
The first filesystem does not hold any image layers or writeable layers.
Your node can have as many other filesystems, not used for Kubernetes, as you like.
The kubelet can measure how much local storage it is using. It does this provided that:
- the LocalStorageCapacityIsolation feature gate is enabled (the feature is on by default), and
- you have set up the node using one of the supported configurations for local ephemeral storage.
If you have a different configuration, then the kubelet does not apply resource limits for ephemeral local storage.
Note: The kubelet tracks tmpfs emptyDir volumes as container memory use, rather than as local ephemeral storage.
Setting requests and limits for local ephemeral storage
You can specify ephemeral-storage for managing local ephemeral storage. Each container of a Pod can specify either or both of the following:
spec.containers[].resources.limits.ephemeral-storage
spec.containers[].resources.requests.ephemeral-storage
Limits and requests for ephemeral-storage are measured in byte quantities. You can express storage as a plain integer or as a fixed-point number using one of these suffixes: E, P, T, G, M, k. You can also use the power-of-two equivalents: Ei, Pi, Ti, Gi, Mi, Ki. For example, the following quantities all represent roughly the same value:
128974848
129e6
129M
123Mi
In the following example, the Pod has two containers. Each container has a request of 2GiB of local ephemeral storage. Each container has a limit of 4GiB of local ephemeral storage. Therefore, the Pod has a request of 4GiB of local ephemeral storage, and a limit of 8GiB of local ephemeral storage.
apiVersion: v1
kind: Pod
metadata:
name: frontend
spec:
containers:
- name: app
image: images.my-company.example/app:v4
resources:
requests:
ephemeral-storage: "2Gi"
limits:
ephemeral-storage: "4Gi"
volumeMounts:
- name: ephemeral
mountPath: "/tmp"
- name: log-aggregator
image: images.my-company.example/log-aggregator:v6
resources:
requests:
ephemeral-storage: "2Gi"
limits:
ephemeral-storage: "4Gi"
volumeMounts:
- name: ephemeral
mountPath: "/tmp"
volumes:
- name: ephemeral
emptyDir: {}
How Pods with ephemeral-storage requests are scheduled
When you create a Pod, the Kubernetes scheduler selects a node for the Pod to run on. Each node has a maximum amount of local ephemeral storage it can provide for Pods. For more information, see Node Allocatable.
The scheduler ensures that the sum of the resource requests of the scheduled containers is less than the capacity of the node.
Ephemeral storage consumption management
If the kubelet is managing local ephemeral storage as a resource, then the kubelet measures storage use in:
- emptyDir volumes, except tmpfs emptyDir volumes
- directories holding node-level logs
- writeable container layers
If a Pod is using more ephemeral storage than you allow it to, the kubelet sets an eviction signal that triggers Pod eviction.
For container-level isolation, if a container's writable layer and log usage exceeds its storage limit, the kubelet marks the Pod for eviction.
For pod-level isolation the kubelet works out an overall Pod storage limit by
summing the limits for the containers in that Pod. In this case, if the sum of
the local ephemeral storage usage from all containers and also the Pod's emptyDir
volumes exceeds the overall Pod storage limit, then the kubelet also marks the Pod
for eviction.
If the kubelet is not measuring local ephemeral storage, then a Pod that exceeds its local storage limit will not be evicted for breaching local storage resource limits.
However, if the filesystem space for writeable container layers, node-level logs,
or emptyDir
volumes falls low, the node
taints itself as short on local storage
and this taint triggers eviction for any Pods that don't specifically tolerate the taint.
See the supported configurations for ephemeral local storage.
The kubelet supports different ways to measure Pod storage use:
Periodic scanning: The kubelet performs regular, scheduled checks that scan each emptyDir volume, container log directory, and writeable container layer. The scan measures how much space is used.

In this mode, the kubelet does not track open file descriptors for deleted files. If you (or a container) create a file inside an emptyDir volume, something then opens that file, and you delete the file while it is still open, then the inode for the deleted file stays until you close that file, but the kubelet does not categorize the space as in use.
Kubernetes v1.15 [alpha]
Filesystem project quota: Project quotas are an operating-system level feature for managing storage use on filesystems. With Kubernetes, you can enable project quotas for monitoring storage use. Make sure that the filesystem backing the emptyDir volumes, on the node, provides project quota support. For example, XFS and ext4fs offer project quotas.
Kubernetes uses project IDs starting from 1048576. The IDs in use are registered in /etc/projects and /etc/projid. If project IDs in this range are used for other purposes on the system, those project IDs must be registered in /etc/projects and /etc/projid so that Kubernetes does not use them.
Quotas are faster and more accurate than directory scanning. When a directory is assigned to a project, all files created under a directory are created in that project, and the kernel merely has to keep track of how many blocks are in use by files in that project. If a file is created and deleted, but has an open file descriptor, it continues to consume space. Quota tracking records that space accurately whereas directory scans overlook the storage used by deleted files.
If you want to use project quotas, you should:
- Enable the LocalStorageCapacityIsolationFSQuotaMonitoring=true feature gate using the featureGates field in the kubelet configuration or the --feature-gates command line flag.
- Ensure that the root filesystem (or optional runtime filesystem) has project quotas enabled. All XFS filesystems support project quotas. For ext4 filesystems, you need to enable the project quota tracking feature while the filesystem is not mounted.

  # For ext4, with /dev/block-device not mounted
  sudo tune2fs -O project -Q prjquota /dev/block-device

- Ensure that the root filesystem (or optional runtime filesystem) is mounted with project quotas enabled. For both XFS and ext4fs, the mount option is named prjquota (see the illustrative fstab entry below).
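For example, a hypothetical /etc/fstab entry that mounts the kubelet's data filesystem with project quotas enabled (the device and mount point are illustrative):

# /etc/fstab
/dev/block-device  /var/lib/kubelet  ext4  defaults,prjquota  0 0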
Extended resources
Extended resources are fully-qualified resource names outside the kubernetes.io domain. They allow cluster operators to advertise and users to consume the non-Kubernetes-built-in resources.
There are two steps required to use Extended Resources. First, the cluster operator must advertise an Extended Resource. Second, users must request the Extended Resource in Pods.
Managing extended resources
Node-level extended resources
Node-level extended resources are tied to nodes.
Device plugin managed resources
See Device Plugin for how to advertise device plugin managed resources on each node.
Other resources
To advertise a new node-level extended resource, the cluster operator can submit a PATCH HTTP request to the API server to specify the available quantity in the status.capacity for a node in the cluster. After this operation, the node's status.capacity will include a new resource. The status.allocatable field is updated automatically with the new resource asynchronously by the kubelet.

Because the scheduler uses the node's status.allocatable value when evaluating Pod fitness, the scheduler only takes account of the new value after that asynchronous update. There may be a short delay between patching the node capacity with a new resource and the time when the first Pod that requests the resource can be scheduled on that node.
Example:
Here is an example showing how to use curl to form an HTTP request that advertises five "example.com/foo" resources on node k8s-node-1 whose master is k8s-master.
curl --header "Content-Type: application/json-patch+json" \
--request PATCH \
--data '[{"op": "add", "path": "/status/capacity/example.com~1foo", "value": "5"}]' \
http://k8s-master:8080/api/v1/nodes/k8s-node-1/status
Note: In the preceding request, ~1 is the encoding for the character / in the patch path. The operation path value in JSON-Patch is interpreted as a JSON-Pointer. For more details, see IETF RFC 6901, section 3.
Cluster-level extended resources
Cluster-level extended resources are not tied to nodes. They are usually managed by scheduler extenders, which handle the resource consumption and resource quota.
You can specify the extended resources that are handled by scheduler extenders in the scheduler configuration.
Example:
The following configuration for a scheduler policy indicates that the cluster-level extended resource "example.com/foo" is handled by the scheduler extender.
- The scheduler sends a Pod to the scheduler extender only if the Pod requests "example.com/foo".
- The ignoredByScheduler field specifies that the scheduler does not check the "example.com/foo" resource in its PodFitsResources predicate.
{
"kind": "Policy",
"apiVersion": "v1",
"extenders": [
{
"urlPrefix":"<extender-endpoint>",
"bindVerb": "bind",
"managedResources": [
{
"name": "example.com/foo",
"ignoredByScheduler": true
}
]
}
]
}
Consuming extended resources
Users can consume extended resources in Pod specs like CPU and memory. The scheduler takes care of the resource accounting so that no more than the available amount is simultaneously allocated to Pods.
The API server restricts quantities of extended resources to whole numbers. Examples of valid quantities are 3, 3000m and 3Ki. Examples of invalid quantities are 0.5 and 1500m.
Note: Extended resources replace Opaque Integer Resources. Users can use any domain name prefix other than kubernetes.io, which is reserved.
To consume an extended resource in a Pod, include the resource name as a key in the spec.containers[].resources.limits map in the container spec.

A Pod is scheduled only if all of the resource requests are satisfied, including CPU, memory and any extended resources. The Pod remains in the PENDING state as long as the resource request cannot be satisfied.
Example:
The Pod below requests 2 CPUs and 1 "example.com/foo" (an extended resource).
apiVersion: v1
kind: Pod
metadata:
name: my-pod
spec:
containers:
- name: my-container
image: myimage
resources:
requests:
cpu: 2
example.com/foo: 1
limits:
example.com/foo: 1
PID limiting
Process ID (PID) limits allow for the configuration of a kubelet to limit the number of PIDs that a given Pod can consume. See PID Limiting for information.
Troubleshooting
My Pods are pending with event message FailedScheduling
If the scheduler cannot find any node where a Pod can fit, the Pod remains unscheduled until a place can be found. An Event is produced each time the scheduler fails to find a place for the Pod. You can use kubectl to view the events for a Pod; for example:
kubectl describe pod frontend | grep -A 9999999999 Events
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 23s default-scheduler 0/42 nodes available: insufficient cpu
In the preceding example, the Pod named "frontend" fails to be scheduled due to insufficient CPU resource on any node. Similar error messages can also suggest failure due to insufficient memory (PodExceedsFreeMemory). In general, if a Pod is pending with a message of this type, there are several things to try:
- Add more nodes to the cluster.
- Terminate unneeded Pods to make room for pending Pods.
- Check that the Pod is not larger than all the nodes. For example, if all the nodes have a capacity of cpu: 1, then a Pod with a request of cpu: 1.1 will never be scheduled.
- Check for node taints. If most of your nodes are tainted, and the new Pod does not tolerate that taint, the scheduler only considers placements onto the remaining nodes that don't have that taint.
You can check node capacities and amounts allocated with the kubectl describe nodes command. For example:
kubectl describe nodes e2e-test-node-pool-4lw4
Name: e2e-test-node-pool-4lw4
[ ... lines removed for clarity ...]
Capacity:
cpu: 2
memory: 7679792Ki
pods: 110
Allocatable:
cpu: 1800m
memory: 7474992Ki
pods: 110
[ ... lines removed for clarity ...]
Non-terminated Pods: (5 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits
--------- ---- ------------ ---------- --------------- -------------
kube-system fluentd-gcp-v1.38-28bv1 100m (5%) 0 (0%) 200Mi (2%) 200Mi (2%)
kube-system kube-dns-3297075139-61lj3 260m (13%) 0 (0%) 100Mi (1%) 170Mi (2%)
kube-system kube-proxy-e2e-test-... 100m (5%) 0 (0%) 0 (0%) 0 (0%)
kube-system monitoring-influxdb-grafana-v4-z1m12 200m (10%) 200m (10%) 600Mi (8%) 600Mi (8%)
kube-system node-problem-detector-v0.1-fj7m3 20m (1%) 200m (10%) 20Mi (0%) 100Mi (1%)
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
CPU Requests CPU Limits Memory Requests Memory Limits
------------ ---------- --------------- -------------
680m (34%) 400m (20%) 920Mi (11%) 1070Mi (13%)
In the preceding output, you can see that if a Pod requests more than 1.120 CPUs or more than 6.23Gi of memory, that Pod will not fit on the node.
By looking at the “Pods” section, you can see which Pods are taking up space on the node.
The amount of resources available to Pods is less than the node capacity because system daemons use a portion of the available resources. Within the Kubernetes API, each Node has a .status.allocatable field (see NodeStatus for details).

The .status.allocatable field describes the amount of resources that are available to Pods on that node (for example: 15 virtual CPUs and 7538 MiB of memory). For more information on node allocatable resources in Kubernetes, see Reserve Compute Resources for System Daemons.
You can configure resource quotas to limit the total amount of resources that a namespace can consume. Kubernetes enforces quotas for objects in particular namespace when there is a ResourceQuota in that namespace. For example, if you assign specific namespaces to different teams, you can add ResourceQuotas into those namespaces. Setting resource quotas helps to prevent one team from using so much of any resource that this over-use affects other teams.
You should also consider what access you grant to that namespace: full write access to a namespace allows someone with that access to remove any resource, including a configured ResourceQuota.
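As a minimal sketch (the namespace and amounts are illustrative), a ResourceQuota that caps the total compute one team's namespace can request:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "10"       # total CPU requests allowed across the namespace
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi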
My container is terminated
Your container might get terminated because it is resource-starved. To check whether a container is being killed because it is hitting a resource limit, call kubectl describe pod on the Pod of interest:
kubectl describe pod simmemleak-hra99
The output is similar to:
Name: simmemleak-hra99
Namespace: default
Image(s): saadali/simmemleak
Node: kubernetes-node-tf0f/10.240.216.66
Labels: name=simmemleak
Status: Running
Reason:
Message:
IP: 10.244.2.75
Containers:
simmemleak:
Image: saadali/simmemleak:latest
Limits:
cpu: 100m
memory: 50Mi
State: Running
Started: Tue, 07 Jul 2019 12:54:41 -0700
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Fri, 07 Jul 2019 12:54:30 -0700
Finished: Fri, 07 Jul 2019 12:54:33 -0700
Ready: False
Restart Count: 5
Conditions:
Type Status
Ready False
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 42s default-scheduler Successfully assigned simmemleak-hra99 to kubernetes-node-tf0f
Normal Pulled 41s kubelet Container image "saadali/simmemleak:latest" already present on machine
Normal Created 41s kubelet Created container simmemleak
Normal Started 40s kubelet Started container simmemleak
Normal Killing 32s kubelet Killing container with id ead3fb35-5cf5-44ed-9ae1-488115be66c6: Need to kill Pod
In the preceding example, the Restart Count: 5 indicates that the simmemleak container in the Pod was terminated and restarted five times (so far). The OOMKilled reason shows that the container tried to use more memory than its limit.
Your next step might be to check the application code for a memory leak. If you find that the application is behaving how you expect, consider setting a higher memory limit (and possibly request) for that container.
What's next
- Get hands-on experience assigning Memory resources to containers and Pods.
- Get hands-on experience assigning CPU resources to containers and Pods.
- Read how the API reference defines a container and its resource requirements
- Read about project quotas in XFS
- Read more about the kube-scheduler configuration reference (v1beta3)
3.7.5 - Organizing Cluster Access Using kubeconfig Files
Use kubeconfig files to organize information about clusters, users, namespaces, and
authentication mechanisms. The kubectl
command-line tool uses kubeconfig files to
find the information it needs to choose a cluster and communicate with the API server
of a cluster.
Note: A file that is used to configure access to clusters is called a kubeconfig file. This is a generic way of referring to configuration files. It does not mean that there is a file named kubeconfig.
By default, kubectl looks for a file named config in the $HOME/.kube directory. You can specify other kubeconfig files by setting the KUBECONFIG environment variable or by setting the --kubeconfig flag.
For step-by-step instructions on creating and specifying kubeconfig files, see Configure Access to Multiple Clusters.
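For example (the extra file name is illustrative):

# Use an explicit kubeconfig file for a single invocation
kubectl --kubeconfig=$HOME/.kube/dev-config get pods

# Or make several files active for the current shell; kubectl merges them
export KUBECONFIG="$HOME/.kube/config:$HOME/.kube/dev-config"
kubectl config view    # shows the effective, merged configuration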
Supporting multiple clusters, users, and authentication mechanisms
Suppose you have several clusters, and your users and components authenticate in a variety of ways. For example:
- A running kubelet might authenticate using certificates.
- A user might authenticate using tokens.
- Administrators might have sets of certificates that they provide to individual users.
With kubeconfig files, you can organize your clusters, users, and namespaces. You can also define contexts to quickly and easily switch between clusters and namespaces.
Context
A context element in a kubeconfig file is used to group access parameters
under a convenient name. Each context has three parameters: cluster, namespace, and user.
By default, the kubectl
command-line tool uses parameters from
the current context to communicate with the cluster.
To choose the current context:
kubectl config use-context
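For example (the context and namespace names are illustrative):

kubectl config get-contexts                               # list available contexts
kubectl config use-context dev-cluster                    # switch the current context
kubectl config set-context --current --namespace=staging  # change only the namespace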
The KUBECONFIG environment variable
The KUBECONFIG environment variable holds a list of kubeconfig files. For Linux and Mac, the list is colon-delimited. For Windows, the list is semicolon-delimited. The KUBECONFIG environment variable is not required. If the KUBECONFIG environment variable doesn't exist, kubectl uses the default kubeconfig file, $HOME/.kube/config.

If the KUBECONFIG environment variable does exist, kubectl uses an effective configuration that is the result of merging the files listed in the KUBECONFIG environment variable.
Merging kubeconfig files
To see your configuration, enter this command:
kubectl config view
As described previously, the output might be from a single kubeconfig file, or it might be the result of merging several kubeconfig files.
Here are the rules that kubectl uses when it merges kubeconfig files:
- If the --kubeconfig flag is set, use only the specified file. Do not merge. Only one instance of this flag is allowed.

  Otherwise, if the KUBECONFIG environment variable is set, use it as a list of files that should be merged. Merge the files listed in the KUBECONFIG environment variable according to these rules:

  - Ignore empty filenames.
  - Produce errors for files with content that cannot be deserialized.
  - The first file to set a particular value or map key wins.
  - Never change the value or map key. Example: Preserve the context of the first file to set current-context. Example: If two files specify a red-user, use only values from the first file's red-user. Even if the second file has non-conflicting entries under red-user, discard them.

  For an example of setting the KUBECONFIG environment variable, see Setting the KUBECONFIG environment variable.

  Otherwise, use the default kubeconfig file, $HOME/.kube/config, with no merging.
- Determine the context to use based on the first hit in this chain:

  - Use the --context command-line flag if it exists.
  - Use the current-context from the merged kubeconfig files.

  An empty context is allowed at this point.
- Determine the cluster and user. At this point, there might or might not be a context. Determine the cluster and user based on the first hit in this chain, which is run twice: once for user and once for cluster:

  - Use a command-line flag if it exists: --user or --cluster.
  - If the context is non-empty, take the user or cluster from the context.

  The user and cluster can be empty at this point.
- Determine the actual cluster information to use. At this point, there might or might not be cluster information. Build each piece of the cluster information based on this chain; the first hit wins:

  - Use command line flags if they exist: --server, --certificate-authority, --insecure-skip-tls-verify.
  - If any cluster information attributes exist from the merged kubeconfig files, use them.
  - If there is no server location, fail.
- Determine the actual user information to use. Build user information using the same rules as cluster information, except allow only one authentication technique per user:

  - Use command line flags if they exist: --client-certificate, --client-key, --username, --password, --token.
  - Use the user fields from the merged kubeconfig files.
  - If there are two conflicting techniques, fail.
- For any information still missing, use default values and potentially prompt for authentication information.
File references
File and path references in a kubeconfig file are relative to the location of the kubeconfig file.
File references on the command line are relative to the current working directory.
In $HOME/.kube/config, relative paths are stored relatively, and absolute paths are stored absolutely.
Proxy
You can configure kubectl to use a proxy by setting proxy-url in the kubeconfig file, like:
apiVersion: v1
kind: Config
proxy-url: https://proxy.host:3128
clusters:
- cluster:
name: development
users:
- name: developer
contexts:
- context:
name: development
What's next
3.8 - Security
3.8.1 - Overview of Cloud Native Security
This overview defines a model for thinking about Kubernetes security in the context of Cloud Native security.
The 4C's of Cloud Native security
You can think about security in layers. The 4C's of Cloud Native security are Cloud, Clusters, Containers, and Code.
Each layer of the Cloud Native security model builds upon the next outermost layer. The Code layer benefits from strong base (Cloud, Cluster, Container) security layers. You cannot safeguard against poor security standards in the base layers by addressing security at the Code level.
Cloud
In many ways, the Cloud (or co-located servers, or the corporate datacenter) is the trusted computing base of a Kubernetes cluster. If the Cloud layer is vulnerable (or configured in a vulnerable way) then there is no guarantee that the components built on top of this base are secure. Each cloud provider makes security recommendations for running workloads securely in their environment.
Cloud provider security
If you are running a Kubernetes cluster on your own hardware or a different cloud provider, consult your documentation for security best practices. Here are links to some of the popular cloud providers' security documentation:
IaaS Provider | Link |
---|---|
Alibaba Cloud | https://www.alibabacloud.com/trust-center |
Amazon Web Services | https://aws.amazon.com/security/ |
Google Cloud Platform | https://cloud.google.com/security/ |
IBM Cloud | https://www.ibm.com/cloud/security |
Microsoft Azure | https://docs.microsoft.com/en-us/azure/security/azure-security |
Oracle Cloud Infrastructure | https://www.oracle.com/security/ |
VMWare VSphere | https://www.vmware.com/security/hardening-guides.html |
Infrastructure security
Suggestions for securing your infrastructure in a Kubernetes cluster:
Area of Concern for Kubernetes Infrastructure | Recommendation |
---|---|
Network access to API Server (Control plane) | Access to the Kubernetes control plane should not be allowed publicly on the internet, and should be controlled by network access control lists restricted to the set of IP addresses needed to administer the cluster. |
Network access to Nodes (nodes) | Nodes should be configured to only accept connections (via network access control lists) from the control plane on the specified ports, and accept connections for services in Kubernetes of type NodePort and LoadBalancer. If possible, these nodes should not be exposed on the public internet at all. |
Kubernetes access to Cloud Provider API | Each cloud provider needs to grant a different set of permissions to the Kubernetes control plane and nodes. It is best to provide the cluster with cloud provider access that follows the principle of least privilege for the resources it needs to administer. The Kops documentation provides information about IAM policies and roles. |
Access to etcd | Access to etcd (the datastore of Kubernetes) should be limited to the control plane only. Depending on your configuration, you should attempt to use etcd over TLS. More information can be found in the etcd documentation. |
etcd Encryption | Wherever possible it's a good practice to encrypt all storage at rest, and since etcd holds the state of the entire cluster (including Secrets) its disk should especially be encrypted at rest. |
Cluster
There are two areas of concern for securing Kubernetes:
- Securing the cluster components that are configurable
- Securing the applications which run in the cluster
Components of the Cluster
If you want to protect your cluster from accidental or malicious access and adopt good information practices, read and follow the advice about securing your cluster.
Components in the cluster (your application)
Depending on the attack surface of your application, you may want to focus on specific aspects of security. For example: If you are running a service (Service A) that is critical in a chain of other resources and a separate workload (Service B) which is vulnerable to a resource exhaustion attack, then the risk of compromising Service A is high if you do not limit the resources of Service B. The following table lists areas of security concerns and recommendations for securing workloads running in Kubernetes:
Area of Concern for Workload Security | Recommendation |
---|---|
RBAC Authorization (Access to the Kubernetes API) | https://kubernetes.io/docs/reference/access-authn-authz/rbac/ |
Authentication | https://kubernetes.io/docs/concepts/security/controlling-access/ |
Application secrets management (and encrypting them in etcd at rest) | https://kubernetes.io/docs/concepts/configuration/secret/ https://kubernetes.io/docs/tasks/administer-cluster/encrypt-data/ |
Ensuring that pods meet defined Pod Security Standards | https://kubernetes.io/docs/concepts/security/pod-security-standards/#policy-instantiation |
Quality of Service (and Cluster resource management) | https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/ |
Network Policies | https://kubernetes.io/docs/concepts/services-networking/network-policies/ |
TLS for Kubernetes Ingress | https://kubernetes.io/docs/concepts/services-networking/ingress/#tls |
Container
Container security is outside the scope of this guide. Here are general recommendations and links to explore this topic:
Area of Concern for Containers | Recommendation |
---|---|
Container Vulnerability Scanning and OS Dependency Security | As part of an image build step, you should scan your containers for known vulnerabilities. |
Image Signing and Enforcement | Sign container images to maintain a system of trust for the content of your containers. |
Disallow privileged users | When constructing containers, consult your documentation for how to create users inside of the containers that have the least level of operating system privilege necessary in order to carry out the goal of the container. |
Use container runtime with stronger isolation | Select container runtime classes that provide stronger isolation |
Code
Application code is one of the primary attack surfaces over which you have the most control. While securing application code is outside of the Kubernetes security topic, here are recommendations to protect application code:
Code security
Area of Concern for Code | Recommendation |
---|---|
Access over TLS only | If your code needs to communicate by TCP, perform a TLS handshake with the client ahead of time. With the exception of a few cases, encrypt everything in transit. Going one step further, it's a good idea to encrypt network traffic between services. This can be done through a process known as mutual TLS authentication or mTLS which performs a two sided verification of communication between two certificate holding services. |
Limiting port ranges of communication | This recommendation may be a bit self-explanatory, but wherever possible you should only expose the ports on your service that are absolutely essential for communication or metric gathering. |
3rd Party Dependency Security | It is a good practice to regularly scan your application's third party libraries for known security vulnerabilities. Each programming language has a tool for performing this check automatically. |
Static Code Analysis | Most languages provide a way for a snippet of code to be analyzed for any potentially unsafe coding practices. Whenever possible you should perform checks using automated tooling that can scan codebases for common security errors. Some of the tools can be found at: https://owasp.org/www-community/Source_Code_Analysis_Tools |
Dynamic probing attacks | There are a few automated tools that you can run against your service to try some of the well known service attacks. These include SQL injection, CSRF, and XSS. One of the most popular dynamic analysis tools is the OWASP Zed Attack proxy tool. |
What's next
Learn about related Kubernetes security topics:
3.8.2 - Pod Security Standards
The Pod Security Standards define three different policies to broadly cover the security spectrum. These policies are cumulative and range from highly-permissive to highly-restrictive. This guide outlines the requirements of each policy.
Profile | Description |
---|---|
Privileged | Unrestricted policy, providing the widest possible level of permissions. This policy allows for known privilege escalations. |
Baseline | Minimally restrictive policy which prevents known privilege escalations. Allows the default (minimally specified) Pod configuration. |
Restricted | Heavily restricted policy, following current Pod hardening best practices. |
Profile Details
Privileged
The Privileged policy is purposely-open, and entirely unrestricted. This type of policy is typically aimed at system- and infrastructure-level workloads managed by privileged, trusted users.
The Privileged policy is defined by an absence of restrictions. For allow-by-default enforcement mechanisms (such as gatekeeper), the Privileged policy may be an absence of applied constraints rather than an instantiated profile. In contrast, for a deny-by-default mechanism (such as Pod Security Policy) the Privileged policy should enable all controls (disable all restrictions).
Baseline
The Baseline policy is aimed at ease of adoption for common containerized workloads while preventing known privilege escalations. This policy is targeted at application operators and developers of non-critical applications. The following listed controls should be enforced/disallowed:
Note: In this table, wildcards (*) indicate all elements in a list. For example, spec.containers[*].securityContext refers to the Security Context object for all defined containers. If any of the listed containers fails to meet the requirements, the entire pod will fail validation.
Control | Policy |
---|---|
HostProcess | Windows pods offer the ability to run HostProcess containers which enables privileged access to the Windows node. Privileged access to the host is disallowed in the baseline policy. HostProcess pods are an alpha feature as of Kubernetes v1.22. |
Host Namespaces | Sharing the host namespaces must be disallowed. |
Privileged Containers | Privileged Pods disable most security mechanisms and must be disallowed. |
Capabilities | Adding additional capabilities beyond the allowed set must be disallowed. |
HostPath Volumes | HostPath volumes must be forbidden. |
Host Ports | HostPorts should be disallowed, or at minimum restricted to a known list. |
AppArmor | On supported hosts, the runtime/default AppArmor profile is applied by default. The baseline policy should prevent overriding or disabling the default AppArmor profile, or restrict overrides to an allowed set of profiles. |
SELinux | Setting the SELinux type is restricted, and setting a custom SELinux user or role option is forbidden. |
/proc Mount Type | The default /proc masks are set up to reduce attack surface, and should be required. |
Seccomp | Seccomp profile must not be explicitly set to Unconfined. |
Sysctls | Sysctls can disable security mechanisms or affect all containers on a host, and should be disallowed except for an allowed "safe" subset. A sysctl is considered safe if it is namespaced in the container or the Pod, and it is isolated from other Pods or processes on the same Node. |
Restricted
The Restricted policy is aimed at enforcing current Pod hardening best practices, at the expense of some compatibility. It is targeted at operators and developers of security-critical applications, as well as lower-trust users. The following listed controls should be enforced/disallowed:
Note: In this table, wildcards (*) indicate all elements in a list. For example, spec.containers[*].securityContext refers to the Security Context object for all defined containers. If any of the listed containers fails to meet the requirements, the entire pod will fail validation.
Control | Policy |
---|---|
Everything from the baseline profile. | |
Volume Types | The restricted policy only permits specific volume types. Every item in the spec.volumes[*] list must set one of the following fields to a non-null value: configMap, csi, downwardAPI, emptyDir, ephemeral, persistentVolumeClaim, projected, or secret. |
Privilege Escalation (v1.8+) | Privilege escalation (such as via set-user-ID or set-group-ID file mode) should not be allowed. |
Running as Non-root | Containers must be required to run as non-root users. The container-level runAsNonRoot field may be undefined/nil if the pod-level spec.securityContext.runAsNonRoot is set to true. |
Running as Non-root user (v1.23+) | Containers must not set runAsUser to 0. |
Seccomp (v1.19+) | Seccomp profile must be explicitly set to one of the allowed values. Both the Unconfined profile and the absence of a profile are prohibited. The container-level fields may be undefined/nil if the pod-level spec.securityContext.seccompProfile.type field is set appropriately. Conversely, the pod-level field may be undefined/nil if all container-level fields are set. |
Capabilities (v1.22+) | Containers must drop ALL capabilities, and are only permitted to add back the NET_BIND_SERVICE capability. |
Policy Instantiation
Decoupling policy definition from policy instantiation allows for a common understanding and consistent language of policies across clusters, independent of the underlying enforcement mechanism.
As mechanisms mature, they will be defined below on a per-policy basis. The methods of enforcement of individual policies are not defined here.
Pod Security Admission Controller
PodSecurityPolicy (Deprecated)
FAQ
Why isn't there a profile between privileged and baseline?
The three profiles defined here have a clear linear progression from most secure (restricted) to least secure (privileged), and cover a broad set of workloads. Privileges required above the baseline policy are typically very application specific, so we do not offer a standard profile in this niche. This is not to say that the privileged profile should always be used in this case, but that policies in this space need to be defined on a case-by-case basis.
SIG Auth may reconsider this position in the future, should a clear need for other profiles arise.
What's the difference between a security profile and a security context?
Security Contexts configure Pods and Containers at runtime. Security contexts are defined as part of the Pod and container specifications in the Pod manifest, and represent parameters to the container runtime.
Security profiles are control plane mechanisms to enforce specific settings in the Security Context, as well as other related parameters outside the Security Context. As of July 2021, Pod Security Policies are deprecated in favor of the built-in Pod Security Admission Controller.
Other alternatives for enforcing security profiles are being developed in the Kubernetes ecosystem, such as:
What profiles should I apply to my Windows Pods?
Windows in Kubernetes has some limitations and differentiators from standard Linux-based workloads. Specifically, many of the Pod SecurityContext fields have no effect on Windows. As such, no standardized Pod Security profiles currently exist.
If you apply the restricted profile for a Windows pod, this may have an impact on the pod at runtime. The restricted profile requires enforcing Linux-specific restrictions (such as seccomp profile, and disallowing privilege escalation). If the kubelet and / or its container runtime ignore these Linux-specific values, then the Windows pod should still work normally within the restricted profile. However, the lack of enforcement means that there is no additional restriction, for Pods that use Windows containers, compared to the baseline profile.
The use of the HostProcess flag to create a HostProcess pod should only be done in alignment with the privileged policy. Creation of a Windows HostProcess pod is blocked under the baseline and restricted policies, so any HostProcess pod should be considered privileged.
What about sandboxed Pods?
There is not currently an API standard that controls whether a Pod is considered sandboxed or not. Sandbox Pods may be identified by the use of a sandboxed runtime (such as gVisor or Kata Containers), but there is no standard definition of what a sandboxed runtime is.
The protections necessary for sandboxed workloads can differ from others. For example, the need to restrict privileged permissions is lessened when the workload is isolated from the underlying kernel. This allows for workloads requiring heightened permissions to still be isolated.
Additionally, the protection of sandboxed workloads is highly dependent on the method of sandboxing. As such, no single recommended profile is recommended for all sandboxed workloads.
3.8.3 - Pod Security Admission
Kubernetes v1.23 [beta]
The Kubernetes Pod Security Standards define different isolation levels for Pods. These standards let you define how you want to restrict the behavior of pods in a clear, consistent fashion.
As a Beta feature, Kubernetes offers a built-in Pod Security admission controller, the successor to PodSecurityPolicies. Pod security restrictions are applied at the namespace level when pods are created.
Enabling the PodSecurity admission plugin

In v1.23, the PodSecurity feature gate is a Beta feature and is enabled by default.

In v1.22, the PodSecurity feature gate is an Alpha feature and must be enabled in kube-apiserver in order to use the built-in admission plugin.
--feature-gates="...,PodSecurity=true"
Alternative: installing the PodSecurity admission webhook

For environments where the built-in PodSecurity admission plugin cannot be used, either because the cluster is older than v1.22, or the PodSecurity feature cannot be enabled, the PodSecurity admission logic is also available as a Beta validating admission webhook.
A pre-built container image, certificate generation scripts, and example manifests are available at https://git.k8s.io/pod-security-admission/webhook.
To install:
git clone git@github.com:kubernetes/pod-security-admission.git
cd pod-security-admission/webhook
make certs
kubectl apply -k .
Pod Security levels
Pod Security admission places requirements on a Pod's Security Context and other related fields according to the three levels defined by the Pod Security Standards: privileged, baseline, and restricted. Refer to the Pod Security Standards page for an in-depth look at those requirements.
Pod Security Admission labels for namespaces
Once the feature is enabled or the webhook is installed, you can configure namespaces to define the admission control mode you want to use for pod security in each namespace. Kubernetes defines a set of labels that you can set to define which of the predefined Pod Security Standard levels you want to use for a namespace. The label you select defines what action the control plane takes if a potential violation is detected:
Mode | Description |
---|---|
enforce | Policy violations will cause the pod to be rejected. |
audit | Policy violations will trigger the addition of an audit annotation to the event recorded in the audit log, but are otherwise allowed. |
warn | Policy violations will trigger a user-facing warning, but are otherwise allowed. |
A namespace can configure any or all modes, or even set a different level for different modes.
For each mode, there are two labels that determine the policy used:
# The per-mode level label indicates which policy level to apply for the mode.
#
# MODE must be one of `enforce`, `audit`, or `warn`.
# LEVEL must be one of `privileged`, `baseline`, or `restricted`.
pod-security.kubernetes.io/<MODE>: <LEVEL>
# Optional: per-mode version label that can be used to pin the policy to the
# version that shipped with a given Kubernetes minor version (for example v1.23).
#
# MODE must be one of `enforce`, `audit`, or `warn`.
# VERSION must be a valid Kubernetes minor version, or `latest`.
pod-security.kubernetes.io/<MODE>-version: <VERSION>
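For example, a sketch (the namespace name is illustrative) that enforces the baseline level while warning about anything that would violate restricted:

kubectl label --overwrite ns my-namespace \
  pod-security.kubernetes.io/enforce=baseline \
  pod-security.kubernetes.io/warn=restricted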
Check out Enforce Pod Security Standards with Namespace Labels to see example usage.
Workload resources and Pod templates
Pods are often created indirectly, by creating a workload object such as a Deployment or Job. The workload object defines a Pod template and a controller for the workload resource creates Pods based on that template. To help catch violations early, both the audit and warning modes are applied to the workload resources. However, enforce mode is not applied to workload resources, only to the resulting pod objects.
Exemptions
You can define exemptions from pod security enforcement in order to allow the creation of pods that would have otherwise been prohibited due to the policy associated with a given namespace. Exemptions can be statically configured in the Admission Controller configuration.
Exemptions must be explicitly enumerated. Requests meeting exemption criteria are ignored by the Admission Controller (all enforce, audit and warn behaviors are skipped). Exemption dimensions include:
- Usernames: requests from users with an exempt authenticated (or impersonated) username are ignored.
- RuntimeClassNames: pods and workload resources specifying an exempt runtime class name are ignored.
- Namespaces: pods and workload resources in an exempt namespace are ignored.
Note: Controller service accounts (such as system:serviceaccount:kube-system:replicaset-controller) should generally not be exempted, as doing so would implicitly exempt any user that can create the corresponding workload resource.
Updates to the following pod fields are exempt from policy checks, meaning that if a pod update request only changes these fields, it will not be denied even if the pod is in violation of the current policy level:
- Any metadata updates except changes to the seccomp or AppArmor annotations:
  - seccomp.security.alpha.kubernetes.io/pod (deprecated)
  - container.seccomp.security.alpha.kubernetes.io/* (deprecated)
  - container.apparmor.security.beta.kubernetes.io/*
- Valid updates to .spec.activeDeadlineSeconds
- Valid updates to .spec.tolerations
What's next
3.8.4 - Controlling Access to the Kubernetes API
This page provides an overview of controlling access to the Kubernetes API.
Users access the Kubernetes API using kubectl, client libraries, or by making REST requests. Both human users and Kubernetes service accounts can be authorized for API access.
When a request reaches the API, it goes through several stages, illustrated in the
following diagram:
Transport security
In a typical Kubernetes cluster, the API serves on port 443, protected by TLS. The API server presents a certificate. This certificate may be signed using a private certificate authority (CA), or based on a public key infrastructure linked to a generally recognized CA.
If your cluster uses a private certificate authority, you need a copy of that CA
certificate configured into your ~/.kube/config
on the client, so that you can
trust the connection and be confident it was not intercepted.
Your client can present a TLS client certificate at this stage.
Authentication
Once TLS is established, the HTTP request moves to the Authentication step. This is shown as step 1 in the diagram. The cluster creation script or cluster admin configures the API server to run one or more Authenticator modules. Authenticators are described in more detail in Authentication.
The input to the authentication step is the entire HTTP request; however, it typically examines the headers and/or client certificate.
Authentication modules include client certificates, passwords, plain tokens, bootstrap tokens, and JSON Web Tokens (used for service accounts).
Multiple authentication modules can be specified, in which case each one is tried in sequence, until one of them succeeds.
If the request cannot be authenticated, it is rejected with HTTP status code 401.
Otherwise, the user is authenticated as a specific username, and the user name is available to subsequent steps to use in their decisions. Some authenticators also provide the group memberships of the user, while other authenticators do not.
While Kubernetes uses usernames for access control decisions and in request logging, it does not have a User object, nor does it store usernames or other information about users in its API.
Authorization
After the request is authenticated as coming from a specific user, the request must be authorized. This is shown as step 2 in the diagram.
A request must include the username of the requester, the requested action, and the object affected by the action. The request is authorized if an existing policy declares that the user has permissions to complete the requested action.
For example, if Bob has the policy below, then he can read pods only in the namespace projectCaribou:
{
    "apiVersion": "abac.authorization.kubernetes.io/v1beta1",
    "kind": "Policy",
    "spec": {
        "user": "bob",
        "namespace": "projectCaribou",
        "resource": "pods",
        "readonly": true
    }
}
If Bob makes the following request, the request is authorized because he is allowed to read objects in the projectCaribou namespace:
{
    "apiVersion": "authorization.k8s.io/v1beta1",
    "kind": "SubjectAccessReview",
    "spec": {
        "resourceAttributes": {
            "namespace": "projectCaribou",
            "verb": "get",
            "group": "unicorn.example.org",
            "resource": "pods"
        }
    }
}
If Bob makes a request to write (create or update) to the objects in the projectCaribou namespace, his authorization is denied. If Bob makes a request to read (get) objects in a different namespace such as projectFish, then his authorization is denied.
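If you are allowed to impersonate other users, you can check such authorization decisions from the command line with kubectl auth can-i; for example:
kubectl auth can-i get pods --namespace=projectCaribou --as=bob
yes
kubectl auth can-i create pods --namespace=projectCaribou --as=bob
no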
Kubernetes authorization requires that you use common REST attributes to interact with existing organization-wide or cloud-provider-wide access control systems. It is important to use REST formatting because these control systems might interact with other APIs besides the Kubernetes API.
Kubernetes supports multiple authorization modules, such as ABAC mode, RBAC mode, and Webhook mode. When an administrator creates a cluster, they configure the authorization modules that should be used in the API server. If more than one authorization module is configured, Kubernetes checks each module, and if any module authorizes the request, then the request can proceed. If all of the modules deny the request, then the request is denied (HTTP status code 403).
To learn more about Kubernetes authorization, including details about creating policies using the supported authorization modules, see Authorization.
Admission control
Admission Control modules are software modules that can modify or reject requests. In addition to the attributes available to Authorization modules, Admission Control modules can access the contents of the object that is being created or modified.
Admission controllers act on requests that create, modify, delete, or connect to (proxy) an object. Admission controllers do not act on requests that merely read objects. When multiple admission controllers are configured, they are called in order.
This is shown as step 3 in the diagram.
Unlike Authentication and Authorization modules, if any admission controller module rejects, then the request is immediately rejected.
In addition to rejecting objects, admission controllers can also set complex defaults for fields.
The available Admission Control modules are described in Admission Controllers.
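Which admission controllers run is configured on the API server. For example, a kube-apiserver invocation might include a flag like the following (the plugin list shown is illustrative, not a recommendation):
kube-apiserver --enable-admission-plugins=NamespaceLifecycle,LimitRanger,ServiceAccount,ResourceQuota ...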
Once a request passes all admission controllers, it is validated using the validation routines for the corresponding API object, and then written to the object store (shown as step 4).
API server ports and IPs
The previous discussion applies to requests sent to the secure port of the API server (the typical case). By default, the Kubernetes API server actually serves HTTP on 2 ports:
- localhost port:
  - is intended for testing and bootstrap, and for other components of the master node (scheduler, controller-manager) to talk to the API
  - no TLS
  - default is port 8080
  - default IP is localhost, change with the --insecure-bind-address flag
  - requests bypass the authentication and authorization modules
  - requests are handled by the admission control module(s)
  - protected by the need to have host access
- "Secure port":
  - use whenever possible
  - uses TLS. Set the certificate with the --tls-cert-file flag and the key with the --tls-private-key-file flag
  - default is port 6443, change with the --secure-port flag
  - default IP is the first non-localhost network interface, change with the --bind-address flag
  - requests are handled by the authentication and authorization modules
  - requests are handled by the admission control module(s)
What's next
Read more documentation on authentication, authorization and API access control:
- Authenticating
- Admission Controllers
- Authorization
- Certificate Signing Requests, including CSR approval and certificate signing
- Service accounts
You can learn about:
- how Pods can use Secrets to obtain API credentials.
3.9 - Policies
3.9.1 - Limit Ranges
By default, containers run with unbounded compute resources on a Kubernetes cluster. With resource quotas, cluster administrators can restrict resource consumption and creation on a namespace basis. Within a namespace, a Pod or Container can consume as much CPU and memory as defined by the namespace's resource quota. There is a concern that one Pod or Container could monopolize all available resources. A LimitRange is a policy to constrain resource allocations (to Pods or Containers) in a namespace.
A LimitRange provides constraints that can:
- Enforce minimum and maximum compute resources usage per Pod or Container in a namespace.
- Enforce minimum and maximum storage request per PersistentVolumeClaim in a namespace.
- Enforce a ratio between request and limit for a resource in a namespace.
- Set default request/limit for compute resources in a namespace and automatically inject them to Containers at runtime.
Enabling LimitRange
LimitRange support has been enabled by default since Kubernetes 1.10.
A LimitRange is enforced in a particular namespace when there is a LimitRange object in that namespace.
The name of a LimitRange object must be a valid DNS subdomain name.
Overview of Limit Range
- The administrator creates one LimitRange in one namespace.
- Users create resources like Pods, Containers, and PersistentVolumeClaims in the namespace.
- The LimitRanger admission controller enforces defaults and limits for all Pods and Containers that do not set compute resource requirements, and tracks usage to ensure it does not exceed the resource minimum, maximum, and ratio defined in any LimitRange present in the namespace.
- If creating or updating a resource (Pod, Container, PersistentVolumeClaim) that violates a LimitRange constraint, the request to the API server will fail with an HTTP status code 403 FORBIDDEN and a message explaining the constraint that has been violated.
- If a LimitRange is activated in a namespace for compute resources like cpu and memory, users must specify requests or limits for those values. Otherwise, the system may reject Pod creation.
- LimitRange validation occurs only at Pod admission stage, not on running Pods.
Examples of policies that could be created using limit range are:
- In a 2 node cluster with a capacity of 8 GiB RAM and 16 cores, constrain Pods in a namespace to request 100m of CPU with a max limit of 500m for CPU and request 200Mi for Memory with a max limit of 600Mi for Memory.
- Define default CPU limit and request to 150m and memory default request to 300Mi for Containers started with no cpu and memory requests in their specs.
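For example, a LimitRange implementing container defaults like the second policy above might look like this sketch (the object name and exact values are illustrative):
apiVersion: v1
kind: LimitRange
metadata:
  name: cpu-mem-defaults
spec:
  limits:
  - type: Container
    defaultRequest:   # injected into Containers that set no requests
      cpu: 150m
      memory: 300Mi
    default:          # injected into Containers that set no limits
      cpu: 150m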
In the case where the total limits of the namespace are less than the sum of the limits of the Pods/Containers, there may be contention for resources. In this case, the Containers or Pods will not be created.
Neither contention nor changes to a LimitRange will affect already created resources.
What's next
Refer to the LimitRanger design document for more information.
For examples on using limits, see:
- how to configure minimum and maximum CPU constraints per namespace.
- how to configure minimum and maximum Memory constraints per namespace.
- how to configure default CPU Requests and Limits per namespace.
- how to configure default Memory Requests and Limits per namespace.
- how to configure minimum and maximum Storage consumption per namespace.
- a detailed example on configuring quota per namespace.
3.9.2 - Resource Quotas
When several users or teams share a cluster with a fixed number of nodes, there is a concern that one team could use more than its fair share of resources.
Resource quotas are a tool for administrators to address this concern.
A resource quota, defined by a ResourceQuota
object, provides constraints that limit
aggregate resource consumption per namespace. It can limit the quantity of objects that can
be created in a namespace by type, as well as the total amount of compute resources that may
be consumed by resources in that namespace.
Resource quotas work like this:
- Different teams work in different namespaces. Currently this is voluntary, but support for making this mandatory via ACLs is planned.
- The administrator creates one ResourceQuota for each namespace.
- Users create resources (pods, services, etc.) in the namespace, and the quota system tracks usage to ensure it does not exceed hard resource limits defined in a ResourceQuota.
- If creating or updating a resource violates a quota constraint, the request will fail with HTTP status code 403 FORBIDDEN with a message explaining the constraint that would have been violated.
- If quota is enabled in a namespace for compute resources like cpu and memory, users must specify requests or limits for those values; otherwise, the quota system may reject pod creation. Hint: Use the LimitRanger admission controller to force defaults for pods that make no compute resource requirements. See the walkthrough for an example of how to avoid this problem.
The name of a ResourceQuota object must be a valid DNS subdomain name.
Examples of policies that could be created using namespaces and quotas are:
- In a cluster with a capacity of 32 GiB RAM, and 16 cores, let team A use 20 GiB and 10 cores, let B use 10GiB and 4 cores, and hold 2GiB and 2 cores in reserve for future allocation.
- Limit the "testing" namespace to using 1 core and 1GiB RAM. Let the "production" namespace use any amount.
In the case where the total capacity of the cluster is less than the sum of the quotas of the namespaces, there may be contention for resources. This is handled on a first-come-first-served basis.
Neither contention nor changes to quota will affect already created resources.
Enabling Resource Quota
Resource Quota support is enabled by default for many Kubernetes distributions. It is enabled when the API server --enable-admission-plugins= flag has ResourceQuota as one of its arguments.
A resource quota is enforced in a particular namespace when there is a ResourceQuota in that namespace.
Compute Resource Quota
You can limit the total sum of compute resources that can be requested in a given namespace.
The following resource types are supported:
Resource Name | Description
---|---
limits.cpu | Across all pods in a non-terminal state, the sum of CPU limits cannot exceed this value.
limits.memory | Across all pods in a non-terminal state, the sum of memory limits cannot exceed this value.
requests.cpu | Across all pods in a non-terminal state, the sum of CPU requests cannot exceed this value.
requests.memory | Across all pods in a non-terminal state, the sum of memory requests cannot exceed this value.
hugepages-<size> | Across all pods in a non-terminal state, the number of huge page requests of the specified size cannot exceed this value.
cpu | Same as requests.cpu
memory | Same as requests.memory
Resource Quota For Extended Resources
In addition to the resources mentioned above, in release 1.10, quota support for extended resources was added.
As overcommit is not allowed for extended resources, it makes no sense to specify both requests and limits for the same extended resource in a quota. So for extended resources, only quota items with the requests. prefix are allowed for now.
Take the GPU resource as an example: if the resource name is nvidia.com/gpu, and you want to limit the total number of GPUs requested in a namespace to 4, you can define a quota as follows:
requests.nvidia.com/gpu: 4
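Expressed as a complete ResourceQuota object, that might look like the following sketch (the object name is illustrative):
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
spec:
  hard:
    requests.nvidia.com/gpu: 4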
See Viewing and Setting Quotas for more detailed information.
Storage Resource Quota
You can limit the total sum of storage resources that can be requested in a given namespace.
In addition, you can limit consumption of storage resources based on associated storage-class.
Resource Name | Description
---|---
requests.storage | Across all persistent volume claims, the sum of storage requests cannot exceed this value.
persistentvolumeclaims | The total number of PersistentVolumeClaims that can exist in the namespace.
<storage-class-name>.storageclass.storage.k8s.io/requests.storage | Across all persistent volume claims associated with the <storage-class-name>, the sum of storage requests cannot exceed this value.
<storage-class-name>.storageclass.storage.k8s.io/persistentvolumeclaims | Across all persistent volume claims associated with the <storage-class-name>, the total number of persistent volume claims that can exist in the namespace.
For example, if an operator wants to quota storage with the gold storage class separately from the bronze storage class, the operator can define a quota as follows:
gold.storageclass.storage.k8s.io/requests.storage: 500Gi
bronze.storageclass.storage.k8s.io/requests.storage: 100Gi
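As a complete ResourceQuota object, this might look like the following sketch (the object name is illustrative):
apiVersion: v1
kind: ResourceQuota
metadata:
  name: storage-by-class
spec:
  hard:
    gold.storageclass.storage.k8s.io/requests.storage: 500Gi
    bronze.storageclass.storage.k8s.io/requests.storage: 100Gi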
In release 1.8, quota support for local ephemeral storage was added as an alpha feature:
Resource Name | Description
---|---
requests.ephemeral-storage | Across all pods in the namespace, the sum of local ephemeral storage requests cannot exceed this value.
limits.ephemeral-storage | Across all pods in the namespace, the sum of local ephemeral storage limits cannot exceed this value.
ephemeral-storage | Same as requests.ephemeral-storage.
Object Count Quota
You can set quota for the total number of certain resources of all standard, namespaced resource types using the following syntax:
- count/<resource>.<group> for resources from non-core groups
- count/<resource> for resources from the core group
Here is an example set of resources users may want to put under object count quota:
- count/persistentvolumeclaims
- count/services
- count/secrets
- count/configmaps
- count/replicationcontrollers
- count/deployments.apps
- count/replicasets.apps
- count/statefulsets.apps
- count/jobs.batch
- count/cronjobs.batch
The same syntax can be used for custom resources. For example, to create a quota on a widgets custom resource in the example.com API group, use count/widgets.example.com.
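A quota for that custom resource might look like this sketch (the object name is illustrative):
apiVersion: v1
kind: ResourceQuota
metadata:
  name: widget-count
spec:
  hard:
    count/widgets.example.com: "10"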
When using count/* resource quota, an object is charged against the quota if it exists in server storage.
These types of quotas are useful to protect against exhaustion of storage resources. For example, you may
want to limit the number of Secrets in a server given their large size. Too many Secrets in a cluster can
actually prevent servers and controllers from starting. You can set a quota for Jobs to protect against
a poorly configured CronJob. CronJobs that create too many Jobs in a namespace can lead to a denial of service.
It is also possible to do generic object count quota on a limited set of resources. The following types are supported:
Resource Name | Description
---|---
configmaps | The total number of ConfigMaps that can exist in the namespace.
persistentvolumeclaims | The total number of PersistentVolumeClaims that can exist in the namespace.
pods | The total number of Pods in a non-terminal state that can exist in the namespace. A pod is in a terminal state if .status.phase in (Failed, Succeeded) is true.
replicationcontrollers | The total number of ReplicationControllers that can exist in the namespace.
resourcequotas | The total number of ResourceQuotas that can exist in the namespace.
services | The total number of Services that can exist in the namespace.
services.loadbalancers | The total number of Services of type LoadBalancer that can exist in the namespace.
services.nodeports | The total number of Services of type NodePort that can exist in the namespace.
secrets | The total number of Secrets that can exist in the namespace.
For example, pods quota counts and enforces a maximum on the number of pods created in a single namespace that are not terminal. You might want to set a pods quota on a namespace to avoid the case where a user creates many small pods and exhausts the cluster's supply of Pod IPs.
Quota Scopes
Each quota can have an associated set of scopes. A quota will only measure usage for a resource if it matches the intersection of enumerated scopes.
When a scope is added to the quota, it limits the number of resources it supports to those that pertain to the scope. Resources specified on the quota outside of the allowed set result in a validation error.
Scope | Description
---|---
Terminating | Match pods where .spec.activeDeadlineSeconds >= 0
NotTerminating | Match pods where .spec.activeDeadlineSeconds is nil
BestEffort | Match pods that have best effort quality of service.
NotBestEffort | Match pods that do not have best effort quality of service.
PriorityClass | Match pods that reference the specified priority class.
CrossNamespacePodAffinity | Match pods that have cross-namespace pod (anti)affinity terms.
The BestEffort scope restricts a quota to tracking the following resource:
- pods
The Terminating, NotTerminating, NotBestEffort and PriorityClass scopes restrict a quota to tracking the following resources:
- pods
- cpu
- memory
- requests.cpu
- requests.memory
- limits.cpu
- limits.memory
Note that you cannot specify both the Terminating and the NotTerminating scopes in the same quota, and you cannot specify both the BestEffort and NotBestEffort scopes in the same quota either.
The scopeSelector supports the following values in the operator field:
- In
- NotIn
- Exists
- DoesNotExist
When using one of the following values as the scopeName when defining the scopeSelector, the operator must be Exists.
- Terminating
- NotTerminating
- BestEffort
- NotBestEffort
If the operator is In or NotIn, the values field must have at least one value. For example:
scopeSelector:
  matchExpressions:
  - scopeName: PriorityClass
    operator: In
    values:
    - middle
If the operator is Exists or DoesNotExist, the values field must NOT be specified.
Resource Quota Per PriorityClass
Kubernetes v1.17 [stable]
Pods can be created at a specific priority. You can control a pod's consumption of system resources based on a pod's priority, by using the scopeSelector field in the quota spec.
A quota is matched and consumed only if scopeSelector in the quota spec selects the pod.
When a quota is scoped for priority class using the scopeSelector field, the quota object is restricted to tracking only the following resources:
- pods
- cpu
- memory
- ephemeral-storage
- limits.cpu
- limits.memory
- limits.ephemeral-storage
- requests.cpu
- requests.memory
- requests.ephemeral-storage
This example creates a quota object and matches it with pods at specific priorities. The example works as follows:
- Pods in the cluster have one of the three priority classes, "low", "medium", "high".
- One quota object is created for each priority.
Save the following YAML to a file quota.yml.
apiVersion: v1
kind: List
items:
- apiVersion: v1
  kind: ResourceQuota
  metadata:
    name: pods-high
  spec:
    hard:
      cpu: "1000"
      memory: 200Gi
      pods: "10"
    scopeSelector:
      matchExpressions:
      - operator: In
        scopeName: PriorityClass
        values: ["high"]
- apiVersion: v1
  kind: ResourceQuota
  metadata:
    name: pods-medium
  spec:
    hard:
      cpu: "10"
      memory: 20Gi
      pods: "10"
    scopeSelector:
      matchExpressions:
      - operator: In
        scopeName: PriorityClass
        values: ["medium"]
- apiVersion: v1
  kind: ResourceQuota
  metadata:
    name: pods-low
  spec:
    hard:
      cpu: "5"
      memory: 10Gi
      pods: "10"
    scopeSelector:
      matchExpressions:
      - operator: In
        scopeName: PriorityClass
        values: ["low"]
Apply the YAML using kubectl create.
kubectl create -f ./quota.yml
resourcequota/pods-high created
resourcequota/pods-medium created
resourcequota/pods-low created
Verify that Used quota is 0 using kubectl describe quota.
kubectl describe quota
Name: pods-high
Namespace: default
Resource Used Hard
-------- ---- ----
cpu 0 1k
memory 0 200Gi
pods 0 10
Name: pods-low
Namespace: default
Resource Used Hard
-------- ---- ----
cpu 0 5
memory 0 10Gi
pods 0 10
Name: pods-medium
Namespace: default
Resource Used Hard
-------- ---- ----
cpu 0 10
memory 0 20Gi
pods 0 10
Create a pod with priority "high". Save the following YAML to a file high-priority-pod.yml.
apiVersion: v1
kind: Pod
metadata:
  name: high-priority
spec:
  containers:
  - name: high-priority
    image: ubuntu
    command: ["/bin/sh"]
    args: ["-c", "while true; do echo hello; sleep 10;done"]
    resources:
      requests:
        memory: "10Gi"
        cpu: "500m"
      limits:
        memory: "10Gi"
        cpu: "500m"
  priorityClassName: high
Apply it with kubectl create.
kubectl create -f ./high-priority-pod.yml
Verify that "Used" stats for "high" priority quota, pods-high
, has changed and that
the other two quotas are unchanged.
kubectl describe quota
Name: pods-high
Namespace: default
Resource Used Hard
-------- ---- ----
cpu 500m 1k
memory 10Gi 200Gi
pods 1 10
Name: pods-low
Namespace: default
Resource Used Hard
-------- ---- ----
cpu 0 5
memory 0 10Gi
pods 0 10
Name: pods-medium
Namespace: default
Resource Used Hard
-------- ---- ----
cpu 0 10
memory 0 20Gi
pods 0 10
Cross-namespace Pod Affinity Quota
Kubernetes v1.22 [beta]
Operators can use the CrossNamespacePodAffinity quota scope to limit which namespaces are allowed to have pods with affinity terms that cross namespaces. Specifically, it controls which pods are allowed to set the namespaces or namespaceSelector fields in pod affinity terms.
Preventing users from using cross-namespace affinity terms might be desired since a pod with anti-affinity constraints can block pods from all other namespaces from getting scheduled in a failure domain.
Using this scope, operators can prevent certain namespaces (foo-ns in the example below) from having pods that use cross-namespace pod affinity, by creating a resource quota object in that namespace with the CrossNamespaceAffinity scope and a hard limit of 0:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: disable-cross-namespace-affinity
  namespace: foo-ns
spec:
  hard:
    pods: "0"
  scopeSelector:
    matchExpressions:
    - scopeName: CrossNamespaceAffinity
If operators want to disallow using namespaces and namespaceSelector by default, and only allow it for specific namespaces, they could configure CrossNamespaceAffinity as a limited resource by setting the kube-apiserver flag --admission-control-config-file to the path of the following configuration file:
apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
- name: "ResourceQuota"
  configuration:
    apiVersion: apiserver.config.k8s.io/v1
    kind: ResourceQuotaConfiguration
    limitedResources:
    - resource: pods
      matchScopes:
      - scopeName: CrossNamespaceAffinity
With the above configuration, pods can use namespaces and namespaceSelector in pod affinity only if the namespace where they are created has a resource quota object with the CrossNamespaceAffinity scope and a hard limit greater than or equal to the number of pods using those fields.
This feature is beta and enabled by default. You can disable it using the feature gate PodAffinityNamespaceSelector in both kube-apiserver and kube-scheduler.
Requests compared to Limits
When allocating compute resources, each container may specify a request and a limit value for either CPU or memory. The quota can be configured to constrain either value.
If the quota has a value specified for requests.cpu or requests.memory, then it requires that every incoming container makes an explicit request for those resources. If the quota has a value specified for limits.cpu or limits.memory, then it requires that every incoming container specifies an explicit limit for those resources.
Viewing and Setting Quotas
Kubectl supports creating, updating, and viewing quotas:
kubectl create namespace myspace
cat <<EOF > compute-resources.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
spec:
  hard:
    requests.cpu: "1"
    requests.memory: 1Gi
    limits.cpu: "2"
    limits.memory: 2Gi
    requests.nvidia.com/gpu: 4
EOF
kubectl create -f ./compute-resources.yaml --namespace=myspace
cat <<EOF > object-counts.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: object-counts
spec:
  hard:
    configmaps: "10"
    persistentvolumeclaims: "4"
    pods: "4"
    replicationcontrollers: "20"
    secrets: "10"
    services: "10"
    services.loadbalancers: "2"
EOF
kubectl create -f ./object-counts.yaml --namespace=myspace
kubectl get quota --namespace=myspace
NAME AGE
compute-resources 30s
object-counts 32s
kubectl describe quota compute-resources --namespace=myspace
Name: compute-resources
Namespace: myspace
Resource Used Hard
-------- ---- ----
limits.cpu 0 2
limits.memory 0 2Gi
requests.cpu 0 1
requests.memory 0 1Gi
requests.nvidia.com/gpu 0 4
kubectl describe quota object-counts --namespace=myspace
Name: object-counts
Namespace: myspace
Resource Used Hard
-------- ---- ----
configmaps 0 10
persistentvolumeclaims 0 4
pods 0 4
replicationcontrollers 0 20
secrets 1 10
services 0 10
services.loadbalancers 0 2
Kubectl also supports object count quota for all standard namespaced resources using the syntax count/<resource>.<group>:
kubectl create namespace myspace
kubectl create quota test --hard=count/deployments.apps=2,count/replicasets.apps=4,count/pods=3,count/secrets=4 --namespace=myspace
kubectl create deployment nginx --image=nginx --namespace=myspace --replicas=2
kubectl describe quota --namespace=myspace
Name: test
Namespace: myspace
Resource Used Hard
-------- ---- ----
count/deployments.apps 1 2
count/pods 2 3
count/replicasets.apps 1 4
count/secrets 1 4
Quota and Cluster Capacity
ResourceQuotas are independent of the cluster capacity. They are expressed in absolute units. So, if you add nodes to your cluster, this does not automatically give each namespace the ability to consume more resources.
Sometimes more complex policies may be desired, such as:
- Proportionally divide total cluster resources among several teams.
- Allow each tenant to grow resource usage as needed, but have a generous limit to prevent accidental resource exhaustion.
- Detect demand from one namespace, add nodes, and increase quota.
Such policies could be implemented using ResourceQuotas as building blocks, by writing a "controller" that watches the quota usage and adjusts the quota hard limits of each namespace according to other signals.
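For instance, such a controller (or an administrator acting by hand) might raise a namespace's hard limit with a patch like the following sketch (the quota and namespace names are illustrative):
kubectl patch resourcequota compute-resources --namespace=myspace \
  --patch '{"spec":{"hard":{"requests.cpu":"2"}}}'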
Note that resource quota divides up aggregate cluster resources, but it creates no restrictions around nodes: pods from several namespaces may run on the same node.
Limit Priority Class consumption by default
It may be desired that pods at a particular priority, e.g. "cluster-services", should be allowed in a namespace if and only if a matching quota object exists.
With this mechanism, operators are able to restrict usage of certain high priority classes to a limited number of namespaces and not every namespace will be able to consume these priority classes by default.
To enforce this, the kube-apiserver flag --admission-control-config-file should be used to pass the path to the following configuration file:
apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
- name: "ResourceQuota"
  configuration:
    apiVersion: apiserver.config.k8s.io/v1
    kind: ResourceQuotaConfiguration
    limitedResources:
    - resource: pods
      matchScopes:
      - scopeName: PriorityClass
        operator: In
        values: ["cluster-services"]
Then, create a resource quota object in the kube-system namespace:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: pods-cluster-services
spec:
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values: ["cluster-services"]
kubectl apply -f https://k8s.io/examples/policy/priority-class-resourcequota.yaml -n kube-system
resourcequota/pods-cluster-services created
In this case, a pod creation will be allowed if:
- the Pod's priorityClassName is not specified.
- the Pod's priorityClassName is specified to a value other than cluster-services.
- the Pod's priorityClassName is set to cluster-services, it is to be created in the kube-system namespace, and it has passed the resource quota check.
A Pod creation request is rejected if its priorityClassName is set to cluster-services and it is to be created in a namespace other than kube-system.
What's next
- See ResourceQuota design doc for more information.
- See a detailed example for how to use resource quota.
- Read Quota support for priority class design doc.
- See LimitedResources
3.9.3 - Pod Security Policies
Kubernetes v1.21 [deprecated]
Pod Security Policies enable fine-grained authorization of pod creation and updates.
What is a Pod Security Policy?
A Pod Security Policy is a cluster-level resource that controls security sensitive aspects of the pod specification. The PodSecurityPolicy objects define a set of conditions that a pod must run with in order to be accepted into the system, as well as defaults for the related fields. They allow an administrator to control the following:
Control Aspect | Field Names
---|---
Running of privileged containers | privileged
Usage of host namespaces | hostPID, hostIPC
Usage of host networking and ports | hostNetwork, hostPorts
Usage of volume types | volumes
Usage of the host filesystem | allowedHostPaths
Allow specific FlexVolume drivers | allowedFlexVolumes
Allocating an FSGroup that owns the pod's volumes | fsGroup
Requiring the use of a read only root file system | readOnlyRootFilesystem
The user and group IDs of the container | runAsUser, runAsGroup, supplementalGroups
Restricting escalation to root privileges | allowPrivilegeEscalation, defaultAllowPrivilegeEscalation
Linux capabilities | defaultAddCapabilities, requiredDropCapabilities, allowedCapabilities
The SELinux context of the container | seLinux
The Allowed Proc Mount types for the container | allowedProcMountTypes
The AppArmor profile used by containers | annotations
The seccomp profile used by containers | annotations
The sysctl profile used by containers | forbiddenSysctls, allowedUnsafeSysctls
Enabling Pod Security Policies
Pod security policy control is implemented as an optional admission controller. PodSecurityPolicies are enforced by enabling the admission controller, but doing so without authorizing any policies will prevent any pods from being created in the cluster.
Since the pod security policy API (policy/v1beta1/podsecuritypolicy) is enabled independently of the admission controller, for existing clusters it is recommended that policies are added and authorized before enabling the admission controller.
Authorizing Policies
When a PodSecurityPolicy resource is created, it does nothing. In order to use it, the requesting user or target pod's service account must be authorized to use the policy, by allowing the use verb on the policy.
Most Kubernetes pods are not created directly by users. Instead, they are typically created indirectly as part of a Deployment, ReplicaSet, or other templated controller via the controller manager. Granting the controller access to the policy would grant access for all pods created by that controller, so the preferred method for authorizing policies is to grant access to the pod's service account (see example).
Via RBAC
RBAC is a standard Kubernetes authorization mode, and can easily be used to authorize use of policies.
First, a Role or ClusterRole needs to grant access to use the desired policies. The rules to grant access look like this:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: <role name>
rules:
- apiGroups: ['policy']
  resources: ['podsecuritypolicies']
  verbs: ['use']
  resourceNames:
  - <list of policies to authorize>
Then the (Cluster)Role is bound to the authorized user(s):
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: <binding name>
roleRef:
  kind: ClusterRole
  name: <role name>
  apiGroup: rbac.authorization.k8s.io
subjects:
# Authorize all service accounts in a namespace (recommended):
- kind: Group
  apiGroup: rbac.authorization.k8s.io
  name: system:serviceaccounts:<authorized namespace>
# Authorize specific service accounts (not recommended):
- kind: ServiceAccount
  name: <authorized service account name>
  namespace: <authorized pod namespace>
# Authorize specific users (not recommended):
- kind: User
  apiGroup: rbac.authorization.k8s.io
  name: <authorized user name>
If a RoleBinding (not a ClusterRoleBinding) is used, it will only grant usage for pods being run in the same namespace as the binding. This can be paired with system groups to grant access to all pods run in the namespace:
# Authorize all service accounts in a namespace:
- kind: Group
  apiGroup: rbac.authorization.k8s.io
  name: system:serviceaccounts
# Or equivalently, all authenticated users in a namespace:
- kind: Group
  apiGroup: rbac.authorization.k8s.io
  name: system:authenticated
For more examples of RBAC bindings, see RoleBinding examples. For a complete example of authorizing a PodSecurityPolicy, see below.
Recommended Practice
PodSecurityPolicy is being replaced by a new, simplified PodSecurity admission controller. For more details on this change, see PodSecurityPolicy Deprecation: Past, Present, and Future. Follow these guidelines to simplify migration from PodSecurityPolicy to the new admission controller:
- Limit your PodSecurityPolicies to the policies defined by the Pod Security Standards.
- Only bind PSPs to entire namespaces, by using the system:serviceaccounts:<namespace> group (where <namespace> is the target namespace). For example:
  apiVersion: rbac.authorization.k8s.io/v1
  # This cluster role binding allows all pods in the "development" namespace to use the baseline PSP.
  kind: ClusterRoleBinding
  metadata:
    name: psp-baseline-namespaces
  roleRef:
    kind: ClusterRole
    name: psp-baseline
    apiGroup: rbac.authorization.k8s.io
  subjects:
  - kind: Group
    name: system:serviceaccounts:development
    apiGroup: rbac.authorization.k8s.io
  - kind: Group
    name: system:serviceaccounts:canary
    apiGroup: rbac.authorization.k8s.io
Troubleshooting
- The controller manager must be run against the secured API port and must not have superuser permissions. See Controlling Access to the Kubernetes API to learn about API server access controls.
  If the controller manager connected through the trusted API port (also known as the localhost listener), requests would bypass authentication and authorization modules; all PodSecurityPolicy objects would be allowed, and users would be able to grant themselves the ability to create privileged containers. For more details on configuring controller manager authorization, see Controller Roles.
Policy Order
In addition to restricting pod creation and update, pod security policies can also be used to provide default values for many of the fields that it controls. When multiple policies are available, the pod security policy controller selects policies according to the following criteria:
- PodSecurityPolicies which allow the pod as-is, without changing defaults or mutating the pod, are preferred. The order of these non-mutating PodSecurityPolicies doesn't matter.
- If the pod must be defaulted or mutated, the first PodSecurityPolicy (ordered by name) to allow the pod is selected.
Example
This example assumes you have a running cluster with the PodSecurityPolicy admission controller enabled and you have cluster admin privileges.
Set up
Set up a namespace and a service account to use for this example. We'll use this service account to mock a non-admin user.
kubectl create namespace psp-example
kubectl create serviceaccount -n psp-example fake-user
kubectl create rolebinding -n psp-example fake-editor --clusterrole=edit --serviceaccount=psp-example:fake-user
To make it clear which user we're acting as and save some typing, create 2 aliases:
alias kubectl-admin='kubectl -n psp-example'
alias kubectl-user='kubectl --as=system:serviceaccount:psp-example:fake-user -n psp-example'
Create a policy and a pod
Define the example PodSecurityPolicy object in a file. This is a policy that prevents the creation of privileged pods. The name of a PodSecurityPolicy object must be a valid DNS subdomain name.
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: example
spec:
  privileged: false  # Don't allow privileged pods!
  # The rest fills in some required fields.
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  runAsUser:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  volumes:
  - '*'
And create it with kubectl:
kubectl-admin create -f example-psp.yaml
Now, as the unprivileged user, try to create a simple pod:
kubectl-user create -f- <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: pause
spec:
  containers:
  - name: pause
    image: k8s.gcr.io/pause
EOF
The output is similar to this:
Error from server (Forbidden): error when creating "STDIN": pods "pause" is forbidden: unable to validate against any pod security policy: []
What happened? Although the PodSecurityPolicy was created, neither the pod's service account nor fake-user has permission to use the new policy:
kubectl-user auth can-i use podsecuritypolicy/example
no
Create the rolebinding to grant fake-user the use verb on the example policy:
kubectl-admin create role psp:unprivileged \
--verb=use \
--resource=podsecuritypolicy \
--resource-name=example
role "psp:unprivileged" created
kubectl-admin create rolebinding fake-user:psp:unprivileged \
--role=psp:unprivileged \
--serviceaccount=psp-example:fake-user
rolebinding "fake-user:psp:unprivileged" created
kubectl-user auth can-i use podsecuritypolicy/example
yes
Now retry creating the pod:
kubectl-user create -f- <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: pause
spec:
  containers:
  - name: pause
    image: k8s.gcr.io/pause
EOF
The output is similar to this:
pod "pause" created
It works as expected! But any attempts to create a privileged pod should still be denied:
kubectl-user create -f- <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: privileged
spec:
  containers:
  - name: pause
    image: k8s.gcr.io/pause
    securityContext:
      privileged: true
EOF
The output is similar to this:
Error from server (Forbidden): error when creating "STDIN": pods "privileged" is forbidden: unable to validate against any pod security policy: [spec.containers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed]
Delete the pod before moving on:
kubectl-user delete pod pause
Run another pod
Let's try that again, slightly differently:
kubectl-user create deployment pause --image=k8s.gcr.io/pause
deployment "pause" created
kubectl-user get pods
No resources found.
kubectl-user get events | head -n 2
LASTSEEN FIRSTSEEN COUNT NAME KIND SUBOBJECT TYPE REASON SOURCE MESSAGE
1m 2m 15 pause-7774d79b5 ReplicaSet Warning FailedCreate replicaset-controller Error creating: pods "pause-7774d79b5-" is forbidden: no providers available to validate pod request
What happened? We already bound the psp:unprivileged role for our fake-user, so why are we getting the error Error creating: pods "pause-7774d79b5-" is forbidden: no providers available to validate pod request? The answer lies in the source - replicaset-controller. Fake-user successfully created the deployment (which successfully created a replicaset), but when the replicaset went to create the pod, it was not authorized to use the example podsecuritypolicy.
In order to fix this, bind the psp:unprivileged role to the pod's service account instead. In this case (since we didn't specify it) the service account is default:
kubectl-admin create rolebinding default:psp:unprivileged \
--role=psp:unprivileged \
--serviceaccount=psp-example:default
rolebinding "default:psp:unprivileged" created
Now if you give it a minute to retry, the replicaset-controller should eventually succeed in creating the pod:
kubectl-user get pods --watch
NAME READY STATUS RESTARTS AGE
pause-7774d79b5-qrgcb 0/1 Pending 0 1s
pause-7774d79b5-qrgcb 0/1 Pending 0 1s
pause-7774d79b5-qrgcb 0/1 ContainerCreating 0 1s
pause-7774d79b5-qrgcb 1/1 Running 0 2s
Clean up
Delete the namespace to clean up most of the example resources:
kubectl-admin delete ns psp-example
namespace "psp-example" deleted
Note that PodSecurityPolicy resources are not namespaced, and must be cleaned up separately:
kubectl-admin delete psp example
podsecuritypolicy "example" deleted
Example Policies
This is the least restrictive policy you can create, equivalent to not using the pod security policy admission controller:
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: privileged
  annotations:
    seccomp.security.alpha.kubernetes.io/allowedProfileNames: '*'
spec:
  privileged: true
  allowPrivilegeEscalation: true
  allowedCapabilities:
  - '*'
  volumes:
  - '*'
  hostNetwork: true
  hostPorts:
  - min: 0
    max: 65535
  hostIPC: true
  hostPID: true
  runAsUser:
    rule: 'RunAsAny'
  seLinux:
    rule: 'RunAsAny'
  supplementalGroups:
    rule: 'RunAsAny'
  fsGroup:
    rule: 'RunAsAny'
This is an example of a restrictive policy that requires users to run as an unprivileged user, blocks possible escalations to root, and requires use of several security mechanisms.
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted
  annotations:
    seccomp.security.alpha.kubernetes.io/allowedProfileNames: 'docker/default,runtime/default'
    apparmor.security.beta.kubernetes.io/allowedProfileNames: 'runtime/default'
    apparmor.security.beta.kubernetes.io/defaultProfileName: 'runtime/default'
spec:
  privileged: false
  # Required to prevent escalations to root.
  allowPrivilegeEscalation: false
  requiredDropCapabilities:
  - ALL
  # Allow core volume types.
  volumes:
  - 'configMap'
  - 'emptyDir'
  - 'projected'
  - 'secret'
  - 'downwardAPI'
  # Assume that ephemeral CSI drivers & persistentVolumes set up by the cluster admin are safe to use.
  - 'csi'
  - 'persistentVolumeClaim'
  - 'ephemeral'
  hostNetwork: false
  hostIPC: false
  hostPID: false
  runAsUser:
    # Require the container to run without root privileges.
    rule: 'MustRunAsNonRoot'
  seLinux:
    # This policy assumes the nodes are using AppArmor rather than SELinux.
    rule: 'RunAsAny'
  supplementalGroups:
    rule: 'MustRunAs'
    ranges:
    # Forbid adding the root group.
    - min: 1
      max: 65535
  fsGroup:
    rule: 'MustRunAs'
    ranges:
    # Forbid adding the root group.
    - min: 1
      max: 65535
  readOnlyRootFilesystem: false
See Pod Security Standards for more examples.
Policy Reference
Privileged
Privileged - determines if any container in a pod can enable privileged mode. By default a container is not allowed to access any devices on the host, but a "privileged" container is given access to all devices on the host. This allows the container nearly all the same access as processes running on the host. This is useful for containers that want to use Linux capabilities like manipulating the network stack and accessing devices.
Host namespaces
HostPID - Controls whether the pod containers can share the host process ID namespace. Note that when paired with ptrace this can be used to escalate privileges outside of the container (ptrace is forbidden by default).
HostIPC - Controls whether the pod containers can share the host IPC namespace.
HostNetwork - Controls whether the pod may use the node network namespace. Doing so gives the pod access to the loopback device, services listening on localhost, and could be used to snoop on network activity of other pods on the same node.
HostPorts - Provides a list of ranges of allowable ports in the host network namespace. Defined as a list of HostPortRange, with min (inclusive) and max (inclusive). Defaults to no allowed host ports.
Volumes and file systems
Volumes - Provides a list of allowed volume types. The allowable values correspond to the volume sources that are defined when creating a volume. For the complete list of volume types, see Types of Volumes. Additionally, * may be used to allow all volume types.
The recommended minimum set of allowed volumes for new PSPs are:
- configMap
- downwardAPI
- emptyDir
- persistentVolumeClaim
- secret
- projected
PodSecurityPolicy does not limit the types of PersistentVolume objects that may be referenced by a PersistentVolumeClaim, and hostPath type PersistentVolumes do not support read-only access mode. Only trusted users should be granted permission to create PersistentVolume objects.
FSGroup - Controls the supplemental group applied to some volumes.
- MustRunAs - Requires at least one range to be specified. Uses the minimum value of the first range as the default. Validates against all ranges.
- MayRunAs - Requires at least one range to be specified. Allows FSGroups to be left unset without providing a default. Validates against all ranges if FSGroups is set.
- RunAsAny - No default provided. Allows any fsGroup ID to be specified.
AllowedHostPaths - This specifies a list of host paths that are allowed to be used by hostPath volumes. An empty list means there is no restriction on host paths used. This is defined as a list of objects with a single pathPrefix field, which allows hostPath volumes to mount a path that begins with an allowed prefix, and a readOnly field indicating it must be mounted read-only. For example:
allowedHostPaths:
  # This allows "/foo", "/foo/", "/foo/bar" etc., but
  # disallows "/fool", "/etc/foo" etc.
  # "/foo/../" is never valid.
  - pathPrefix: "/foo"
    readOnly: true # only allow read-only mounts
There are many ways a container with unrestricted access to the host filesystem can escalate privileges, including reading data from other containers, and abusing the credentials of system services, such as Kubelet.
Writeable hostPath directory volumes allow containers to write to the filesystem in ways that let them traverse the host filesystem outside the pathPrefix. readOnly: true, available in Kubernetes 1.11+, must be used on all allowedHostPaths to effectively limit access to the specified pathPrefix.
ReadOnlyRootFilesystem - Requires that containers must run with a read-only root filesystem (i.e. no writable layer).
FlexVolume drivers
This specifies a list of FlexVolume drivers that are allowed to be used by flexvolume. An empty list or nil means there is no restriction on the drivers. Please make sure the volumes field contains the flexVolume volume type; no FlexVolume driver is allowed otherwise.
For example:
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: allow-flex-volumes
spec:
  # ... other spec fields
  volumes:
  - flexVolume
  allowedFlexVolumes:
  - driver: example/lvm
  - driver: example/cifs
Users and groups
RunAsUser - Controls which user ID the containers are run with.
- MustRunAs - Requires at least one range to be specified. Uses the minimum value of the first range as the default. Validates against all ranges.
- MustRunAsNonRoot - Requires that the pod be submitted with a non-zero runAsUser or have the USER directive defined (using a numeric UID) in the image. Pods which have specified neither runAsNonRoot nor runAsUser settings will be mutated to set runAsNonRoot=true, thus requiring a defined non-zero numeric USER directive in the container. No default provided. Setting allowPrivilegeEscalation=false is strongly recommended with this strategy.
- RunAsAny - No default provided. Allows any runAsUser to be specified.
RunAsGroup - Controls which primary group ID the containers are run with.
- MustRunAs - Requires at least one range to be specified. Uses the minimum value of the first range as the default. Validates against all ranges.
- MayRunAs - Does not require that RunAsGroup be specified. However, when RunAsGroup is specified, it has to fall in the defined range.
- RunAsAny - No default provided. Allows any runAsGroup to be specified.
SupplementalGroups - Controls which group IDs containers add.
- MustRunAs - Requires at least one range to be specified. Uses the minimum value of the first range as the default. Validates against all ranges.
- MayRunAs - Requires at least one range to be specified. Allows supplementalGroups to be left unset without providing a default. Validates against all ranges if supplementalGroups is set.
- RunAsAny - No default provided. Allows any supplementalGroups to be specified.
Privilege Escalation
These options control the allowPrivilegeEscalation container option. This bool directly controls whether the no_new_privs flag gets set on the container process. This flag will prevent setuid binaries from changing the effective user ID, and prevent files from enabling extra capabilities (e.g. it will prevent the use of the ping tool). This behavior is required to effectively enforce MustRunAsNonRoot.
AllowPrivilegeEscalation - Gates whether or not a user is allowed to set the security context of a container to allowPrivilegeEscalation=true. This defaults to allowed so as to not break setuid binaries. Setting it to false ensures that no child process of a container can gain more privileges than its parent.
DefaultAllowPrivilegeEscalation - Sets the default for the allowPrivilegeEscalation option. The default behavior without this is to allow privilege escalation so as to not break setuid binaries. If that behavior is not desired, this field can be used to default to disallow, while still permitting pods to request allowPrivilegeEscalation explicitly.
Capabilities
Linux capabilities provide a finer grained breakdown of the privileges traditionally associated with the superuser. Some of these capabilities can be used to escalate privileges or for container breakout, and may be restricted by the PodSecurityPolicy. For more details on Linux capabilities, see capabilities(7).
The following fields take a list of capabilities, specified as the capability name in ALL_CAPS without the CAP_ prefix.
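For instance, a fragment of a PodSecurityPolicy spec using this naming might look like the following sketch (the capability choices are illustrative):
spec:
  allowedCapabilities:
  - NET_ADMIN           # may be added by containers, beyond the default set
  requiredDropCapabilities:
  - SYS_ADMIN           # always removed; must not appear in the other two lists
  defaultAddCapabilities:
  - NET_BIND_SERVICE    # added to every container by default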
AllowedCapabilities - Provides a list of capabilities that are allowed to be added to a container. The default set of capabilities are implicitly allowed. The empty set means that no additional capabilities may be added beyond the default set. * can be used to allow all capabilities.
RequiredDropCapabilities - The capabilities which must be dropped from containers. These capabilities are removed from the default set, and must not be added. Capabilities listed in RequiredDropCapabilities must not be included in AllowedCapabilities or DefaultAddCapabilities.
DefaultAddCapabilities - The capabilities which are added to containers by default, in addition to the runtime defaults. See the Docker documentation for the default list of capabilities when using the Docker runtime.
SELinux
- MustRunAs - Requires seLinuxOptions to be configured. Uses seLinuxOptions as the default. Validates against seLinuxOptions.
- RunAsAny - No default provided. Allows any seLinuxOptions to be specified.
AllowedProcMountTypes
allowedProcMountTypes is a list of allowed ProcMountTypes. Empty or nil indicates that only the DefaultProcMountType may be used.
DefaultProcMount uses the container runtime defaults for readonly and masked paths for /proc. Most container runtimes mask certain paths in /proc to avoid accidental security exposure of special devices or information. This is denoted as the string Default.
The only other ProcMountType is UnmaskedProcMount, which bypasses the default masking behavior of the container runtime and ensures that the newly created /proc for the container stays intact with no modifications. This is denoted as the string Unmasked.
AppArmor
Controlled via annotations on the PodSecurityPolicy. Refer to the AppArmor documentation.
Seccomp
As of Kubernetes v1.19, you can use the seccompProfile field in the securityContext of Pods or containers to control use of seccomp profiles. In prior versions, seccomp was controlled by adding annotations to a Pod. The same PodSecurityPolicies can be used with either version to enforce how these fields or annotations are applied.
seccomp.security.alpha.kubernetes.io/defaultProfileName - Annotation that specifies the default seccomp profile to apply to containers. Possible values are:
- unconfined - Seccomp is not applied to the container processes (this is the default in Kubernetes), if no alternative is provided.
- runtime/default - The default container runtime profile is used.
- docker/default - The Docker default seccomp profile is used. Deprecated as of Kubernetes 1.11. Use runtime/default instead.
- localhost/<path> - Specify a profile as a file on the node located at <seccomp_root>/<path>, where <seccomp_root> is defined via the --seccomp-profile-root flag on the Kubelet. If the --seccomp-profile-root flag is not defined, the default path will be used, which is <root-dir>/seccomp where <root-dir> is specified by the --root-dir flag.
  Note: The --seccomp-profile-root flag is deprecated since Kubernetes v1.19. Users are encouraged to use the default path.
seccomp.security.alpha.kubernetes.io/allowedProfileNames - Annotation that specifies which values are allowed for the pod seccomp annotations. Specified as a comma-delimited list of allowed values. Possible values are those listed above, plus * to allow all profiles. Absence of this annotation means that the default cannot be changed.
Sysctl
By default, all safe sysctls are allowed.
- forbiddenSysctls - excludes specific sysctls. You can forbid a combination of safe and unsafe sysctls in the list. To forbid setting any sysctls, use * on its own.
- allowedUnsafeSysctls - allows specific sysctls that had been disallowed by the default list, so long as these are not listed in forbiddenSysctls.
Refer to the Sysctl documentation.
What's next
-
See PodSecurityPolicy Deprecation: Past, Present, and Future to learn about the future of pod security policy.
-
See Pod Security Standards for policy recommendations.
-
Refer to PodSecurityPolicy reference for the API details.
3.9.4 - Process ID Limits And Reservations
Kubernetes v1.20 [stable]
Kubernetes allows you to limit the number of process IDs (PIDs) that a Pod can use. You can also reserve a number of allocatable PIDs for each node for use by the operating system and daemons (rather than by Pods).
Process IDs (PIDs) are a fundamental resource on nodes. It is trivial to hit the task limit without hitting any other resource limits, which can then cause instability to a host machine.
Cluster administrators require mechanisms to ensure that Pods running in the cluster cannot induce PID exhaustion that prevents host daemons (such as the kubelet or kube-proxy, and potentially also the container runtime) from running. In addition, it is important to ensure that PIDs are limited among Pods in order to ensure they have limited impact on other workloads on the same node.
Note: On certain Linux installations, the operating system sets the PIDs limit to a low default, such as 32768. Consider raising the value of /proc/sys/kernel/pid_max.
You can configure a kubelet to limit the number of PIDs a given Pod can consume. For example, if your node's host OS is set to use a maximum of 262144 PIDs and you expect it to host fewer than 250 Pods, you can give each Pod a budget of 1000 PIDs to prevent using up that node's overall number of available PIDs. If the admin wants to overcommit PIDs similar to CPU or memory, they may do so as well, with some additional risks. Either way, a single Pod will not be able to bring the whole machine down. This kind of resource limiting helps to prevent simple fork bombs from affecting operation of an entire cluster.
Per-Pod PID limiting allows administrators to protect one Pod from another, but does not ensure that all Pods scheduled onto that host are unable to impact the node overall. Per-Pod limiting also does not protect the node agents themselves from PID exhaustion.
You can also reserve an amount of PIDs for node overhead, separate from the allocation to Pods. This is similar to how you can reserve CPU, memory, or other resources for use by the operating system and other facilities outside of Pods and their containers.
PID limiting is an important sibling to compute resource requests and limits. However, you specify it in a different way: rather than defining a Pod's resource limit in the .spec for a Pod, you configure the limit as a setting on the kubelet. Pod-defined PID limits are not currently supported.
Node PID limits
Kubernetes allows you to reserve a number of process IDs for system use. To configure the reservation, use the parameter pid=<number> in the --system-reserved and --kube-reserved command line options to the kubelet. The value you specify declares that the specified number of process IDs will be reserved for the system as a whole and for Kubernetes system daemons respectively.
Note: Before Kubernetes version 1.20, PID resource limiting with node-level reservations required enabling the feature gate SupportNodePidsLimit to work.
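For example, a kubelet invocation reserving PIDs might include flags like this sketch (the numbers are illustrative):
kubelet --system-reserved=pid=1000 --kube-reserved=pid=1000 ...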
Pod PID limits
Kubernetes allows you to limit the number of processes running in a Pod. You
specify this limit at the node level, rather than configuring it as a resource
limit for a particular Pod. Each Node can have a different PID limit.
To configure the limit, you can specify the command line parameter --pod-max-pids to the kubelet, or set PodPidsLimit in the kubelet configuration file.
Before Kubernetes version 1.20, PID resource limiting for Pods required enabling the feature gate SupportPodPidsLimit to work.
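For example, a kubelet configuration file entry that caps every Pod on the node at 1024 PIDs (an illustrative value) could look like:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
podPidsLimit: 1024   # maximum number of PIDs any single Pod on this node may use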
PID based eviction
You can configure the kubelet to start terminating a Pod when it is misbehaving and consuming an abnormal amount of resources. This feature is called eviction. You can Configure Out of Resource Handling for various eviction signals.
Use the pid.available eviction signal to configure the threshold for the number of PIDs used by a Pod. You can set soft and hard eviction policies. However, even with the hard eviction policy, if the number of PIDs is growing very fast, the node can still get into an unstable state by hitting the node PIDs limit. The eviction signal value is calculated periodically and does NOT enforce the limit.
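For instance, a kubelet configuration could declare soft and hard PID eviction thresholds like the following sketch (the percentages and grace period are illustrative):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionSoft:
  pid.available: "15%"        # soft threshold: evict after the grace period below
evictionSoftGracePeriod:
  pid.available: "2m"
evictionHard:
  pid.available: "10%"        # hard threshold: evict immediately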
PID limiting - per Pod and per Node - sets the hard limit. Once the limit is hit, the workload will start experiencing failures when trying to get a new PID. It may or may not lead to rescheduling of a Pod, depending on how the workload reacts to these failures and how liveness and readiness probes are configured for the Pod. However, if limits were set correctly, you can guarantee that other Pods' workloads and system processes will not run out of PIDs when one Pod is misbehaving.
What's next
- Refer to the PID Limiting enhancement document for more information.
- For historical context, read Process ID Limiting for Stability Improvements in Kubernetes 1.14.
- Read Managing Resources for Containers.
- Learn how to Configure Out of Resource Handling.
3.9.5 - Node Resource Managers
In order to support latency-critical and high-throughput workloads, Kubernetes offers a suite of Resource Managers. The managers aim to coordinate and optimize the alignment of a node's resources for Pods configured with specific requirements for CPUs, devices, and memory (hugepages) resources.
The main manager, the Topology Manager, is a kubelet component that coordinates the overall resource management process through its policy.
The configuration of individual managers is elaborated in dedicated documents:
3.10 - Scheduling, Preemption and Eviction
In Kubernetes, scheduling refers to making sure that Pods are matched to Nodes so that the kubelet can run them. Preemption is the process of terminating Pods with lower Priority so that Pods with higher Priority can schedule on Nodes. Eviction is the process of terminating one or more Pods on Nodes.
Scheduling
- Kubernetes Scheduler
- Assigning Pods to Nodes
- Pod Overhead
- Taints and Tolerations
- Scheduling Framework
- Scheduler Performance Tuning
- Resource Bin Packing for Extended Resources
Pod Disruption
Pod disruption is the process by which Pods on Nodes are terminated either voluntarily or involuntarily.
Voluntary disruptions are started intentionally by application owners or cluster administrators. Involuntary disruptions are unintentional and can be triggered by unavoidable issues like Nodes running out of resources, or by accidental deletions.
3.10.1 - Kubernetes Scheduler
In Kubernetes, scheduling refers to making sure that Pods are matched to Nodes so that Kubelet can run them.
Scheduling overview
A scheduler watches for newly created Pods that have no Node assigned. For every Pod that the scheduler discovers, the scheduler becomes responsible for finding the best Node for that Pod to run on. The scheduler reaches this placement decision taking into account the scheduling principles described below.
If you want to understand why Pods are placed onto a particular Node, or if you're planning to implement a custom scheduler yourself, this page will help you learn about scheduling.
kube-scheduler
kube-scheduler is the default scheduler for Kubernetes and runs as part of the control plane. kube-scheduler is designed so that, if you want and need to, you can write your own scheduling component and use that instead.
For every newly created Pod or other unscheduled Pods, kube-scheduler selects an optimal node for them to run on. However, every container in a Pod has different resource requirements, and every Pod also has different requirements. Therefore, existing nodes need to be filtered according to the specific scheduling requirements.
In a cluster, Nodes that meet the scheduling requirements for a Pod are called feasible nodes. If none of the nodes are suitable, the pod remains unscheduled until the scheduler is able to place it.
The scheduler finds feasible Nodes for a Pod and then runs a set of functions to score the feasible Nodes and picks a Node with the highest score among the feasible ones to run the Pod. The scheduler then notifies the API server about this decision in a process called binding.
Factors that need to be taken into account for scheduling decisions include individual and collective resource requirements, hardware / software / policy constraints, affinity and anti-affinity specifications, data locality, inter-workload interference, and so on.
Node selection in kube-scheduler
kube-scheduler selects a node for the pod in a 2-step operation:
- Filtering
- Scoring
The filtering step finds the set of Nodes where it's feasible to schedule the Pod. For example, the PodFitsResources filter checks whether a candidate Node has enough available resource to meet a Pod's specific resource requests. After this step, the node list contains any suitable Nodes; often, there will be more than one. If the list is empty, that Pod isn't (yet) schedulable.
In the scoring step, the scheduler ranks the remaining nodes to choose the most suitable Pod placement. The scheduler assigns a score to each Node that survived filtering, basing this score on the active scoring rules.
Finally, kube-scheduler assigns the Pod to the Node with the highest ranking. If there is more than one node with equal scores, kube-scheduler selects one of these at random.
There are two supported ways to configure the filtering and scoring behavior of the scheduler:
- Scheduling Policies allow you to configure Predicates for filtering and Priorities for scoring.
- Scheduling Profiles allow you to configure Plugins that implement different scheduling stages, including: QueueSort, Filter, Score, Bind, Reserve, Permit, and others. You can also configure the kube-scheduler to run different profiles (see the sketch after this list).
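As a minimal sketch (the disabled plugin is an arbitrary choice for illustration), a scheduler configuration that turns off one scoring plugin in the default profile could look like:
apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    plugins:
      score:
        disabled:
          - name: NodeResourcesBalancedAllocation   # skip this scoring stage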
What's next
- Read about scheduler performance tuning
- Read about Pod topology spread constraints
- Read the reference documentation for kube-scheduler
- Read the kube-scheduler config (v1beta3) reference
- Learn about configuring multiple schedulers
- Learn about topology management policies
- Learn about Pod Overhead
- Learn about scheduling of Pods that use volumes in:
3.10.2 - Assigning Pods to Nodes
You can constrain a Pod so that it can only run on a particular set of nodes. There are several ways to do this and the recommended approaches all use label selectors to facilitate the selection. Generally such constraints are unnecessary, as the scheduler will automatically do a reasonable placement (for example, spreading your Pods across nodes so as not to place Pods on a node with insufficient free resources). However, there are some circumstances where you may want to control which node the Pod deploys to, for example, to ensure that a Pod ends up on a node with an SSD attached to it, or to co-locate Pods from two different services that communicate a lot into the same availability zone.
You can use any of the following methods to choose where Kubernetes schedules specific Pods:
- nodeSelector field matching against node labels
- Affinity and anti-affinity
- nodeName field
Node labels
Like many other Kubernetes objects, nodes have labels. You can attach labels manually. Kubernetes also populates a standard set of labels on all nodes in a cluster. See Well-Known Labels, Annotations and Taints for a list of common node labels.
The value of these labels is cloud provider specific and is not guaranteed to be reliable. For example, the value of kubernetes.io/hostname may be the same as the node name in some environments and a different value in other environments.
Node isolation/restriction
Adding labels to nodes allows you to target Pods for scheduling on specific nodes or groups of nodes. You can use this functionality to ensure that specific Pods only run on nodes with certain isolation, security, or regulatory properties.
If you use labels for node isolation, choose label keys that the kubelet cannot modify. This prevents a compromised node from setting those labels on itself so that the scheduler schedules workloads onto the compromised node.
The NodeRestriction
admission plugin
prevents the kubelet from setting or modifying labels with a
node-restriction.kubernetes.io/
prefix.
To make use of that label prefix for node isolation:
- Ensure you are using the Node authorizer and have enabled the NodeRestriction admission plugin.
- Add labels with the node-restriction.kubernetes.io/ prefix to your nodes, and use those labels in your node selectors (see the example command below). For example, example.com.node-restriction.kubernetes.io/fips=true or example.com.node-restriction.kubernetes.io/pci-dss=true.
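For example, you could apply such a label manually (the node name node-1 is illustrative):
kubectl label nodes node-1 example.com.node-restriction.kubernetes.io/fips=true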
nodeSelector
nodeSelector
is the simplest recommended form of node selection constraint.
You can add the nodeSelector
field to your Pod specification and specify the
node labels you want the target node to have.
Kubernetes only schedules the Pod onto nodes that have each of the labels you
specify.
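As a brief sketch (the disktype=ssd label is illustrative), a Pod that must land on nodes carrying that label looks like:
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
  nodeSelector:
    disktype: ssd   # only nodes with this label are eligible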
See Assign Pods to Nodes for more information.
Affinity and anti-affinity
nodeSelector
is the simplest way to constrain Pods to nodes with specific
labels. Affinity and anti-affinity expands the types of constraints you can
define. Some of the benefits of affinity and anti-affinity include:
- The affinity/anti-affinity language is more expressive. nodeSelector only selects nodes with all the specified labels. Affinity/anti-affinity gives you more control over the selection logic.
- You can indicate that a rule is soft or preferred, so that the scheduler still schedules the Pod even if it can't find a matching node.
- You can constrain a Pod using labels on other Pods running on the node (or other topological domain), instead of just node labels, which allows you to define rules for which Pods can be co-located on a node.
The affinity feature consists of two types of affinity:
- Node affinity functions like the nodeSelector field but is more expressive and allows you to specify soft rules.
- Inter-pod affinity/anti-affinity allows you to constrain Pods against labels on other Pods.
Node affinity
Node affinity is conceptually similar to nodeSelector
, allowing you to constrain which nodes your
Pod can be scheduled on based on node labels. There are two types of node
affinity:
- requiredDuringSchedulingIgnoredDuringExecution: The scheduler can't schedule the Pod unless the rule is met. This functions like nodeSelector, but with a more expressive syntax.
- preferredDuringSchedulingIgnoredDuringExecution: The scheduler tries to find a node that meets the rule. If a matching node is not available, the scheduler still schedules the Pod.
IgnoredDuringExecution
means that if the node labels
change after Kubernetes schedules the Pod, the Pod continues to run.
You can specify node affinities using the .spec.affinity.nodeAffinity
field in
your Pod spec.
For example, consider the following Pod spec:
apiVersion: v1
kind: Pod
metadata:
name: with-node-affinity
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/os
operator: In
values:
- linux
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 1
preference:
matchExpressions:
- key: another-node-label-key
operator: In
values:
- another-node-label-value
containers:
- name: with-node-affinity
image: k8s.gcr.io/pause:2.0
In this example, the following rules apply:
- The node must have a label with the key kubernetes.io/os and the value linux.
- The node preferably has a label with the key another-node-label-key and the value another-node-label-value.
You can use the operator field to specify a logical operator for Kubernetes to use when interpreting the rules. You can use In, NotIn, Exists, DoesNotExist, Gt and Lt.
NotIn and DoesNotExist allow you to define node anti-affinity behavior. Alternatively, you can use node taints to repel Pods from specific nodes.
If you specify both nodeSelector
and nodeAffinity
, both must be satisfied
for the Pod to be scheduled onto a node.
If you specify multiple nodeSelectorTerms
associated with nodeAffinity
types, then the Pod can be scheduled onto a node if one of the specified nodeSelectorTerms
can be
satisfied.
If you specify multiple matchExpressions
associated with a single nodeSelectorTerms
,
then the Pod can be scheduled onto a node only if all the matchExpressions
are
satisfied.
See Assign Pods to Nodes using Node Affinity for more information.
Node affinity weight
You can specify a weight
between 1 and 100 for each instance of the
preferredDuringSchedulingIgnoredDuringExecution
affinity type. When the
scheduler finds nodes that meet all the other scheduling requirements of the Pod, the
scheduler iterates through every preferred rule that the node satisfies and adds the
value of the weight
for that expression to a sum.
The final sum is added to the score of other priority functions for the node. Nodes with the highest total score are prioritized when the scheduler makes a scheduling decision for the Pod.
For example, consider the following Pod spec:
apiVersion: v1
kind: Pod
metadata:
name: with-affinity-anti-affinity
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/os
operator: In
values:
- linux
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 1
preference:
matchExpressions:
- key: label-1
operator: In
values:
- key-1
- weight: 50
preference:
matchExpressions:
- key: label-2
operator: In
values:
- key-2
containers:
- name: with-node-affinity
image: k8s.gcr.io/pause:2.0
If there are two possible nodes that match the
requiredDuringSchedulingIgnoredDuringExecution
rule, one with the
label-1:key-1
label and another with the label-2:key-2
label, the scheduler
considers the weight
of each node and adds the weight to the other scores for
that node, and schedules the Pod onto the node with the highest final score.
If you want Kubernetes to successfully schedule the Pods in this example, you must have existing nodes with the kubernetes.io/os=linux label.
Node affinity per scheduling profile
Kubernetes v1.20 [beta]
When configuring multiple scheduling profiles, you can associate
a profile with a node affinity, which is useful if a profile only applies to a specific set of nodes.
To do so, add an addedAffinity
to the args
field of the NodeAffinity
plugin
in the scheduler configuration. For example:
apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
- schedulerName: foo-scheduler
pluginConfig:
- name: NodeAffinity
args:
addedAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: scheduler-profile
operator: In
values:
- foo
The addedAffinity
is applied to all Pods that set .spec.schedulerName
to foo-scheduler
, in addition to the
NodeAffinity specified in the PodSpec.
That is, in order to match the Pod, nodes need to satisfy addedAffinity
and
the Pod's .spec.NodeAffinity
.
Since the addedAffinity
is not visible to end users, its behavior might be
unexpected to them. Use node labels that have a clear correlation to the
scheduler profile name.
The DaemonSet controller, which creates Pods for DaemonSets, does not support scheduling profiles. When the DaemonSet controller creates Pods, the default Kubernetes scheduler places those Pods and honors any nodeAffinity rules in the DaemonSet controller.
Inter-pod affinity and anti-affinity
Inter-pod affinity and anti-affinity allow you to constrain which nodes your Pods can be scheduled on based on the labels of Pods already running on that node, instead of the node labels.
Inter-pod affinity and anti-affinity rules take the form "this Pod should (or, in the case of anti-affinity, should not) run in an X if that X is already running one or more Pods that meet rule Y", where X is a topology domain like node, rack, cloud provider zone or region, or similar and Y is the rule Kubernetes tries to satisfy.
You express these rules (Y) as label selectors with an optional associated list of namespaces. Pods are namespaced objects in Kubernetes, so Pod labels also implicitly have namespaces. Any label selectors for Pod labels should specify the namespaces in which Kubernetes should look for those labels.
You express the topology domain (X) using a topologyKey
, which is the key for
the node label that the system uses to denote the domain. For examples, see
Well-Known Labels, Annotations and Taints.
Pod anti-affinity requires nodes to be consistently labelled; in other words, every node in the cluster must have an appropriate label matching topologyKey. If some or all nodes are missing the specified topologyKey label, it can lead to unintended behavior.
Types of inter-pod affinity and anti-affinity
Similar to node affinity, there are two types of Pod affinity and anti-affinity:
- requiredDuringSchedulingIgnoredDuringExecution
- preferredDuringSchedulingIgnoredDuringExecution
For example, you could use
requiredDuringSchedulingIgnoredDuringExecution
affinity to tell the scheduler to
co-locate Pods of two services in the same cloud provider zone because they
communicate with each other a lot. Similarly, you could use
preferredDuringSchedulingIgnoredDuringExecution
anti-affinity to spread Pods
from a service across multiple cloud provider zones.
To use inter-pod affinity, use the affinity.podAffinity
field in the Pod spec.
For inter-pod anti-affinity, use the affinity.podAntiAffinity
field in the Pod
spec.
Pod affinity example
Consider the following Pod spec:
apiVersion: v1
kind: Pod
metadata:
name: with-pod-affinity
spec:
affinity:
podAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: security
operator: In
values:
- S1
topologyKey: topology.kubernetes.io/zone
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: security
operator: In
values:
- S2
topologyKey: topology.kubernetes.io/zone
containers:
- name: with-pod-affinity
image: k8s.gcr.io/pause:2.0
This example defines one Pod affinity rule and one Pod anti-affinity rule. The
Pod affinity rule uses the "hard"
requiredDuringSchedulingIgnoredDuringExecution
, while the anti-affinity rule
uses the "soft" preferredDuringSchedulingIgnoredDuringExecution
.
The affinity rule says that the scheduler can only schedule a Pod onto a node if
the node is in the same zone as one or more existing Pods with the label
security=S1
. More precisely, the scheduler must place the Pod on a node that has the
topology.kubernetes.io/zone=V
label, as long as there is at least one node in
that zone that currently has one or more Pods with the Pod label security=S1
.
The anti-affinity rule says that the scheduler should try to avoid scheduling
the Pod onto a node that is in the same zone as one or more Pods with the label
security=S2
. More precisely, the scheduler should try to avoid placing the Pod on a node that has the
topology.kubernetes.io/zone=R
label if there are other nodes in the
same zone currently running Pods with the Security=S2
Pod label.
See the design doc for many more examples of Pod affinity and anti-affinity.
You can use the In, NotIn, Exists and DoesNotExist values in the operator field for Pod affinity and anti-affinity.
In principle, the topologyKey
can be any allowed label key with the following
exceptions for performance and security reasons:
- For Pod affinity and anti-affinity, an empty topologyKey field is not allowed in both requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution.
- For requiredDuringSchedulingIgnoredDuringExecution Pod anti-affinity rules, the admission controller LimitPodHardAntiAffinityTopology limits topologyKey to kubernetes.io/hostname. You can modify or disable the admission controller if you want to allow custom topologies.
In addition to labelSelector
and topologyKey
, you can optionally specify a list
of namespaces which the labelSelector
should match against using the
namespaces
field at the same level as labelSelector
and topologyKey
.
If omitted or empty, namespaces
defaults to the namespace of the Pod where the
affinity/anti-affinity definition appears.
Namespace selector
Kubernetes v1.22 [beta]
You can also select matching namespaces using namespaceSelector
, which is a label query over the set of namespaces.
The affinity term is applied to namespaces selected by both namespaceSelector
and the namespaces
field.
Note that an empty namespaceSelector
({}) matches all namespaces, while a null or empty namespaces
list and
null namespaceSelector
matches the namespace of the Pod where the rule is defined.
This feature is beta and enabled by default. You can disable it via the feature gate PodAffinityNamespaceSelector in both kube-apiserver and kube-scheduler.
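As a short sketch (the app=web and team=frontend labels are illustrative), an affinity term that matches Pods across all namespaces labelled team=frontend could look like:
apiVersion: v1
kind: Pod
metadata:
  name: with-namespace-selector
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - web
        namespaceSelector:      # only look for matching Pods in namespaces labelled team=frontend
          matchLabels:
            team: frontend
        topologyKey: topology.kubernetes.io/zone
  containers:
  - name: with-namespace-selector
    image: k8s.gcr.io/pause:2.0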
More practical use-cases
Inter-pod affinity and anti-affinity can be even more useful when they are used with higher level collections such as ReplicaSets, StatefulSets, Deployments, etc. These rules allow you to configure that a set of workloads should be co-located in the same defined topology, for example, the same node.
Take, for example, a three-node cluster running a web application with an in-memory cache like redis. You could use inter-pod affinity and anti-affinity to co-locate the web servers with the cache as much as possible.
In the following example Deployment for the redis cache, the replicas get the label app=store
. The
podAntiAffinity
rule tells the scheduler to avoid placing multiple replicas
with the app=store
label on a single node. This places each cache on a separate node.
apiVersion: apps/v1
kind: Deployment
metadata:
name: redis-cache
spec:
selector:
matchLabels:
app: store
replicas: 3
template:
metadata:
labels:
app: store
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- store
topologyKey: "kubernetes.io/hostname"
containers:
- name: redis-server
image: redis:3.2-alpine
The following Deployment for the web servers creates replicas with the label app=web-store
. The
Pod affinity rule tells the scheduler to place each replica on a node that has a
Pod with the label app=store
. The Pod anti-affinity rule tells the scheduler
to avoid placing multiple app=web-store
servers on a single node.
apiVersion: apps/v1
kind: Deployment
metadata:
name: web-server
spec:
selector:
matchLabels:
app: web-store
replicas: 3
template:
metadata:
labels:
app: web-store
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- web-store
topologyKey: "kubernetes.io/hostname"
podAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- store
topologyKey: "kubernetes.io/hostname"
containers:
- name: web-app
image: nginx:1.16-alpine
Creating the two preceding Deployments results in the following cluster layout, where each web server is co-located with a cache, on three separate nodes.
| node-1 | node-2 | node-3 |
| --- | --- | --- |
| webserver-1 | webserver-2 | webserver-3 |
| cache-1 | cache-2 | cache-3 |
See the ZooKeeper tutorial for an example of a StatefulSet configured with anti-affinity for high availability, using the same technique as this example.
nodeName
nodeName
is a more direct form of node selection than affinity or
nodeSelector
. nodeName
is a field in the Pod spec. If the nodeName
field
is not empty, the scheduler ignores the Pod and the kubelet on the named node
tries to place the Pod on that node. Using nodeName
overrules using
nodeSelector
or affinity and anti-affinity rules.
Some of the limitations of using nodeName
to select nodes are:
- If the named node does not exist, the Pod will not run, and in some cases may be automatically deleted.
- If the named node does not have the resources to accommodate the Pod, the Pod will fail and its reason will indicate why, for example OutOfmemory or OutOfcpu.
- Node names in cloud environments are not always predictable or stable.
Here is an example of a Pod spec using the nodeName
field:
apiVersion: v1
kind: Pod
metadata:
name: nginx
spec:
containers:
- name: nginx
image: nginx
nodeName: kube-01
The above Pod will only run on the node kube-01
.
What's next
- Read more about taints and tolerations .
- Read the design docs for node affinity and for inter-pod affinity/anti-affinity.
- Learn about how the topology manager takes part in node-level resource allocation decisions.
- Learn how to use nodeSelector.
- Learn how to use affinity and anti-affinity.
3.10.3 - Pod Overhead
Kubernetes v1.18 [beta]
When you run a Pod on a Node, the Pod itself takes an amount of system resources. These resources are additional to the resources needed to run the container(s) inside the Pod. Pod Overhead is a feature for accounting for the resources consumed by the Pod infrastructure on top of the container requests & limits.
In Kubernetes, the Pod's overhead is set at admission time according to the overhead associated with the Pod's RuntimeClass.
When Pod Overhead is enabled, the overhead is considered in addition to the sum of container resource requests when scheduling a Pod. Similarly, the kubelet will include the Pod overhead when sizing the Pod cgroup, and when carrying out Pod eviction ranking.
Enabling Pod Overhead
You need to make sure that the PodOverhead feature gate is enabled (it is on by default as of 1.18) across your cluster, and that a RuntimeClass which defines the overhead field is utilized.
Usage example
To use the PodOverhead feature, you need a RuntimeClass that defines the overhead
field. As
an example, you could use the following RuntimeClass definition with a virtualizing container runtime
that uses around 120MiB per Pod for the virtual machine and the guest OS:
---
kind: RuntimeClass
apiVersion: node.k8s.io/v1
metadata:
name: kata-fc
handler: kata-fc
overhead:
podFixed:
memory: "120Mi"
cpu: "250m"
Workloads that are created and specify the kata-fc RuntimeClass handler will take the memory and CPU overheads into account for resource quota calculations, node scheduling, as well as Pod cgroup sizing.
Consider running the given example workload, test-pod:
apiVersion: v1
kind: Pod
metadata:
name: test-pod
spec:
runtimeClassName: kata-fc
containers:
- name: busybox-ctr
image: busybox:1.28
stdin: true
tty: true
resources:
limits:
cpu: 500m
memory: 100Mi
- name: nginx-ctr
image: nginx
resources:
limits:
cpu: 1500m
memory: 100Mi
At admission time the RuntimeClass admission controller
updates the workload's PodSpec to include the overhead
as described in the RuntimeClass. If the PodSpec already has this field defined,
the Pod will be rejected. In the given example, since only the RuntimeClass name is specified, the admission controller mutates the Pod
to include an overhead
.
After the RuntimeClass admission controller, you can check the updated PodSpec:
kubectl get pod test-pod -o jsonpath='{.spec.overhead}'
The output is:
map[cpu:250m memory:120Mi]
If a ResourceQuota is defined, the sum of container requests as well as the
overhead
field are counted.
When the kube-scheduler is deciding which node should run a new Pod, the scheduler considers that Pod's
overhead
as well as the sum of container requests for that Pod. For this example, the scheduler adds the
requests and the overhead, then looks for a node that has 2.25 CPU and 320 MiB of memory available.
Once a Pod is scheduled to a node, the kubelet on that node creates a new cgroup for the Pod. It is within this cgroup that the underlying container runtime will create containers.
If the resource has a limit defined for each container (Guaranteed QoS or Burstable QoS with limits defined),
the kubelet will set an upper limit for the pod cgroup associated with that resource (cpu.cfs_quota_us for CPU
and memory.limit_in_bytes for memory). This upper limit is based on the sum of the container limits plus the overhead
defined in the PodSpec.
For CPU, if the Pod is Guaranteed or Burstable QoS, the kubelet will set cpu.shares
based on the sum of container
requests plus the overhead
defined in the PodSpec.
Looking at our example, verify the container requests for the workload:
kubectl get pod test-pod -o jsonpath='{.spec.containers[*].resources.limits}'
The total container requests are 2000m CPU and 200MiB of memory:
map[cpu: 500m memory:100Mi] map[cpu:1500m memory:100Mi]
Check this against what is observed by the node:
kubectl describe node | grep test-pod -B2
The output shows 2250m CPU and 320MiB of memory are requested, which includes PodOverhead:
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
default test-pod 2250m (56%) 2250m (56%) 320Mi (1%) 320Mi (1%) 36m
Verify Pod cgroup limits
Check the Pod's memory cgroups on the node where the workload is running. In the following example, crictl
is used on the node, which provides a CLI for CRI-compatible container runtimes. This is an
advanced example to show PodOverhead behavior, and it is not expected that users should need to check
cgroups directly on the node.
First, on the particular node, determine the Pod identifier:
# Run this on the node where the Pod is scheduled
POD_ID="$(sudo crictl pods --name test-pod -q)"
From this, you can determine the cgroup path for the Pod:
# Run this on the node where the Pod is scheduled
sudo crictl inspectp -o=json $POD_ID | grep cgroupsPath
The resulting cgroup path includes the Pod's pause
container. The Pod level cgroup is one directory above.
"cgroupsPath": "/kubepods/podd7f4b509-cf94-4951-9417-d1087c92a5b2/7ccf55aee35dd16aca4189c952d83487297f3cd760f1bbf09620e206e7d0c27a"
In this specific case, the pod cgroup path is kubepods/podd7f4b509-cf94-4951-9417-d1087c92a5b2
. Verify the Pod level cgroup setting for memory:
# Run this on the node where the Pod is scheduled.
# Also, change the name of the cgroup to match the cgroup allocated for your pod.
cat /sys/fs/cgroup/memory/kubepods/podd7f4b509-cf94-4951-9417-d1087c92a5b2/memory.limit_in_bytes
This is 320 MiB, as expected:
335544320
Observability
A kube_pod_overhead
metric is available in kube-state-metrics
to help identify when PodOverhead is being utilized and to help observe stability of workloads
running with a defined Overhead. This functionality is not available in the 1.9 release of
kube-state-metrics, but is expected in a following release. Users will need to build kube-state-metrics
from source in the meantime.
What's next
3.10.4 - Taints and Tolerations
Node affinity is a property of Pods that attracts them to a set of nodes (either as a preference or a hard requirement). Taints are the opposite -- they allow a node to repel a set of pods.
Tolerations are applied to pods, and allow (but do not require) the pods to schedule onto nodes with matching taints.
Taints and tolerations work together to ensure that pods are not scheduled onto inappropriate nodes. One or more taints are applied to a node; this marks that the node should not accept any pods that do not tolerate the taints.
Concepts
You add a taint to a node using kubectl taint. For example,
kubectl taint nodes node1 key1=value1:NoSchedule
places a taint on node node1
. The taint has key key1
, value value1
, and taint effect NoSchedule
.
This means that no pod will be able to schedule onto node1
unless it has a matching toleration.
To remove the taint added by the command above, you can run:
kubectl taint nodes node1 key1=value1:NoSchedule-
You specify a toleration for a pod in the PodSpec. Both of the following tolerations "match" the
taint created by the kubectl taint
line above, and thus a pod with either toleration would be able
to schedule onto node1
:
tolerations:
- key: "key1"
operator: "Equal"
value: "value1"
effect: "NoSchedule"
tolerations:
- key: "key1"
operator: "Exists"
effect: "NoSchedule"
Here's an example of a pod that uses tolerations:
apiVersion: v1
kind: Pod
metadata:
name: nginx
labels:
env: test
spec:
containers:
- name: nginx
image: nginx
imagePullPolicy: IfNotPresent
tolerations:
- key: "example-key"
operator: "Exists"
effect: "NoSchedule"
The default value for operator
is Equal
.
A toleration "matches" a taint if the keys are the same and the effects are the same, and:
- the operator is Exists (in which case no value should be specified), or
- the operator is Equal and the values are equal.
There are two special cases:
- An empty key with operator Exists matches all keys, values and effects, which means this will tolerate everything.
- An empty effect matches all effects with key key1.
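For example, this toleration (a minimal sketch of the first special case) matches every taint, so a Pod carrying it tolerates everything:
tolerations:
- operator: "Exists"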
The above example used effect
of NoSchedule
. Alternatively, you can use effect
of PreferNoSchedule
.
This is a "preference" or "soft" version of NoSchedule
-- the system will try to avoid placing a
pod that does not tolerate the taint on the node, but it is not required. The third kind of effect
is
NoExecute
, described later.
You can put multiple taints on the same node and multiple tolerations on the same pod. The way Kubernetes processes multiple taints and tolerations is like a filter: start with all of a node's taints, then ignore the ones for which the pod has a matching toleration; the remaining un-ignored taints have the indicated effects on the pod. In particular,
- if there is at least one un-ignored taint with effect NoSchedule then Kubernetes will not schedule the pod onto that node
- if there is no un-ignored taint with effect NoSchedule but there is at least one un-ignored taint with effect PreferNoSchedule then Kubernetes will try to not schedule the pod onto the node
- if there is at least one un-ignored taint with effect NoExecute then the pod will be evicted from the node (if it is already running on the node), and will not be scheduled onto the node (if it is not yet running on the node).
For example, imagine you taint a node like this
kubectl taint nodes node1 key1=value1:NoSchedule
kubectl taint nodes node1 key1=value1:NoExecute
kubectl taint nodes node1 key2=value2:NoSchedule
And a pod has two tolerations:
tolerations:
- key: "key1"
operator: "Equal"
value: "value1"
effect: "NoSchedule"
- key: "key1"
operator: "Equal"
value: "value1"
effect: "NoExecute"
In this case, the pod will not be able to schedule onto the node, because there is no toleration matching the third taint. But it will be able to continue running if it is already running on the node when the taint is added, because the third taint is the only one of the three that is not tolerated by the pod.
Normally, if a taint with effect NoExecute
is added to a node, then any pods that do
not tolerate the taint will be evicted immediately, and pods that do tolerate the
taint will never be evicted. However, a toleration with NoExecute
effect can specify
an optional tolerationSeconds
field that dictates how long the pod will stay bound
to the node after the taint is added. For example,
tolerations:
- key: "key1"
operator: "Equal"
value: "value1"
effect: "NoExecute"
tolerationSeconds: 3600
means that if this pod is running and a matching taint is added to the node, then the pod will stay bound to the node for 3600 seconds, and then be evicted. If the taint is removed before that time, the pod will not be evicted.
Example Use Cases
Taints and tolerations are a flexible way to steer pods away from nodes or evict pods that shouldn't be running. A few of the use cases are
- Dedicated Nodes: If you want to dedicate a set of nodes for exclusive use by a particular set of users, you can add a taint to those nodes (say, kubectl taint nodes nodename dedicated=groupName:NoSchedule) and then add a corresponding toleration to their pods (this would be done most easily by writing a custom admission controller). The pods with the tolerations will then be allowed to use the tainted (dedicated) nodes as well as any other nodes in the cluster. If you want to dedicate the nodes to them and ensure they only use the dedicated nodes, then you should additionally add a label similar to the taint to the same set of nodes (e.g. dedicated=groupName), and the admission controller should additionally add a node affinity to require that the pods can only schedule onto nodes labeled with dedicated=groupName.
- Nodes with Special Hardware: In a cluster where a small subset of nodes have specialized hardware (for example GPUs), it is desirable to keep pods that don't need the specialized hardware off of those nodes, thus leaving room for later-arriving pods that do need the specialized hardware. This can be done by tainting the nodes that have the specialized hardware (e.g. kubectl taint nodes nodename special=true:NoSchedule or kubectl taint nodes nodename special=true:PreferNoSchedule) and adding a corresponding toleration to pods that use the special hardware. As in the dedicated nodes use case, it is probably easiest to apply the tolerations using a custom admission controller. For example, it is recommended to use Extended Resources to represent the special hardware, taint your special hardware nodes with the extended resource name and run the ExtendedResourceToleration admission controller. Now, because the nodes are tainted, no pods without the toleration will schedule on them. But when you submit a pod that requests the extended resource, the ExtendedResourceToleration admission controller will automatically add the correct toleration to the pod and that pod will schedule on the special hardware nodes. This will make sure that these special hardware nodes are dedicated for pods requesting such hardware and you don't have to manually add tolerations to your pods.
- Taint based Evictions: A per-pod-configurable eviction behavior when there are node problems, which is described in the next section.
Taint based Evictions
Kubernetes v1.18 [stable]
The NoExecute taint effect, mentioned above, affects pods that are already running on the node as follows:
- pods that do not tolerate the taint are evicted immediately
- pods that tolerate the taint without specifying tolerationSeconds in their toleration specification remain bound forever
- pods that tolerate the taint with a specified tolerationSeconds remain bound for the specified amount of time
The node controller automatically taints a Node when certain conditions are true. The following taints are built in:
- node.kubernetes.io/not-ready: Node is not ready. This corresponds to the NodeCondition Ready being "False".
- node.kubernetes.io/unreachable: Node is unreachable from the node controller. This corresponds to the NodeCondition Ready being "Unknown".
- node.kubernetes.io/memory-pressure: Node has memory pressure.
- node.kubernetes.io/disk-pressure: Node has disk pressure.
- node.kubernetes.io/pid-pressure: Node has PID pressure.
- node.kubernetes.io/network-unavailable: Node's network is unavailable.
- node.kubernetes.io/unschedulable: Node is unschedulable.
- node.cloudprovider.kubernetes.io/uninitialized: When the kubelet is started with an "external" cloud provider, this taint is set on a node to mark it as unusable. After a controller from the cloud-controller-manager initializes this node, the kubelet removes this taint.
In case a node is to be evicted, the node controller or the kubelet adds relevant taints
with NoExecute
effect. If the fault condition returns to normal the kubelet or node
controller can remove the relevant taint(s).
You can specify tolerationSeconds
for a Pod to define how long that Pod stays bound
to a failing or unresponsive Node.
For example, you might want to keep an application with a lot of local state bound to node for a long time in the event of network partition, hoping that the partition will recover and thus the pod eviction can be avoided. The toleration you set for that Pod might look like:
tolerations:
- key: "node.kubernetes.io/unreachable"
operator: "Exists"
effect: "NoExecute"
tolerationSeconds: 6000
Kubernetes automatically adds a toleration for
node.kubernetes.io/not-ready
and node.kubernetes.io/unreachable
with tolerationSeconds=300
,
unless you, or a controller, set those tolerations explicitly.
These automatically-added tolerations mean that Pods remain bound to Nodes for 5 minutes after one of these problems is detected.
DaemonSet pods are created with NoExecute tolerations for the following taints with no tolerationSeconds:
- node.kubernetes.io/unreachable
- node.kubernetes.io/not-ready
This ensures that DaemonSet pods are never evicted due to these problems.
Taint Nodes by Condition
The control plane, using the node controller,
automatically creates taints with a NoSchedule
effect for node conditions.
The scheduler checks taints, not node conditions, when it makes scheduling
decisions. This ensures that node conditions don't directly affect scheduling.
For example, if the DiskPressure
node condition is active, the control plane
adds the node.kubernetes.io/disk-pressure
taint and does not schedule new pods
onto the affected node. If the MemoryPressure
node condition is active, the
control plane adds the node.kubernetes.io/memory-pressure
taint.
You can ignore node conditions for newly created pods by adding the corresponding
Pod tolerations. The control plane also adds the node.kubernetes.io/memory-pressure
toleration on pods that have a QoS class
other than BestEffort
. This is because Kubernetes treats pods in the Guaranteed
or Burstable
QoS classes (even pods with no memory request set) as if they are
able to cope with memory pressure, while new BestEffort
pods are not scheduled
onto the affected node.
The DaemonSet controller automatically adds the following NoSchedule tolerations to all daemons, to prevent DaemonSets from breaking:
- node.kubernetes.io/memory-pressure
- node.kubernetes.io/disk-pressure
- node.kubernetes.io/pid-pressure (1.14 or later)
- node.kubernetes.io/unschedulable (1.10 or later)
- node.kubernetes.io/network-unavailable (host network only)
Adding these tolerations ensures backward compatibility. You can also add arbitrary tolerations to DaemonSets.
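For instance, a DaemonSet's Pod template might carry an extra toleration like this sketch (the key and value are illustrative), so its Pods also run on nodes you have tainted for a dedicated purpose:
tolerations:
- key: "dedicated"
  operator: "Equal"
  value: "experimental"
  effect: "NoSchedule"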
What's next
- Read about Node-pressure Eviction and how you can configure it
- Read about Pod Priority
3.10.5 - Pod Priority and Preemption
Kubernetes v1.14 [stable]
Pods can have priority. Priority indicates the importance of a Pod relative to other Pods. If a Pod cannot be scheduled, the scheduler tries to preempt (evict) lower priority Pods to make scheduling of the pending Pod possible.
In a cluster where not all users are trusted, a malicious user could create Pods at the highest possible priorities, causing other Pods to be evicted/not get scheduled. An administrator can use ResourceQuota to prevent users from creating pods at high priorities.
See limit Priority Class consumption by default for details.
How to use priority and preemption
To use priority and preemption:
- Add one or more PriorityClasses.
- Create Pods with priorityClassName set to one of the added PriorityClasses. Of course you do not need to create the Pods directly; normally you would add priorityClassName to the Pod template of a collection object like a Deployment.
Keep reading for more information about these steps.
Kubernetes already ships with two PriorityClasses: system-cluster-critical and system-node-critical.
These are common classes and are used to ensure that critical components are always scheduled first.
PriorityClass
A PriorityClass is a non-namespaced object that defines a mapping from a
priority class name to the integer value of the priority. The name is specified
in the name
field of the PriorityClass object's metadata. The value is
specified in the required value
field. The higher the value, the higher the
priority.
The name of a PriorityClass object must be a valid
DNS subdomain name,
and it cannot be prefixed with system-
.
A PriorityClass object can have any 32-bit integer value smaller than or equal to 1 billion. Larger numbers are reserved for critical system Pods that should not normally be preempted or evicted. A cluster admin should create one PriorityClass object for each such mapping that they want.
PriorityClass also has two optional fields: globalDefault
and description
.
The globalDefault
field indicates that the value of this PriorityClass should
be used for Pods without a priorityClassName
. Only one PriorityClass with
globalDefault
set to true can exist in the system. If there is no
PriorityClass with globalDefault
set, the priority of Pods with no
priorityClassName
is zero.
The description
field is an arbitrary string. It is meant to tell users of the
cluster when they should use this PriorityClass.
Notes about PodPriority and existing clusters
- If you upgrade an existing cluster without this feature, the priority of your existing Pods is effectively zero.
- Addition of a PriorityClass with globalDefault set to true does not change the priorities of existing Pods. The value of such a PriorityClass is used only for Pods created after the PriorityClass is added.
- If you delete a PriorityClass, existing Pods that use the name of the deleted PriorityClass remain unchanged, but you cannot create more Pods that use the name of the deleted PriorityClass.
Example PriorityClass
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: high-priority
value: 1000000
globalDefault: false
description: "This priority class should be used for XYZ service pods only."
Non-preempting PriorityClass
Kubernetes v1.19 [beta]
Pods with preemptionPolicy: Never
will be placed in the scheduling queue
ahead of lower-priority pods,
but they cannot preempt other pods.
A non-preempting pod waiting to be scheduled will stay in the scheduling queue,
until sufficient resources are free,
and it can be scheduled.
Non-preempting pods,
like other pods,
are subject to scheduler back-off.
This means that if the scheduler tries these pods and they cannot be scheduled,
they will be retried with lower frequency,
allowing other pods with lower priority to be scheduled before them.
Non-preempting pods may still be preempted by other, high-priority pods.
preemptionPolicy
defaults to PreemptLowerPriority
,
which will allow pods of that PriorityClass to preempt lower-priority pods
(as is existing default behavior).
If preemptionPolicy
is set to Never
,
pods in that PriorityClass will be non-preempting.
An example use case is for data science workloads.
A user may submit a job that they want to be prioritized above other workloads,
but do not wish to discard existing work by preempting running pods.
The high priority job with preemptionPolicy: Never
will be scheduled
ahead of other queued pods,
as soon as sufficient cluster resources "naturally" become free.
Example Non-preempting PriorityClass
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: high-priority-nonpreempting
value: 1000000
preemptionPolicy: Never
globalDefault: false
description: "This priority class will not cause other pods to be preempted."
Pod priority
After you have one or more PriorityClasses, you can create Pods that specify one
of those PriorityClass names in their specifications. The priority admission
controller uses the priorityClassName
field and populates the integer value of
the priority. If the priority class is not found, the Pod is rejected.
The following YAML is an example of a Pod configuration that uses the PriorityClass created in the preceding example. The priority admission controller checks the specification and resolves the priority of the Pod to 1000000.
apiVersion: v1
kind: Pod
metadata:
name: nginx
labels:
env: test
spec:
containers:
- name: nginx
image: nginx
imagePullPolicy: IfNotPresent
priorityClassName: high-priority
Effect of Pod priority on scheduling order
When Pod priority is enabled, the scheduler orders pending Pods by their priority and a pending Pod is placed ahead of other pending Pods with lower priority in the scheduling queue. As a result, the higher priority Pod may be scheduled sooner than Pods with lower priority if its scheduling requirements are met. If such a Pod cannot be scheduled, the scheduler will continue and try to schedule other lower priority Pods.
Preemption
When Pods are created, they go to a queue and wait to be scheduled. The scheduler picks a Pod from the queue and tries to schedule it on a Node. If no Node is found that satisfies all the specified requirements of the Pod, preemption logic is triggered for the pending Pod. Let's call the pending Pod P. Preemption logic tries to find a Node where removal of one or more Pods with lower priority than P would enable P to be scheduled on that Node. If such a Node is found, one or more lower priority Pods get evicted from the Node. After the Pods are gone, P can be scheduled on the Node.
User exposed information
When Pod P preempts one or more Pods on Node N, nominatedNodeName
field of Pod
P's status is set to the name of Node N. This field helps scheduler track
resources reserved for Pod P and also gives users information about preemptions
in their clusters.
Please note that Pod P is not necessarily scheduled to the "nominated Node".
After victim Pods are preempted, they get their graceful termination period. If
another node becomes available while scheduler is waiting for the victim Pods to
terminate, scheduler will use the other node to schedule Pod P. As a result
nominatedNodeName
and nodeName
of Pod spec are not always the same. Also, if
scheduler preempts Pods on Node N, but then a higher priority Pod than Pod P
arrives, scheduler may give Node N to the new higher priority Pod. In such a
case, scheduler clears nominatedNodeName
of Pod P. By doing this, scheduler
makes Pod P eligible to preempt Pods on another Node.
Limitations of preemption
Graceful termination of preemption victims
When Pods are preempted, the victims get their graceful termination period. They have that much time to finish their work and exit. If they don't, they are killed. This graceful termination period creates a time gap between the point that the scheduler preempts Pods and the time when the pending Pod (P) can be scheduled on the Node (N). In the meantime, the scheduler keeps scheduling other pending Pods. As victims exit or get terminated, the scheduler tries to schedule Pods in the pending queue. Therefore, there is usually a time gap between the point that scheduler preempts victims and the time that Pod P is scheduled. In order to minimize this gap, one can set graceful termination period of lower priority Pods to zero or a small number.
PodDisruptionBudget is supported, but not guaranteed
A PodDisruptionBudget (PDB) allows application owners to limit the number of Pods of a replicated application that are down simultaneously from voluntary disruptions. Kubernetes supports PDB when preempting Pods, but respecting PDB is best effort. The scheduler tries to find victims whose PDB are not violated by preemption, but if no such victims are found, preemption will still happen, and lower priority Pods will be removed despite their PDBs being violated.
Inter-Pod affinity on lower-priority Pods
A Node is considered for preemption only when the answer to this question is yes: "If all the Pods with lower priority than the pending Pod are removed from the Node, can the pending Pod be scheduled on the Node?"
If a pending Pod has inter-pod affinity to one or more of the lower-priority Pods on the Node, the inter-Pod affinity rule cannot be satisfied in the absence of those lower-priority Pods. In this case, the scheduler does not preempt any Pods on the Node. Instead, it looks for another Node. The scheduler might find a suitable Node or it might not. There is no guarantee that the pending Pod can be scheduled.
Our recommended solution for this problem is to create inter-Pod affinity only towards equal or higher priority Pods.
Cross node preemption
Suppose a Node N is being considered for preemption so that a pending Pod P can be scheduled on N. P might become feasible on N only if a Pod on another Node is preempted. Here's an example:
- Pod P is being considered for Node N.
- Pod Q is running on another Node in the same Zone as Node N.
- Pod P has Zone-wide anti-affinity with Pod Q (
topologyKey: topology.kubernetes.io/zone
). - There are no other cases of anti-affinity between Pod P and other Pods in the Zone.
- In order to schedule Pod P on Node N, Pod Q can be preempted, but scheduler does not perform cross-node preemption. So, Pod P will be deemed unschedulable on Node N.
If Pod Q were removed from its Node, the Pod anti-affinity violation would be gone, and Pod P could possibly be scheduled on Node N.
We may consider adding cross Node preemption in future versions if there is enough demand and if we find an algorithm with reasonable performance.
Troubleshooting
Pod priority and preemption can have unwanted side effects. Here are some examples of potential problems and ways to deal with them.
Pods are preempted unnecessarily
Preemption removes existing Pods from a cluster under resource pressure to make
room for higher priority pending Pods. If you give high priorities to
certain Pods by mistake, these unintentionally high priority Pods may cause
preemption in your cluster. Pod priority is specified by setting the
priorityClassName
field in the Pod's specification. The integer value for
priority is then resolved and populated to the priority
field of podSpec
.
To address the problem, you can change the priorityClassName
for those Pods
to use lower priority classes, or leave that field empty. An empty
priorityClassName
is resolved to zero by default.
When a Pod is preempted, there will be events recorded for the preempted Pod. Preemption should happen only when a cluster does not have enough resources for a Pod. In such cases, preemption happens only when the priority of the pending Pod (preemptor) is higher than the victim Pods. Preemption must not happen when there is no pending Pod, or when the pending Pods have equal or lower priority than the victims. If preemption happens in such scenarios, please file an issue.
Pods are preempted, but the preemptor is not scheduled
When pods are preempted, they receive their requested graceful termination period, which is by default 30 seconds. If the victim Pods do not terminate within this period, they are forcibly terminated. Once all the victims go away, the preemptor Pod can be scheduled.
While the preemptor Pod is waiting for the victims to go away, a higher priority Pod may be created that fits on the same Node. In this case, the scheduler will schedule the higher priority Pod instead of the preemptor.
This is expected behavior: the Pod with the higher priority should take the place of a Pod with a lower priority.
Higher priority Pods are preempted before lower priority pods
The scheduler tries to find nodes that can run a pending Pod. If no node is found, the scheduler tries to remove Pods with lower priority from an arbitrary node in order to make room for the pending pod. If a node with low priority Pods is not feasible to run the pending Pod, the scheduler may choose another node with higher priority Pods (compared to the Pods on the other node) for preemption. The victims must still have lower priority than the preemptor Pod.
When there are multiple nodes available for preemption, the scheduler tries to choose the node with a set of Pods with lowest priority. However, if such Pods have PodDisruptionBudget that would be violated if they are preempted then the scheduler may choose another node with higher priority Pods.
When multiple nodes exist for preemption and none of the above scenarios apply, the scheduler chooses a node with the lowest priority.
Interactions between Pod priority and quality of service
Pod priority and QoS class
are two orthogonal features with few interactions and no default restrictions on
setting the priority of a Pod based on its QoS classes. The scheduler's
preemption logic does not consider QoS when choosing preemption targets.
Preemption considers Pod priority and attempts to choose a set of targets with
the lowest priority. Higher-priority Pods are considered for preemption only if
the removal of the lowest priority Pods is not sufficient to allow the scheduler
to schedule the preemptor Pod, or if the lowest priority Pods are protected by
PodDisruptionBudget
.
The kubelet uses Priority to determine pod order for node-pressure eviction. You can use the QoS class to estimate the order in which pods are most likely to get evicted. The kubelet ranks pods for eviction based on the following factors:
- Whether the starved resource usage exceeds requests
- Pod Priority
- Amount of resource usage relative to requests
See Pod selection for kubelet eviction for more details.
kubelet node-pressure eviction does not evict Pods when their usage does not exceed their requests. If a Pod with lower priority is not exceeding its requests, it won't be evicted. Another Pod with higher priority that exceeds its requests may be evicted.
What's next
- Read about using ResourceQuotas in connection with PriorityClasses: limit Priority Class consumption by default
- Learn about Pod Disruption
- Learn about API-initiated Eviction
- Learn about Node-pressure Eviction
3.10.6 - Node-pressure Eviction
Node-pressure eviction is the process by which the kubelet proactively terminates pods to reclaim resources on nodes.
The kubelet monitors resources like CPU, memory, disk space, and filesystem inodes on your cluster's nodes. When one or more of these resources reach specific consumption levels, the kubelet can proactively fail one or more pods on the node to reclaim resources and prevent starvation.
During a node-pressure eviction, the kubelet sets the `PodPhase` for the selected pods to `Failed`. This terminates the pods.

Node-pressure eviction is not the same as API-initiated eviction.

The kubelet does not respect your configured `PodDisruptionBudget` or the pod's `terminationGracePeriodSeconds`. If you use soft eviction thresholds, the kubelet respects your configured `eviction-max-pod-grace-period`. If you use hard eviction thresholds, it uses a `0s` grace period for termination.

If the pods are managed by a workload resource (such as StatefulSet or Deployment) that replaces failed pods, the control plane or `kube-controller-manager` creates new pods in place of the evicted pods.
The kubelet uses various parameters to make eviction decisions, like the following:
- Eviction signals
- Eviction thresholds
- Monitoring intervals
Eviction signals
Eviction signals are the current state of a particular resource at a specific point in time. Kubelet uses eviction signals to make eviction decisions by comparing the signals to eviction thresholds, which are the minimum amount of the resource that should be available on the node.
Kubelet uses the following eviction signals:
Eviction Signal | Description |
---|---|
`memory.available` | `memory.available := node.status.capacity[memory] - node.stats.memory.workingSet` |
`nodefs.available` | `nodefs.available := node.stats.fs.available` |
`nodefs.inodesFree` | `nodefs.inodesFree := node.stats.fs.inodesFree` |
`imagefs.available` | `imagefs.available := node.stats.runtime.imagefs.available` |
`imagefs.inodesFree` | `imagefs.inodesFree := node.stats.runtime.imagefs.inodesFree` |
`pid.available` | `pid.available := node.stats.rlimit.maxpid - node.stats.rlimit.curproc` |
In this table, the `Description` column shows how the kubelet gets the value of the signal. Each signal supports either a percentage or a literal value. The kubelet calculates the percentage value relative to the total capacity associated with the signal.

The value for `memory.available` is derived from the cgroupfs instead of tools like `free -m`. This is important because `free -m` does not work in a container, and if users use the node allocatable feature, out-of-resource decisions are made local to the end user Pod part of the cgroup hierarchy as well as the root node. This script reproduces the same set of steps that the kubelet performs to calculate `memory.available`. The kubelet excludes inactive_file (i.e. the number of bytes of file-backed memory on the inactive LRU list) from its calculation, as it assumes that memory is reclaimable under pressure.
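As an illustration, the following sketch mirrors those steps for the root cgroup on a cgroup v1 node; the paths are assumptions, and the script referenced above is authoritative:

```shell
#!/usr/bin/env bash
# Rough reconstruction of the kubelet's memory.available calculation.
memory_capacity_in_bytes=$(grep MemTotal /proc/meminfo | awk '{print $2 * 1024}')
memory_usage_in_bytes=$(cat /sys/fs/cgroup/memory/memory.usage_in_bytes)
total_inactive_file=$(grep total_inactive_file /sys/fs/cgroup/memory/memory.stat | awk '{print $2}')

# Working set excludes inactive file-backed pages, which are reclaimable.
memory_working_set=$((memory_usage_in_bytes - total_inactive_file))
memory_available_in_bytes=$((memory_capacity_in_bytes - memory_working_set))
echo "memory.available: ${memory_available_in_bytes} bytes"
```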
The kubelet supports the following filesystem partitions:
- `nodefs`: The node's main filesystem, used for local disk volumes, emptyDir, log storage, and more. For example, `nodefs` contains `/var/lib/kubelet/`.
- `imagefs`: An optional filesystem that container runtimes use to store container images and container writable layers.
Kubelet auto-discovers these filesystems and ignores other filesystems. Kubelet does not support other configurations.
Eviction thresholds
You can specify custom eviction thresholds for the kubelet to use when it makes eviction decisions.
Eviction thresholds have the form `[eviction-signal][operator][quantity]`, where:

- `eviction-signal` is the eviction signal to use.
- `operator` is the relational operator you want, such as `<` (less than).
- `quantity` is the eviction threshold amount, such as `1Gi`. The value of `quantity` must match the quantity representation used by Kubernetes. You can use either literal values or percentages (`%`).
For example, if a node has `10Gi` of total memory and you want to trigger eviction if the available memory falls below `1Gi`, you can define the eviction threshold as either `memory.available<10%` or `memory.available<1Gi`. You cannot use both.
You can configure soft and hard eviction thresholds.
Soft eviction thresholds
A soft eviction threshold pairs an eviction threshold with a required administrator-specified grace period. The kubelet does not evict pods until the grace period is exceeded. The kubelet returns an error on startup if there is no specified grace period.
You can specify both a soft eviction threshold grace period and a maximum allowed pod termination grace period for kubelet to use during evictions. If you specify a maximum allowed grace period and the soft eviction threshold is met, the kubelet uses the lesser of the two grace periods. If you do not specify a maximum allowed grace period, the kubelet kills evicted pods immediately without graceful termination.
You can use the following flags to configure soft eviction thresholds (a configuration file sketch follows the list):

- `eviction-soft`: A set of eviction thresholds like `memory.available<1.5Gi` that can trigger pod eviction if held over the specified grace period.
- `eviction-soft-grace-period`: A set of eviction grace periods like `memory.available=1m30s` that define how long a soft eviction threshold must hold before triggering a Pod eviction.
- `eviction-max-pod-grace-period`: The maximum allowed grace period (in seconds) to use when terminating pods in response to a soft eviction threshold being met.
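The same settings can also be expressed in a kubelet configuration file. The following is a minimal sketch, not a tested recommendation; the specific threshold and grace period values are illustrative assumptions:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Soft thresholds pair an eviction signal with a grace period.
evictionSoft:
  memory.available: "1.5Gi"
# How long the threshold must hold before eviction starts.
evictionSoftGracePeriod:
  memory.available: "1m30s"
# Cap (in seconds) on the pod termination grace period used for soft evictions.
evictionMaxPodGracePeriod: 60
```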
Hard eviction thresholds
A hard eviction threshold has no grace period. When a hard eviction threshold is met, the kubelet kills pods immediately without graceful termination to reclaim the starved resource.
You can use the `eviction-hard` flag to configure a set of hard eviction thresholds like `memory.available<1Gi`.
The kubelet has the following default hard eviction thresholds:
- `memory.available<100Mi`
- `nodefs.available<10%`
- `imagefs.available<15%`
- `nodefs.inodesFree<5%` (Linux nodes)
Eviction monitoring interval
The kubelet evaluates eviction thresholds based on its configured `housekeeping-interval`, which defaults to `10s`.
Node conditions
The kubelet reports node conditions to reflect that the node is under pressure because a hard or soft eviction threshold is met, independent of configured grace periods.
The kubelet maps eviction signals to node conditions as follows:
Node Condition | Eviction Signal | Description |
---|---|---|
`MemoryPressure` | `memory.available` | Available memory on the node has satisfied an eviction threshold |
`DiskPressure` | `nodefs.available`, `nodefs.inodesFree`, `imagefs.available`, or `imagefs.inodesFree` | Available disk space and inodes on either the node's root filesystem or image filesystem has satisfied an eviction threshold |
`PIDPressure` | `pid.available` | Available process identifiers on the (Linux) node have fallen below an eviction threshold |
The kubelet updates the node conditions based on the configured `--node-status-update-frequency`, which defaults to `10s`.
Node condition oscillation
In some cases, nodes oscillate above and below soft eviction thresholds without holding for the defined grace periods. This causes the reported node condition to constantly switch between `true` and `false`, leading to bad eviction decisions.

To protect against oscillation, you can use the `eviction-pressure-transition-period` flag, which controls how long the kubelet must wait before transitioning a node condition to a different state. The transition period has a default value of `5m`.
Reclaiming node level resources
The kubelet tries to reclaim node-level resources before it evicts end-user pods.
When a `DiskPressure` node condition is reported, the kubelet reclaims node-level resources based on the filesystems on the node.
With imagefs
If the node has a dedicated `imagefs` filesystem for container runtimes to use, the kubelet does the following:

- If the `nodefs` filesystem meets the eviction thresholds, the kubelet garbage collects dead pods and containers.
- If the `imagefs` filesystem meets the eviction thresholds, the kubelet deletes all unused images.
Without imagefs
If the node only has a `nodefs` filesystem that meets eviction thresholds, the kubelet frees up disk space in the following order:
- Garbage collect dead pods and containers
- Delete unused images
Pod selection for kubelet eviction
If the kubelet's attempts to reclaim node-level resources don't bring the eviction signal below the threshold, the kubelet begins to evict end-user pods.
The kubelet uses the following parameters to determine pod eviction order:
- Whether the pod's resource usage exceeds requests
- Pod Priority
- The pod's resource usage relative to requests
As a result, kubelet ranks and evicts pods in the following order:
1. `BestEffort` or `Burstable` pods where the usage exceeds requests. These pods are evicted based on their Priority and then by how much their usage level exceeds the request.
2. `Guaranteed` pods and `Burstable` pods where the usage is less than requests are evicted last, based on their Priority.

The kubelet does not use the pod's QoS class to determine the eviction order. QoS classes do not apply to ephemeral storage requests, so the above ordering does not apply if the node is, for example, under `DiskPressure`.
`Guaranteed` pods are guaranteed only when requests and limits are specified for all the containers and they are equal. These pods will never be evicted because of another pod's resource consumption. If a system daemon (such as `kubelet` or `journald`) is consuming more resources than were reserved via `system-reserved` or `kube-reserved` allocations, and the node only has `Guaranteed` or `Burstable` pods using fewer resources than their requests left on it, then the kubelet must choose to evict one of these pods to preserve node stability and to limit the impact of resource starvation on other pods. In this case, it will choose to evict pods of lowest Priority first.
When the kubelet evicts pods in response to inode or PID starvation, it uses the Priority to determine the eviction order, because inodes and PIDs have no requests.
The kubelet sorts pods differently based on whether the node has a dedicated `imagefs` filesystem:
With imagefs
If `nodefs` is triggering evictions, the kubelet sorts pods based on `nodefs` usage (local volumes + logs of all containers).

If `imagefs` is triggering evictions, the kubelet sorts pods based on the writable layer usage of all containers.
Without imagefs
If `nodefs` is triggering evictions, the kubelet sorts pods based on their total disk usage (local volumes + logs & writable layer of all containers).
Minimum eviction reclaim
In some cases, pod eviction only reclaims a small amount of the starved resource. This can lead to the kubelet repeatedly hitting the configured eviction thresholds and triggering multiple evictions.
You can use the `--eviction-minimum-reclaim` flag or a kubelet config file to configure a minimum reclaim amount for each resource. When the kubelet notices that a resource is starved, it continues to reclaim that resource until it reclaims the quantity you specify.
For example, the following configuration sets minimum reclaim amounts:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "500Mi"
  nodefs.available: "1Gi"
  imagefs.available: "100Gi"
evictionMinimumReclaim:
  memory.available: "0Mi"
  nodefs.available: "500Mi"
  imagefs.available: "2Gi"
In this example, if the `nodefs.available` signal meets the eviction threshold, the kubelet reclaims the resource until the signal reaches the threshold of `1Gi`, and then continues to reclaim the minimum amount of `500Mi`, until the signal reaches `1.5Gi`.

Similarly, the kubelet reclaims the `imagefs` resource until the `imagefs.available` signal reaches `102Gi`.
The default `eviction-minimum-reclaim` is `0` for all resources.
Node out of memory behavior
If the node experiences an out of memory (OOM) event prior to the kubelet being able to reclaim memory, the node depends on the oom_killer to respond.
The kubelet sets an `oom_score_adj` value for each container based on the QoS for the pod.
Quality of Service | oom_score_adj |
---|---|
`Guaranteed` | -997 |
`BestEffort` | 1000 |
`Burstable` | min(max(2, 1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes), 999) |
The kubelet also sets an `oom_score_adj` value of `-997` for containers in Pods that have `system-node-critical` Priority.
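As a worked example of the `Burstable` formula, consider a container whose Pod requests 4Gi of memory on a node with 32Gi of memory (illustrative numbers):

```
oom_score_adj = min(max(2, 1000 - (1000 * 4Gi) / 32Gi), 999)
              = min(max(2, 1000 - 125), 999)
              = 875
```

The larger the memory request relative to node capacity, the lower the `oom_score_adj`, and the less likely the container is to be OOM killed.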
If the kubelet can't reclaim memory before a node experiences OOM, the `oom_killer` calculates an `oom_score` based on the percentage of memory it's using on the node, and then adds the `oom_score_adj` to get an effective `oom_score` for each container. It then kills the container with the highest score.
This means that containers in low QoS pods that consume a large amount of memory relative to their scheduling requests are killed first.
Unlike pod eviction, if a container is OOM killed, the kubelet can restart it based on its `RestartPolicy`.
Best practices
The following sections describe best practices for eviction configuration.
Schedulable resources and eviction policies
When you configure the kubelet with an eviction policy, you should make sure that the scheduler does not schedule pods that would immediately induce memory pressure and trigger eviction.
Consider the following scenario:
- Node memory capacity: `10Gi`
- Operator wants to reserve 10% of memory capacity for system daemons (kernel, `kubelet`, etc.)
- Operator wants to evict Pods at 95% memory utilization to reduce incidence of system OOM.
For this to work, the kubelet is launched as follows:
--eviction-hard=memory.available<500Mi
--system-reserved=memory=1.5Gi
In this configuration, the `--system-reserved` flag reserves `1.5Gi` of memory for the system, which is 10% of the total memory + the eviction threshold amount.

The node can reach the eviction threshold if a pod is using more than its request, or if the system is using more than `1Gi` of memory, which makes the `memory.available` signal fall below `500Mi` and triggers the threshold.
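To make the arithmetic explicit:

```
node capacity             = 10Gi
--system-reserved         = 1.5Gi  (10% of capacity, 1Gi, plus the 0.5Gi eviction threshold)
allocatable for pods      = 10Gi - 1.5Gi = 8.5Gi
eviction triggers when      memory.available < 500Mi,
i.e. when total node usage exceeds 10Gi - 0.5Gi = 9.5Gi (95% utilization)
```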
DaemonSet
Pod Priority is a major factor in making eviction decisions. If you do not want the kubelet to evict pods that belong to a DaemonSet, give those pods a high enough priority by specifying a suitable `priorityClassName` in the pod spec. You can also use a lower priority, or the default, to only allow pods from that DaemonSet to run when there are enough resources.
Known issues
The following sections describe known issues related to out of resource handling.
kubelet may not observe memory pressure right away
By default, the kubelet polls cAdvisor to collect memory usage stats at a regular interval. If memory usage increases rapidly within that window, the kubelet may not observe `MemoryPressure` fast enough, and the OOM killer will still be invoked.
You can use the `--kernel-memcg-notification` flag to enable the `memcg` notification API on the kubelet to get notified immediately when a threshold is crossed.
If you are not trying to achieve extreme utilization, but a sensible measure of overcommit, a viable workaround for this issue is to use the `--kube-reserved` and `--system-reserved` flags to allocate memory for the system.
active_file memory is not considered as available memory
On Linux, the kernel tracks the number of bytes of file-backed memory on the active LRU list as the `active_file` statistic. The kubelet treats `active_file` memory areas as not reclaimable. For workloads that make intensive use of block-backed local storage, including ephemeral local storage, kernel-level caches of file and block data mean that many recently accessed cache pages are likely to be counted as `active_file`. If enough of these kernel block buffers are on the active LRU list, the kubelet is liable to observe this as high resource use and taint the node as experiencing memory pressure, triggering pod eviction.
For more details, see https://github.com/kubernetes/kubernetes/issues/43916
You can work around that behavior by setting the memory limit and memory request the same for containers likely to perform intensive I/O activity. You will need to estimate or measure an optimal memory limit value for that container.
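For example, a sketch of such a container spec; the name, image, and sizes are illustrative, not recommendations:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: io-heavy-app              # hypothetical name
spec:
  containers:
  - name: worker
    image: registry.example/io-heavy:latest   # hypothetical image
    resources:
      requests:
        memory: "2Gi"
      limits:
        memory: "2Gi"             # limit set equal to the request
```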
What's next
- Learn about API-initiated Eviction
- Learn about Pod Priority and Preemption
- Learn about PodDisruptionBudgets
- Learn about Quality of Service (QoS)
- Check out the Eviction API
3.10.7 - API-initiated Eviction
API-initiated eviction is the process by which you use the Eviction API to create an `Eviction` object that triggers graceful pod termination.
You can request eviction by calling the Eviction API directly, or programmatically using a client of the API server, like the `kubectl drain` command. This creates an `Eviction` object, which causes the API server to terminate the Pod.
API-initiated evictions respect your configured `PodDisruptionBudgets` and `terminationGracePeriodSeconds`.
Using the API to create an Eviction object for a Pod is like performing a policy-controlled `DELETE` operation on the Pod.
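For example, `kubectl drain` uses this API to evict every eligible Pod from a node; the node name here is hypothetical and the exact flags depend on your kubectl version:

```shell
# Evicts all eligible Pods from node-1. DaemonSet-managed Pods are skipped,
# and Pods using emptyDir volumes lose that local data.
kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data
```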
Calling the Eviction API
You can use a Kubernetes language client to access the Kubernetes API and create an `Eviction` object. To do this, you POST the attempted operation, similar to the following example:
`policy/v1` Eviction is available in v1.22+. Use `policy/v1beta1` with prior releases.

Using `policy/v1`:

{
  "apiVersion": "policy/v1",
  "kind": "Eviction",
  "metadata": {
    "name": "quux",
    "namespace": "default"
  }
}

Using `policy/v1beta1` (deprecated):

{
  "apiVersion": "policy/v1beta1",
  "kind": "Eviction",
  "metadata": {
    "name": "quux",
    "namespace": "default"
  }
}
Alternatively, you can attempt an eviction operation by accessing the API using `curl` or `wget`, similar to the following example:
curl -v -H 'Content-type: application/json' https://your-cluster-api-endpoint.example/api/v1/namespaces/default/pods/quux/eviction -d @eviction.json
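A rough `wget` equivalent, assuming the manifest above is saved as `eviction.json` (authentication and TLS options are omitted here, as in the `curl` example):

```shell
# --post-file sends the manifest body with a POST request
wget --header='Content-Type: application/json' \
  --post-file=eviction.json \
  https://your-cluster-api-endpoint.example/api/v1/namespaces/default/pods/quux/eviction
```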
How API-initiated eviction works
When you request an eviction using the API, the API server performs admission checks and responds in one of the following ways:
- `200 OK`: the eviction is allowed, the `Eviction` subresource is created, and the Pod is deleted, similar to sending a `DELETE` request to the Pod URL.
- `429 Too Many Requests`: the eviction is not currently allowed because of the configured PodDisruptionBudget. You may be able to attempt the eviction again later. You might also see this response because of API rate limiting.
- `500 Internal Server Error`: the eviction is not allowed because there is a misconfiguration, like if multiple PodDisruptionBudgets reference the same Pod.
If the Pod you want to evict isn't part of a workload that has a PodDisruptionBudget, the API server always returns `200 OK` and allows the eviction.
If the API server allows the eviction, the Pod is deleted as follows:
1. The `Pod` resource in the API server is updated with a deletion timestamp, after which the API server considers the `Pod` resource to be terminated. The `Pod` resource is also marked with the configured grace period.
2. The kubelet on the node where the local Pod is running notices that the `Pod` resource is marked for termination and starts to gracefully shut down the local Pod.
3. While the kubelet is shutting the Pod down, the control plane removes the Pod from Endpoint and EndpointSlice objects. As a result, controllers no longer consider the Pod as a valid object.
4. After the grace period for the Pod expires, the kubelet forcefully terminates the local Pod.
5. The kubelet tells the API server to remove the `Pod` resource.
6. The API server deletes the `Pod` resource.
Troubleshooting stuck evictions
In some cases, your applications may enter a broken state, where the Eviction API will only return `429` or `500` responses until you intervene. This can happen if, for example, a ReplicaSet creates pods for your application but new pods do not enter a `Ready` state. You may also notice this behavior in cases where the last evicted Pod had a long termination grace period.
If you notice stuck evictions, try one of the following solutions:
- Abort or pause the automated operation causing the issue. Investigate the stuck application before you restart the operation.
- Wait a while, then directly delete the Pod from your cluster control plane instead of using the Eviction API.
What's next
- Learn how to protect your applications with a Pod Disruption Budget.
- Learn about Node-pressure Eviction.
- Learn about Pod Priority and Preemption.
3.10.8 - Resource Bin Packing for Extended Resources
Kubernetes v1.16 [alpha]
The kube-scheduler can be configured to enable bin packing of resources along with extended resources using the `RequestedToCapacityRatioResourceAllocation` priority function. Priority functions can be used to fine-tune the kube-scheduler as per custom needs.
Enabling Bin Packing using RequestedToCapacityRatioResourceAllocation
Kubernetes allows users to specify resources, along with weights for each resource, to score nodes based on the request-to-capacity ratio. This allows users to bin pack extended resources by using appropriate parameters, improving the utilization of scarce resources in large clusters. The behavior of the `RequestedToCapacityRatioResourceAllocation` priority function is controlled by a configuration option called `RequestedToCapacityRatioArgs`. This argument consists of two parameters, `shape` and `resources`. The `shape` parameter tunes the function to behave as least requested or most requested, based on `utilization` and `score` values. The `resources` parameter consists of `name`, which specifies the resource to be considered during scoring, and `weight`, which specifies the weight of each resource.
Below is an example configuration that sets `requestedToCapacityRatioArguments` to bin packing behavior for extended resources `intel.com/foo` and `intel.com/bar`.
apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration
profiles:
# ...
- pluginConfig:
  - name: RequestedToCapacityRatio
    args:
      shape:
      - utilization: 0
        score: 10
      - utilization: 100
        score: 0
      resources:
      - name: intel.com/foo
        weight: 3
      - name: intel.com/bar
        weight: 5
Referencing the `KubeSchedulerConfiguration` file with the kube-scheduler flag `--config=/path/to/config/file` will pass the configuration to the scheduler.

This feature is disabled by default.
Tuning the Priority Function
`shape` is used to specify the behavior of the `RequestedToCapacityRatioPriority` function.
shape:
- utilization: 0
  score: 0
- utilization: 100
  score: 10
The above arguments give the node a `score` of 0 if `utilization` is 0% and 10 for `utilization` 100%, thus enabling bin packing behavior. To enable least requested, the score values must be reversed as follows:
shape:
- utilization: 0
  score: 10
- utilization: 100
  score: 0
`resources` is an optional parameter which defaults to:
resources:
- name: cpu
  weight: 1
- name: memory
  weight: 1
It can be used to add extended resources as follows:
resources:
- name: intel.com/foo
  weight: 5
- name: cpu
  weight: 3
- name: memory
  weight: 1
The `weight` parameter is optional and is set to 1 if not specified. Also, the `weight` cannot be set to a negative value.
Node scoring for capacity allocation
This section is intended for those who want to understand the internal details of this feature. Below is an example of how the node score is calculated for a given set of values.
Requested resources:

  intel.com/foo: 2
  memory: 256MB
  cpu: 2

Resource weights:

  intel.com/foo: 5
  memory: 1
  cpu: 3

FunctionShapePoint: {{0, 0}, {100, 10}}
Node 1 spec:

  Available:
    intel.com/foo: 4
    memory: 1 GB
    cpu: 8
  Used:
    intel.com/foo: 1
    memory: 256MB
    cpu: 1

Node score:

  intel.com/foo = resourceScoringFunction((2+1), 4)
                = (100 - ((4-3)*100/4))
                = (100 - 25)
                = 75                    # requested + used = 75% * available
                = rawScoringFunction(75)
                = 7                     # floor(75/10)

  memory        = resourceScoringFunction((256+256), 1024)
                = (100 - ((1024-512)*100/1024))
                = 50                    # requested + used = 50% * available
                = rawScoringFunction(50)
                = 5                     # floor(50/10)

  cpu           = resourceScoringFunction((2+1), 8)
                = (100 - ((8-3)*100/8))
                = 37.5                  # requested + used = 37.5% * available
                = rawScoringFunction(37.5)
                = 3                     # floor(37.5/10)

  NodeScore = ((7 * 5) + (5 * 1) + (3 * 3)) / (5 + 1 + 3)
            = 5
Node 2 spec:

  Available:
    intel.com/foo: 8
    memory: 1GB
    cpu: 8
  Used:
    intel.com/foo: 2
    memory: 512MB
    cpu: 6

Node score:

  intel.com/foo = resourceScoringFunction((2+2), 8)
                = (100 - ((8-4)*100/8))
                = (100 - 50)
                = 50
                = rawScoringFunction(50)
                = 5

  memory        = resourceScoringFunction((256+512), 1024)
                = (100 - ((1024-768)*100/1024))
                = 75
                = rawScoringFunction(75)
                = 7

  cpu           = resourceScoringFunction((2+6), 8)
                = (100 - ((8-8)*100/8))
                = 100
                = rawScoringFunction(100)
                = 10

  NodeScore = ((5 * 5) + (7 * 1) + (10 * 3)) / (5 + 1 + 3)
            = 7
What's next
- Read more about the scheduling framework
- Read more about scheduler configuration
3.10.9 - Scheduling Framework
Kubernetes v1.19 [stable]
The scheduling framework is a pluggable architecture for the Kubernetes scheduler. It adds a new set of "plugin" APIs to the existing scheduler. Plugins are compiled into the scheduler. The APIs allow most scheduling features to be implemented as plugins, while keeping the scheduling "core" lightweight and maintainable. Refer to the design proposal of the scheduling framework for more technical information on the design of the framework.
Framework workflow
The Scheduling Framework defines a few extension points. Scheduler plugins register to be invoked at one or more extension points. Some of these plugins can change the scheduling decisions and some are informational only.
Each attempt to schedule one Pod is split into two phases, the scheduling cycle and the binding cycle.
Scheduling Cycle & Binding Cycle
The scheduling cycle selects a node for the Pod, and the binding cycle applies that decision to the cluster. Together, a scheduling cycle and binding cycle are referred to as a "scheduling context".
Scheduling cycles are run serially, while binding cycles may run concurrently.
A scheduling or binding cycle can be aborted if the Pod is determined to be unschedulable or if there is an internal error. The Pod will be returned to the queue and retried.
Extension points
The following picture shows the scheduling context of a Pod and the extension points that the scheduling framework exposes. In this picture "Filter" is equivalent to "Predicate" and "Scoring" is equivalent to "Priority function".
One plugin may register at multiple extension points to perform more complex or stateful tasks.
QueueSort
These plugins are used to sort Pods in the scheduling queue. A queue sort plugin essentially provides a `Less(Pod1, Pod2)` function. Only one queue sort plugin may be enabled at a time.
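As an illustration, a priority-based queue sort could implement `Less` along these lines; `podPriority` is a local helper for this sketch, not a Kubernetes API:

```go
// Less orders Pods by descending priority, so higher-priority Pods are
// scheduled first (a sketch that ignores timestamps and other tie-breakers).
func Less(p1, p2 *v1.Pod) bool {
    return podPriority(p1) > podPriority(p2)
}

// podPriority treats a nil Priority field as zero.
func podPriority(p *v1.Pod) int32 {
    if p.Spec.Priority != nil {
        return *p.Spec.Priority
    }
    return 0
}
```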
PreFilter
These plugins are used to pre-process info about the Pod, or to check certain conditions that the cluster or the Pod must meet. If a PreFilter plugin returns an error, the scheduling cycle is aborted.
Filter
These plugins are used to filter out nodes that cannot run the Pod. For each node, the scheduler will call filter plugins in their configured order. If any filter plugin marks the node as infeasible, the remaining plugins will not be called for that node. Nodes may be evaluated concurrently.
PostFilter
These plugins are called after Filter phase, but only when no feasible nodes
were found for the pod. Plugins are called in their configured order. If
any postFilter plugin marks the node as Schedulable
, the remaining plugins
will not be called. A typical PostFilter implementation is preemption, which
tries to make the pod schedulable by preempting other Pods.
PreScore
These plugins are used to perform "pre-scoring" work, which generates a sharable state for Score plugins to use. If a PreScore plugin returns an error, the scheduling cycle is aborted.
Score
These plugins are used to rank nodes that have passed the filtering phase. The scheduler will call each scoring plugin for each node. There will be a well defined range of integers representing the minimum and maximum scores. After the NormalizeScore phase, the scheduler will combine node scores from all plugins according to the configured plugin weights.
NormalizeScore
These plugins are used to modify scores before the scheduler computes a final ranking of Nodes. A plugin that registers for this extension point will be called with the Score results from the same plugin. This is called once per plugin per scheduling cycle.
For example, suppose a plugin `BlinkingLightScorer` ranks Nodes based on how many blinking lights they have.
func ScoreNode(_ *v1.Pod, n *v1.Node) (int, error) {
    return getBlinkingLightCount(n)
}
However, the maximum count of blinking lights may be small compared to `NodeScoreMax`. To fix this, `BlinkingLightScorer` should also register for this extension point.
func NormalizeScores(scores map[string]int) {
    highest := 0
    for _, score := range scores {
        highest = max(highest, score)
    }
    for node, score := range scores {
        scores[node] = score * NodeScoreMax / highest
    }
}
If any NormalizeScore plugin returns an error, the scheduling cycle is aborted.
Reserve
A plugin that implements the Reserve extension has two methods, namely `Reserve` and `Unreserve`, that back two informational scheduling phases called Reserve and Unreserve, respectively. Plugins which maintain runtime state (aka "stateful plugins") should use these phases to be notified by the scheduler when resources on a node are being reserved and unreserved for a given Pod.
The Reserve phase happens before the scheduler actually binds a Pod to its designated node. It exists to prevent race conditions while the scheduler waits for the bind to succeed. The `Reserve` method of each Reserve plugin may succeed or fail; if one `Reserve` method call fails, subsequent plugins are not executed and the Reserve phase is considered to have failed. If the `Reserve` method of all plugins succeeds, the Reserve phase is considered to be successful and the rest of the scheduling cycle and the binding cycle are executed.
The Unreserve phase is triggered if the Reserve phase or a later phase fails. When this happens, the `Unreserve` method of all Reserve plugins will be executed in the reverse order of `Reserve` method calls. This phase exists to clean up the state associated with the reserved Pod.
The `Unreserve` method in Reserve plugins must be idempotent and may not fail.
Permit
Permit plugins are invoked at the end of the scheduling cycle for each Pod, to prevent or delay the binding to the candidate node. A permit plugin can do one of three things:

- approve: Once all Permit plugins approve a Pod, it is sent for binding.
- deny: If any Permit plugin denies a Pod, it is returned to the scheduling queue. This will trigger the Unreserve phase in Reserve plugins.
- wait (with a timeout): If a Permit plugin returns "wait", then the Pod is kept in an internal "waiting" Pods list, and the binding cycle of this Pod starts but directly blocks until it gets approved. If a timeout occurs, wait becomes deny and the Pod is returned to the scheduling queue, triggering the Unreserve phase in Reserve plugins.
While any plugin can access the list of "waiting" Pods and approve them (see `FrameworkHandle`), we expect only the permit plugins to approve binding of reserved Pods that are in "waiting" state. Once a Pod is approved, it is sent to the PreBind phase.
PreBind
These plugins are used to perform any work required before a Pod is bound. For example, a pre-bind plugin may provision a network volume and mount it on the target node before allowing the Pod to run there.
If any PreBind plugin returns an error, the Pod is rejected and returned to the scheduling queue.
Bind
These plugins are used to bind a Pod to a Node. Bind plugins will not be called until all PreBind plugins have completed. Each bind plugin is called in the configured order. A bind plugin may choose whether or not to handle the given Pod. If a bind plugin chooses to handle a Pod, the remaining bind plugins are skipped.
PostBind
This is an informational extension point. Post-bind plugins are called after a Pod is successfully bound. This is the end of a binding cycle, and can be used to clean up associated resources.
Plugin API
There are two steps to the plugin API. First, plugins must register and get configured, then they use the extension point interfaces. Extension point interfaces have the following form.
type Plugin interface {
    Name() string
}

type QueueSortPlugin interface {
    Plugin
    Less(*v1.Pod, *v1.Pod) bool
}

type PreFilterPlugin interface {
    Plugin
    PreFilter(context.Context, *framework.CycleState, *v1.Pod) error
}

// ...
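As a sketch of what an implementation might look like against the simplified interfaces above (the plugin name and the rejection rule are invented for illustration; the real in-tree plugin API returns a `*Status` rather than an `error`):

```go
// NoEmptyPods is a hypothetical PreFilter plugin that rejects Pods
// declaring no containers.
type NoEmptyPods struct{}

func (p *NoEmptyPods) Name() string { return "NoEmptyPods" }

func (p *NoEmptyPods) PreFilter(ctx context.Context, state *framework.CycleState, pod *v1.Pod) error {
    if len(pod.Spec.Containers) == 0 {
        return fmt.Errorf("pod %s/%s has no containers", pod.Namespace, pod.Name)
    }
    return nil
}
```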
Plugin configuration
You can enable or disable plugins in the scheduler configuration. If you are using Kubernetes v1.18 or later, most scheduling plugins are in use and enabled by default.
In addition to default plugins, you can also implement your own scheduling plugins and get them configured along with default plugins. You can visit scheduler-plugins for more details.
If you are using Kubernetes v1.18 or later, you can configure a set of plugins as a scheduler profile and then define multiple profiles to fit various kinds of workload. Learn more at multiple profiles.
3.10.10 - Scheduler Performance Tuning
Kubernetes v1.14 [beta]
kube-scheduler is the Kubernetes default scheduler. It is responsible for placement of Pods on Nodes in a cluster.
Nodes in a cluster that meet the scheduling requirements of a Pod are called feasible Nodes for the Pod. The scheduler finds feasible Nodes for a Pod and then runs a set of functions to score the feasible Nodes, picking a Node with the highest score among the feasible ones to run the Pod. The scheduler then notifies the API server about this decision in a process called Binding.
This page explains performance tuning optimizations that are relevant for large Kubernetes clusters.
In large clusters, you can tune the scheduler's behaviour to balance scheduling outcomes between latency (new Pods are placed quickly) and accuracy (the scheduler rarely makes poor placement decisions).
You configure this tuning setting via the kube-scheduler setting `percentageOfNodesToScore`. This KubeSchedulerConfiguration setting determines a threshold for scheduling nodes in your cluster.
Setting the threshold
The `percentageOfNodesToScore` option accepts whole numeric values between 0 and 100. The value 0 is a special number which indicates that the kube-scheduler should use its compiled-in default. If you set `percentageOfNodesToScore` above 100, kube-scheduler acts as if you had set a value of 100.
To change the value, edit the kube-scheduler configuration file and then restart the scheduler. In many cases, the configuration file can be found at `/etc/kubernetes/config/kube-scheduler.yaml`.
After you have made this change, you can run `kubectl get pods -n kube-system | grep kube-scheduler` to verify that the kube-scheduler component is healthy.
Node scoring threshold
To improve scheduling performance, the kube-scheduler can stop looking for feasible nodes once it has found enough of them. In large clusters, this saves time compared to a naive approach that would consider every node.
You specify a threshold for how many nodes are enough, as a whole number percentage of all the nodes in your cluster. The kube-scheduler converts this into an integer number of nodes. During scheduling, if the kube-scheduler has identified enough feasible nodes to exceed the configured percentage, the kube-scheduler stops searching for more feasible nodes and moves on to the scoring phase.
How the scheduler iterates over Nodes describes the process in detail.
Default threshold
If you don't specify a threshold, Kubernetes calculates a figure using a linear formula that yields 50% for a 100-node cluster and yields 10% for a 5000-node cluster. The lower bound for the automatic value is 5%.
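A sketch of arithmetic consistent with those data points; the exact in-tree formula may differ between releases:

```
percentage = 50 - (number of nodes) / 125   # integer division, lower bound 5

 100 nodes → 50 - 0  = 50%
5000 nodes → 50 - 40 = 10%
```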
This means that the kube-scheduler always scores at least 5% of your cluster no matter how large the cluster is, unless you have explicitly set `percentageOfNodesToScore` to be smaller than 5.
If you want the scheduler to score all nodes in your cluster, set `percentageOfNodesToScore` to 100.
Example
Below is an example configuration that sets `percentageOfNodesToScore` to 50%.
apiVersion: kubescheduler.config.k8s.io/v1alpha1
kind: KubeSchedulerConfiguration
algorithmSource:
  provider: DefaultProvider

...

percentageOfNodesToScore: 50
Tuning percentageOfNodesToScore
`percentageOfNodesToScore` must be a value between 1 and 100, with the default value being calculated based on the cluster size. There is also a hardcoded minimum value of 50 nodes.
In clusters with fewer than 50 feasible nodes, the scheduler still checks all the nodes because there are not enough feasible nodes to stop the scheduler's search early.
In a small cluster, if you set a low value for `percentageOfNodesToScore`, your change will have little or no effect, for a similar reason.
If your cluster has several hundred Nodes or fewer, leave this configuration option at its default value. Making changes is unlikely to improve the scheduler's performance significantly.
An important detail to consider when setting this value is that when a smaller number of nodes in a cluster are checked for feasibility, some nodes are not sent to be scored for a given Pod. As a result, a Node which could possibly score a higher value for running the given Pod might not even be passed to the scoring phase. This would result in a less than ideal placement of the Pod.
You should avoid setting `percentageOfNodesToScore` very low so that kube-scheduler does not make frequent, poor Pod placement decisions. Avoid setting the percentage to anything below 10%, unless the scheduler's throughput is critical for your application and the score of nodes is not important. In other words, you prefer to run the Pod on any Node as long as it is feasible.
How the scheduler iterates over Nodes
This section is intended for those who want to understand the internal details of this feature.
In order to give all the Nodes in a cluster a fair chance of being considered for running Pods, the scheduler iterates over the nodes in a round robin fashion. You can imagine that Nodes are in an array. The scheduler starts from the start of the array and checks feasibility of the nodes until it finds enough Nodes as specified by `percentageOfNodesToScore`. For the next Pod, the scheduler continues from the point in the Node array that it stopped at when checking feasibility of Nodes for the previous Pod.
If Nodes are in multiple zones, the scheduler iterates over Nodes in various zones to ensure that Nodes from different zones are considered in the feasibility checks. As an example, consider six nodes in two zones:
Zone 1: Node 1, Node 2, Node 3, Node 4
Zone 2: Node 5, Node 6
The Scheduler evaluates feasibility of the nodes in this order:
Node 1, Node 5, Node 2, Node 6, Node 3, Node 4
After going over all the Nodes, it goes back to Node 1.
3.11 - Cluster Administration
The cluster administration overview is for anyone creating or administering a Kubernetes cluster. It assumes some familiarity with core Kubernetes concepts.
Planning a cluster
See the guides in Setup for examples of how to plan, set up, and configure Kubernetes clusters. The solutions listed in this article are called distros.
Before choosing a guide, here are some considerations:
- Do you want to try out Kubernetes on your computer, or do you want to build a high-availability, multi-node cluster? Choose distros best suited for your needs.
- Will you be using a hosted Kubernetes cluster, such as Google Kubernetes Engine, or hosting your own cluster?
- Will your cluster be on-premises, or in the cloud (IaaS)? Kubernetes does not directly support hybrid clusters. Instead, you can set up multiple clusters.
- If you are configuring Kubernetes on-premises, consider which networking model fits best.
- Will you be running Kubernetes on "bare metal" hardware or on virtual machines (VMs)?
- Do you want to run a cluster, or do you expect to do active development of Kubernetes project code? If the latter, choose an actively-developed distro. Some distros only use binary releases, but offer a greater variety of choices.
- Familiarize yourself with the components needed to run a cluster.
Managing a cluster
- Learn how to manage nodes.
- Learn how to set up and manage the resource quota for shared clusters.
Securing a cluster
- Generate Certificates describes the steps to generate certificates using different tool chains.
- Kubernetes Container Environment describes the environment for Kubelet managed containers on a Kubernetes node.
- Controlling Access to the Kubernetes API describes how Kubernetes implements access control for its own API.
- Authenticating explains authentication in Kubernetes, including the various authentication options.
- Authorization is separate from authentication, and controls how HTTP calls are handled.
- Using Admission Controllers explains plug-ins which intercept requests to the Kubernetes API server after authentication and authorization.
- Using Sysctls in a Kubernetes Cluster describes to an administrator how to use the `sysctl` command-line tool to set kernel parameters.
- Auditing describes how to interact with Kubernetes' audit logs.
Securing the kubelet
Optional Cluster Services
- DNS Integration describes how to resolve a DNS name directly to a Kubernetes service.
- Logging and Monitoring Cluster Activity explains how logging in Kubernetes works and how to implement it.
3.11.1 - Certificates
To learn how to generate certificates for your cluster, see Certificates.
3.11.2 - Managing Resources
You've deployed your application and exposed it via a service. Now what? Kubernetes provides a number of tools to help you manage your application deployment, including scaling and updating. Among the features that we will discuss in more depth are configuration files and labels.
Organizing resource configurations
Many applications require multiple resources to be created, such as a Deployment and a Service. Management of multiple resources can be simplified by grouping them together in the same file (separated by `---` in YAML). For example:
apiVersion: v1
kind: Service
metadata:
  name: my-nginx-svc
  labels:
    app: nginx
spec:
  type: LoadBalancer
  ports:
  - port: 80
  selector:
    app: nginx
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-nginx
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
Multiple resources can be created the same way as a single resource:
kubectl apply -f https://k8s.io/examples/application/nginx-app.yaml
service/my-nginx-svc created
deployment.apps/my-nginx created
The resources will be created in the order they appear in the file. Therefore, it's best to specify the service first, since that will ensure the scheduler can spread the pods associated with the service as they are created by the controller(s), such as Deployment.
kubectl apply
also accepts multiple -f
arguments:
kubectl apply -f https://k8s.io/examples/application/nginx/nginx-svc.yaml -f https://k8s.io/examples/application/nginx/nginx-deployment.yaml
And a directory can be specified rather than or in addition to individual files:
kubectl apply -f https://k8s.io/examples/application/nginx/
`kubectl` will read any files with suffixes `.yaml`, `.yml`, or `.json`.
It is a recommended practice to put resources related to the same microservice or application tier into the same file, and to group all of the files associated with your application in the same directory. If the tiers of your application bind to each other using DNS, you can deploy all of the components of your stack together.
A URL can also be specified as a configuration source, which is handy for deploying directly from configuration files checked into GitHub:
kubectl apply -f https://raw.githubusercontent.com/kubernetes/website/main/content/en/examples/application/nginx/nginx-deployment.yaml
deployment.apps/my-nginx created
Bulk operations in kubectl
Resource creation isn't the only operation that kubectl
can perform in bulk. It can also extract resource names from configuration files in order to perform other operations, in particular to delete the same resources you created:
kubectl delete -f https://k8s.io/examples/application/nginx-app.yaml
deployment.apps "my-nginx" deleted
service "my-nginx-svc" deleted
In the case of two resources, you can specify both resources on the command line using the resource/name syntax:
kubectl delete deployments/my-nginx services/my-nginx-svc
For larger numbers of resources, you'll find it easier to specify a selector (label query) using `-l` or `--selector`, to filter resources by their labels:
kubectl delete deployment,services -l app=nginx
deployment.apps "my-nginx" deleted
service "my-nginx-svc" deleted
Because `kubectl` outputs resource names in the same syntax it accepts, you can chain operations using `$()` or `xargs`:
kubectl get $(kubectl create -f docs/concepts/cluster-administration/nginx/ -o name | grep service)
kubectl create -f docs/concepts/cluster-administration/nginx/ -o name | grep service | xargs -i kubectl get {}
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
my-nginx-svc LoadBalancer 10.0.0.208 <pending> 80/TCP 0s
With the above commands, we first create resources under `examples/application/nginx/` and print the resources created with `-o name` output format (print each resource as resource/name). Then we `grep` only the "service", and then print it with `kubectl get`.
If you happen to organize your resources across several subdirectories within a particular directory, you can recursively perform the operations on the subdirectories also, by specifying `--recursive` or `-R` alongside the `--filename,-f` flag.
For instance, assume there is a directory `project/k8s/development` that holds all of the manifests needed for the development environment, organized by resource type:
project/k8s/development
├── configmap
│ └── my-configmap.yaml
├── deployment
│ └── my-deployment.yaml
└── pvc
└── my-pvc.yaml
By default, performing a bulk operation on project/k8s/development
will stop at the first level of the directory, not processing any subdirectories. If we had tried to create the resources in this directory using the following command, we would have encountered an error:
kubectl apply -f project/k8s/development
error: you must provide one or more resources by argument or filename (.json|.yaml|.yml|stdin)
Instead, specify the `--recursive` or `-R` flag with the `--filename,-f` flag as such:
kubectl apply -f project/k8s/development --recursive
configmap/my-config created
deployment.apps/my-deployment created
persistentvolumeclaim/my-pvc created
The `--recursive` flag works with any operation that accepts the `--filename,-f` flag, such as: `kubectl {create,get,delete,describe,rollout}` etc.
The `--recursive` flag also works when multiple `-f` arguments are provided:
kubectl apply -f project/k8s/namespaces -f project/k8s/development --recursive
namespace/development created
namespace/staging created
configmap/my-config created
deployment.apps/my-deployment created
persistentvolumeclaim/my-pvc created
If you're interested in learning more about `kubectl`, go ahead and read Command line tool (kubectl).
Using labels effectively
The examples we've used so far apply at most a single label to any resource. There are many scenarios where multiple labels should be used to distinguish sets from one another.
For instance, different applications would use different values for the `app` label, but a multi-tier application, such as the guestbook example, would additionally need to distinguish each tier. The frontend could carry the following labels:
labels:
  app: guestbook
  tier: frontend
while the Redis master and slave would have different `tier` labels, and perhaps even an additional `role` label:

labels:
  app: guestbook
  tier: backend
  role: master

and

labels:
  app: guestbook
  tier: backend
  role: slave
The labels allow us to slice and dice our resources along any dimension specified by a label:
kubectl apply -f examples/guestbook/all-in-one/guestbook-all-in-one.yaml
kubectl get pods -Lapp -Ltier -Lrole
NAME READY STATUS RESTARTS AGE APP TIER ROLE
guestbook-fe-4nlpb 1/1 Running 0 1m guestbook frontend <none>
guestbook-fe-ght6d 1/1 Running 0 1m guestbook frontend <none>
guestbook-fe-jpy62 1/1 Running 0 1m guestbook frontend <none>
guestbook-redis-master-5pg3b 1/1 Running 0 1m guestbook backend master
guestbook-redis-slave-2q2yf 1/1 Running 0 1m guestbook backend slave
guestbook-redis-slave-qgazl 1/1 Running 0 1m guestbook backend slave
my-nginx-divi2 1/1 Running 0 29m nginx <none> <none>
my-nginx-o0ef1 1/1 Running 0 29m nginx <none> <none>
kubectl get pods -lapp=guestbook,role=slave
NAME READY STATUS RESTARTS AGE
guestbook-redis-slave-2q2yf 1/1 Running 0 3m
guestbook-redis-slave-qgazl 1/1 Running 0 3m
Canary deployments
Another scenario where multiple labels are needed is to distinguish deployments of different releases or configurations of the same component. It is common practice to deploy a canary of a new application release (specified via image tag in the pod template) side by side with the previous release so that the new release can receive live production traffic before fully rolling it out.
For instance, you can use a `track` label to differentiate different releases. The primary, stable release would have a `track` label with value as `stable`:
name: frontend
replicas: 3
...
labels:
  app: guestbook
  tier: frontend
  track: stable
...
image: gb-frontend:v3
and then you can create a new release of the guestbook frontend that carries the `track` label with a different value (i.e. `canary`), so that two sets of pods would not overlap:
name: frontend-canary
replicas: 1
...
labels:
  app: guestbook
  tier: frontend
  track: canary
...
image: gb-frontend:v4
The frontend service would span both sets of replicas by selecting the common subset of their labels (i.e. omitting the `track` label), so that the traffic will be redirected to both applications:

selector:
  app: guestbook
  tier: frontend
You can tweak the number of replicas of the stable and canary releases to determine the ratio of each release that will receive live production traffic (in this case, 3:1). Once you're confident, you can update the stable track to the new application release and remove the canary one.
For a more concrete example, check the tutorial of deploying Ghost.
Updating labels
Sometimes existing pods and other resources need to be relabeled before creating new resources. This can be done with `kubectl label`.
For example, if you want to label all your nginx pods as frontend tier, run:
kubectl label pods -l app=nginx tier=fe
pod/my-nginx-2035384211-j5fhi labeled
pod/my-nginx-2035384211-u2c7e labeled
pod/my-nginx-2035384211-u3t6x labeled
This first filters all pods with the label "app=nginx", and then labels them with "tier=fe". To see the pods you labeled, run:
kubectl get pods -l app=nginx -L tier
NAME READY STATUS RESTARTS AGE TIER
my-nginx-2035384211-j5fhi 1/1 Running 0 23m fe
my-nginx-2035384211-u2c7e 1/1 Running 0 23m fe
my-nginx-2035384211-u3t6x 1/1 Running 0 23m fe
This outputs all "app=nginx" pods, with an additional label column of pods' tier (specified with `-L` or `--label-columns`).
For more information, please see labels and kubectl label.
Updating annotations
Sometimes you would want to attach annotations to resources. Annotations are arbitrary non-identifying metadata for retrieval by API clients such as tools, libraries, etc. This can be done with `kubectl annotate`. For example:
kubectl annotate pods my-nginx-v4-9gw19 description='my frontend running nginx'
kubectl get pods my-nginx-v4-9gw19 -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    description: my frontend running nginx
...
For more information, please see annotations and kubectl annotate document.
Scaling your application
When load on your application grows or shrinks, use `kubectl` to scale your application. For instance, to decrease the number of nginx replicas from 3 to 1, do:
kubectl scale deployment/my-nginx --replicas=1
deployment.apps/my-nginx scaled
Now you only have one pod managed by the deployment.
kubectl get pods -l app=nginx
NAME READY STATUS RESTARTS AGE
my-nginx-2035384211-j5fhi 1/1 Running 0 30m
To have the system automatically choose the number of nginx replicas as needed, ranging from 1 to 3, do:
kubectl autoscale deployment/my-nginx --min=1 --max=3
horizontalpodautoscaler.autoscaling/my-nginx autoscaled
Now your nginx replicas will be scaled up and down as needed, automatically.
For more information, please see kubectl scale, kubectl autoscale and horizontal pod autoscaler document.
In-place updates of resources
Sometimes it's necessary to make narrow, non-disruptive updates to resources you've created.
kubectl apply
It is suggested to maintain a set of configuration files in source control
(see configuration as code),
so that they can be maintained and versioned along with the code for the resources they configure.
Then, you can use kubectl apply
to push your configuration changes to the cluster.
This command will compare the version of the configuration that you're pushing with the previous version and apply the changes you've made, without overwriting any automated changes to properties you haven't specified.
kubectl apply -f https://k8s.io/examples/application/nginx/nginx-deployment.yaml
deployment.apps/my-nginx configured
Note that `kubectl apply` attaches an annotation to the resource in order to determine the changes to the configuration since the previous invocation. When it's invoked, `kubectl apply` does a three-way diff between the previous configuration, the provided input and the current configuration of the resource, in order to determine how to modify the resource.

Currently, resources are created without this annotation, so the first invocation of `kubectl apply` will fall back to a two-way diff between the provided input and the current configuration of the resource. During this first invocation, it cannot detect the deletion of properties set when the resource was created. For this reason, it will not remove them.

All subsequent calls to `kubectl apply`, and other commands that modify the configuration, such as `kubectl replace` and `kubectl edit`, will update the annotation, allowing subsequent calls to `kubectl apply` to detect and perform deletions using a three-way diff.
kubectl edit
Alternatively, you may also update resources with `kubectl edit`:
kubectl edit deployment/my-nginx
This is equivalent to first `get` the resource, edit it in a text editor, and then `apply` the resource with the updated version:
kubectl get deployment my-nginx -o yaml > /tmp/nginx.yaml
vi /tmp/nginx.yaml
# do some edit, and then save the file
kubectl apply -f /tmp/nginx.yaml
deployment.apps/my-nginx configured
rm /tmp/nginx.yaml
This allows you to do more significant changes more easily. Note that you can specify the editor with your `EDITOR` or `KUBE_EDITOR` environment variables.
For more information, please see kubectl edit document.
kubectl patch
You can use `kubectl patch` to update API objects in place. This command supports JSON patch, JSON merge patch, and strategic merge patch. See Update API Objects in Place Using kubectl patch and kubectl patch.
Disruptive updates
In some cases, you may need to update resource fields that cannot be updated once initialized, or you may want to make a recursive change immediately, such as to fix broken pods created by a Deployment. To change such fields, use `replace --force`, which deletes and re-creates the resource. In this case, you can modify your original configuration file:
kubectl replace -f https://k8s.io/examples/application/nginx/nginx-deployment.yaml --force
deployment.apps/my-nginx deleted
deployment.apps/my-nginx replaced
Updating your application without a service outage
At some point, you'll eventually need to update your deployed application, typically by specifying a new image or image tag, as in the canary deployment scenario above. `kubectl` supports several update operations, each of which is applicable to different scenarios.
We'll guide you through how to create and update applications with Deployments.
Let's say you were running version 1.14.2 of nginx:
kubectl create deployment my-nginx --image=nginx:1.14.2
deployment.apps/my-nginx created
with 3 replicas (so the old and new revisions can coexist):
kubectl scale deployment my-nginx --current-replicas=1 --replicas=3
deployment.apps/my-nginx scaled
To update to version 1.16.1, change `.spec.template.spec.containers[0].image` from `nginx:1.14.2` to `nginx:1.16.1` using the previous kubectl commands.
kubectl edit deployment/my-nginx
That's it! The Deployment will declaratively update the deployed nginx application progressively behind the scene. It ensures that only a certain number of old replicas may be down while they are being updated, and only a certain number of new replicas may be created above the desired number of pods. To learn more details about it, visit Deployment page.
What's next
3.11.3 - Cluster Networking
Networking is a central part of Kubernetes, but it can be challenging to understand exactly how it is expected to work. There are 4 distinct networking problems to address:
- Highly-coupled container-to-container communications: this is solved by Pods and `localhost` communications.
- Pod-to-Pod communications: this is the primary focus of this document.
- Pod-to-Service communications: this is covered by services.
- External-to-Service communications: this is covered by services.
Kubernetes is all about sharing machines between applications. Typically, sharing machines requires ensuring that two applications do not try to use the same ports. Coordinating ports across multiple developers is very difficult to do at scale and exposes users to cluster-level issues outside of their control.
Dynamic port allocation brings a lot of complications to the system - every application has to take ports as flags, the API servers have to know how to insert dynamic port numbers into configuration blocks, services have to know how to find each other, etc. Rather than deal with this, Kubernetes takes a different approach.
To learn about the Kubernetes networking model, see here.
How to implement the Kubernetes networking model
There are a number of ways that this network model can be implemented. This document is not an exhaustive study of the various methods, but hopefully serves as an introduction to various technologies and serves as a jumping-off point.
The following networking options are sorted alphabetically - the order does not imply any preferential status.
ACI
Cisco Application Centric Infrastructure offers an integrated overlay and underlay SDN solution that supports containers, virtual machines, and bare metal servers. ACI provides container networking integration for ACI. An overview of the integration is provided here.
Antrea
Project Antrea is an open source Kubernetes networking solution intended to be Kubernetes native. It leverages Open vSwitch as the networking data plane. Open vSwitch is a high-performance programmable virtual switch that supports both Linux and Windows. Open vSwitch enables Antrea to implement Kubernetes Network Policies in a high-performance and efficient manner. Thanks to the "programmable" characteristic of Open vSwitch, Antrea is able to implement an extensive set of networking and security features and services on top of Open vSwitch.
AWS VPC CNI for Kubernetes
The AWS VPC CNI offers integrated AWS Virtual Private Cloud (VPC) networking for Kubernetes clusters. This CNI plugin offers high throughput and availability, low latency, and minimal network jitter. Additionally, users can apply existing AWS VPC networking and security best practices for building Kubernetes clusters. This includes the ability to use VPC flow logs, VPC routing policies, and security groups for network traffic isolation.
Using this CNI plugin allows Kubernetes pods to have the same IP address inside the pod as they do on the VPC network. The CNI allocates AWS Elastic Network Interfaces (ENIs) to each Kubernetes node and uses the secondary IP range from each ENI for pods on the node. The CNI includes controls for pre-allocation of ENIs and IP addresses for fast pod startup times and enables large clusters of up to 2,000 nodes.
Additionally, the CNI can be run alongside Calico for network policy enforcement. The AWS VPC CNI project is open source with documentation on GitHub.
Azure CNI for Kubernetes
Azure CNI is an open source plugin that integrates Kubernetes Pods with an Azure Virtual Network (also known as VNet) providing network performance at par with VMs. Pods can connect to peered VNet and to on-premises over Express Route or site-to-site VPN and are also directly reachable from these networks. Pods can access Azure services, such as storage and SQL, that are protected by Service Endpoints or Private Link. You can use VNet security policies and routing to filter Pod traffic. The plugin assigns VNet IPs to Pods by utilizing a pool of secondary IPs pre-configured on the Network Interface of a Kubernetes node.
Azure CNI is available natively in the Azure Kubernetes Service (AKS).
Calico
Calico is an open source networking and network security solution for containers, virtual machines, and native host-based workloads. Calico supports multiple data planes including: a pure Linux eBPF dataplane, a standard Linux networking dataplane, and a Windows HNS dataplane. Calico provides a full networking stack but can also be used in conjunction with cloud provider CNIs to provide network policy enforcement.
Cilium
Cilium is open source software for providing and transparently securing network connectivity between application containers. Cilium is L7/HTTP aware and can enforce network policies on L3-L7 using an identity based security model that is decoupled from network addressing, and it can be used in combination with other CNI plugins.
CNI-Genie from Huawei
CNI-Genie is a CNI plugin that enables Kubernetes to simultaneously have access to different implementations of the Kubernetes network model at runtime. This includes any implementation that runs as a CNI plugin, such as Flannel, Calico, or Weave-net.
CNI-Genie also supports assigning multiple IP addresses to a pod, each from a different CNI plugin.
cni-ipvlan-vpc-k8s
cni-ipvlan-vpc-k8s contains a set of CNI and IPAM plugins to provide a simple, host-local, low latency, high throughput, and compliant networking stack for Kubernetes within Amazon Virtual Private Cloud (VPC) environments by making use of Amazon Elastic Network Interfaces (ENI) and binding AWS-managed IPs into Pods using the Linux kernel's IPvlan driver in L2 mode.
The plugins are designed to be straightforward to configure and deploy within a VPC. Kubelets boot and then self-configure and scale their IP usage as needed without requiring the often recommended complexities of administering overlay networks, BGP, disabling source/destination checks, or adjusting VPC route tables to provide per-instance subnets to each host (which is limited to 50-100 entries per VPC). In short, cni-ipvlan-vpc-k8s significantly reduces the network complexity required to deploy Kubernetes at scale within AWS.
Coil
Coil is a CNI plugin designed for ease of integration, providing flexible egress networking. Coil operates with a low overhead compared to bare metal, and allows you to define arbitrary egress NAT gateways for external networks.
Contiv-VPP
Contiv-VPP is a user-space, performance-oriented network plugin for Kubernetes, using the fd.io data plane.
Contrail / Tungsten Fabric
Contrail, based on Tungsten Fabric, is a truly open, multi-cloud network virtualization and policy management platform. Contrail and Tungsten Fabric are integrated with various orchestration systems such as Kubernetes, OpenShift, OpenStack and Mesos, and provide different isolation modes for virtual machines, containers/pods and bare metal workloads.
DANM
DANM is a networking solution for telco workloads running in a Kubernetes cluster. It's built up from the following components:
- A CNI plugin capable of provisioning IPVLAN interfaces with advanced features
- An in-built IPAM module capable of managing multiple, cluster-wide, discontinuous L3 networks and providing a dynamic, static, or no IP allocation scheme on demand
- A CNI metaplugin capable of attaching multiple network interfaces to a container, either through its own CNI, or through delegating the job to any of the popular CNI solutions, such as SR-IOV or Flannel, in parallel
- A Kubernetes controller capable of centrally managing both VxLAN and VLAN interfaces of all Kubernetes hosts
- Another Kubernetes controller extending Kubernetes' Service-based service discovery concept to work over all network interfaces of a Pod
With this toolset DANM is able to provide multiple separated network interfaces, the possibility to use different networking back ends and advanced IPAM features for the pods.
Flannel
Flannel is a very simple overlay network that satisfies the Kubernetes requirements. Many people have reported success with Flannel and Kubernetes.
Hybridnet
Hybridnet is an open source CNI plugin designed for hybrid clouds which provides both overlay and underlay networking for containers in one or more clusters. Overlay and underlay containers can run on the same node and have cluster-wide bidirectional network connectivity.
Jaguar
Jaguar is an open source solution for Kubernetes networking based on OpenDaylight. Jaguar provides an overlay network using VXLAN, and the Jaguar CNI plugin provides one IP address per pod.
k-vswitch
k-vswitch is a simple Kubernetes networking plugin based on Open vSwitch. It leverages existing functionality in Open vSwitch to provide a robust networking plugin that is easy-to-operate, performant and secure.
Knitter
Knitter is a network solution which supports multiple networks in Kubernetes. It provides tenant management and network management. Beyond multiple network planes, Knitter includes a set of end-to-end NFV container networking solutions, such as keeping IP addresses for applications and IP address migration.
Kube-OVN
Kube-OVN is an OVN-based Kubernetes network fabric for enterprises. With the help of OVN/OVS, it provides advanced overlay network features such as subnets, QoS, static IP allocation, traffic mirroring, gateways, OpenFlow-based network policy, and service proxying.
Kube-router
Kube-router is a purpose-built networking solution for Kubernetes that aims to provide high performance and operational simplicity. Kube-router provides a Linux LVS/IPVS-based service proxy, a Linux kernel forwarding-based pod-to-pod networking solution with no overlays, and an iptables/ipset-based network policy enforcer.
L2 networks and linux bridging
If you have a "dumb" L2 network, such as a simple switch in a "bare-metal" environment, you should be able to do something similar to the above GCE setup. Note that these instructions have only been tried very casually - it seems to work, but has not been thoroughly tested. If you use this technique and perfect the process, please let us know.
Follow the "With Linux Bridge devices" section of this very nice tutorial from Lars Kellogg-Stedman.
Multus (a Multi Network plugin)
Multus is a Multi CNI plugin to support the Multi Networking feature in Kubernetes using CRD based network objects in Kubernetes.
Multus supports all reference plugins (e.g. Flannel, DHCP, Macvlan) that implement the CNI specification, as well as third-party plugins (e.g. Calico, Weave, Cilium, Contiv). In addition, Multus supports SRIOV, DPDK, OVS-DPDK, and VPP workloads in Kubernetes for both cloud-native and NFV-based applications.
OVN4NFV-K8s-Plugin (OVN based CNI controller & plugin)
OVN4NFV-K8S-Plugin is an OVN-based CNI controller plugin that provides cloud-native service function chaining (SFC), multiple OVN overlay networks, dynamic subnet creation, dynamic creation of virtual networks, VLAN provider networks, and direct provider networks, and it is pluggable with other multi-network plugins. It is ideal for edge-based, cloud-native workloads in multi-cluster networking.
NSX-T
VMware NSX-T is a network virtualization and security platform. NSX-T can provide network virtualization for a multi-cloud and multi-hypervisor environment and is focused on emerging application frameworks and architectures that have heterogeneous endpoints and technology stacks. In addition to vSphere hypervisors, these environments include other hypervisors such as KVM, containers, and bare metal.
NSX-T Container Plug-in (NCP) provides integration between NSX-T and container orchestrators such as Kubernetes, as well as integration between NSX-T and container-based CaaS/PaaS platforms such as Pivotal Container Service (PKS) and OpenShift.
OVN (Open Virtual Networking)
OVN is an open source network virtualization solution developed by the Open vSwitch community. It lets one create logical switches, logical routers, stateful ACLs, load balancers, and so on to build different virtual networking topologies. The project has a specific Kubernetes plugin and documentation at ovn-kubernetes.
Weave Net from Weaveworks
Weave Net is a resilient and simple to use network for Kubernetes and its hosted applications. Weave Net runs as a CNI plug-in or stand-alone. In either version, it doesn't require any configuration or extra code to run, and in both cases, the network provides one IP address per pod - as is standard for Kubernetes.
What's next
The early design of the networking model and its rationale, and some future plans are described in more detail in the networking design document.
3.11.4 - Logging Architecture
Application logs can help you understand what is happening inside your application. The logs are particularly useful for debugging problems and monitoring cluster activity. Most modern applications have some kind of logging mechanism. Likewise, container engines are designed to support logging. The easiest and most adopted logging method for containerized applications is writing to standard output and standard error streams.
However, the native functionality provided by a container engine or runtime is usually not enough for a complete logging solution.
For example, you may want to access your application's logs if a container crashes, a pod gets evicted, or a node dies.
In a cluster, logs should have a separate storage and lifecycle independent of nodes, pods, or containers. This concept is called cluster-level logging.
Cluster-level logging architectures require a separate backend to store, analyze, and query logs. Kubernetes does not provide a native storage solution for log data. Instead, there are many logging solutions that integrate with Kubernetes. The following sections describe how to handle and store logs on nodes.
Basic logging in Kubernetes
This example uses a Pod specification with a container that writes text to the standard output stream once per second.
apiVersion: v1
kind: Pod
metadata:
name: counter
spec:
containers:
- name: count
image: busybox:1.28
args: [/bin/sh, -c,
'i=0; while true; do echo "$i: $(date)"; i=$((i+1)); sleep 1; done']
To run this pod, use the following command:
kubectl apply -f https://k8s.io/examples/debug/counter-pod.yaml
The output is:
pod/counter created
To fetch the logs, use the kubectl logs
command, as follows:
kubectl logs counter
The output is:
0: Mon Jan 1 00:00:00 UTC 2001
1: Mon Jan 1 00:00:01 UTC 2001
2: Mon Jan 1 00:00:02 UTC 2001
...
You can use kubectl logs --previous
to retrieve logs from a previous instantiation of a container.
If your pod has multiple containers, specify which container's logs you want to access by
appending a container name to the command, with a -c
flag, like so:
kubectl logs counter -c count
See the kubectl logs
documentation for more details.
Logging at the node level
A container engine handles and redirects any output generated to a containerized application's stdout
and stderr
streams.
For example, the Docker container engine redirects those two streams to a logging driver, which is configured in Kubernetes to write to a file in JSON format.
By default, if a container restarts, the kubelet keeps one terminated container with its logs. If a pod is evicted from the node, all corresponding containers are also evicted, along with their logs.
An important consideration in node-level logging is implementing log rotation,
so that logs don't consume all available storage on the node. Kubernetes
is not responsible for rotating logs, but rather a deployment tool
should set up a solution to address that.
For example, in Kubernetes clusters deployed by the kube-up.sh script, there is a logrotate tool configured to run each hour. You can also set up a container runtime to rotate an application's logs automatically.
As an example, you can find detailed information about how kube-up.sh
sets
up logging for COS image on GCP in the corresponding
configure-helper
script.
When using a CRI container runtime, the kubelet is responsible for rotating the logs and managing the logging directory structure.
The kubelet sends this information to the CRI container runtime and the runtime writes the container logs to the given location.
The two kubelet parameters containerLogMaxSize and containerLogMaxFiles in the kubelet config file can be used to configure the maximum size for each log file and the maximum number of files allowed for each container respectively.
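For example, a kubelet configuration file might set these fields as follows; the values shown happen to be the kubelet defaults and are only illustrative:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerLogMaxSize: "10Mi"   # maximum size of each container log file
containerLogMaxFiles: 5       # maximum number of rotated files per container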
When you run kubectl logs
as in
the basic logging example, the kubelet on the node handles the request and
reads directly from the log file. The kubelet returns the content of the log file.
Note: Only the contents of the latest log file are available through kubectl logs. For example, if there's a 10MB file, logrotate performs the rotation and there are two files: one file that is 10MB in size and a second file that is empty. kubectl logs then returns the latest log file, which in this example is an empty response.
System component logs
There are two types of system components: those that run in a container and those that do not run in a container. For example:
- The Kubernetes scheduler and kube-proxy run in a container.
- The kubelet and container runtime do not run in containers.
On machines with systemd, the kubelet and container runtime write to journald. If systemd is not present, the kubelet and container runtime write to .log files in the /var/log directory. System components inside containers always write to the /var/log directory, bypassing the default logging mechanism. They use the klog logging library. You can find the conventions for logging severity for those components in the development docs on logging.
Similar to the container logs, system component logs in the /var/log directory should be rotated. In Kubernetes clusters brought up by the kube-up.sh script, those logs are configured to be rotated by the logrotate tool daily or once the size exceeds 100MB.
Cluster-level logging architectures
While Kubernetes does not provide a native solution for cluster-level logging, there are several common approaches you can consider. Here are some options:
- Use a node-level logging agent that runs on every node.
- Include a dedicated sidecar container for logging in an application pod.
- Push logs directly to a backend from within an application.
Using a node logging agent
You can implement cluster-level logging by including a node-level logging agent on each node. The logging agent is a dedicated tool that exposes logs or pushes logs to a backend. Commonly, the logging agent is a container that has access to a directory with log files from all of the application containers on that node.
Because the logging agent must run on every node, it is recommended to run the agent as a DaemonSet.
Node-level logging creates only one agent per node and doesn't require any changes to the applications running on the node.
Containers write to stdout and stderr, but with no agreed format. A node-level agent collects these logs and forwards them for aggregation.
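As a rough sketch, such an agent is typically deployed as a DaemonSet that mounts the node's log directory; the agent image and all names below are placeholders for illustration, not a specific product:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-logging-agent      # hypothetical name
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: node-logging-agent
  template:
    metadata:
      labels:
        name: node-logging-agent
    spec:
      containers:
      - name: logging-agent
        image: example.com/logging-agent:1.0   # placeholder; substitute your agent's image
        volumeMounts:
        - name: varlog
          mountPath: /var/log
          readOnly: true                       # the agent only reads node logs
      volumes:
      - name: varlog
        hostPath:
          path: /var/log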
Using a sidecar container with the logging agent
You can use a sidecar container in one of the following ways:
- The sidecar container streams application logs to its own stdout.
- The sidecar container runs a logging agent, which is configured to pick up logs from an application container.
Streaming sidecar container
By having your sidecar containers write to their own stdout
and stderr
streams, you can take advantage of the kubelet and the logging agent that
already run on each node. The sidecar containers read logs from a file, a socket,
or journald. Each sidecar container prints a log to its own stdout
or stderr
stream.
This approach allows you to separate several log streams from different parts of your application, some of which can lack support for writing to stdout or stderr. The logic behind redirecting logs is minimal, so it's not a significant overhead. Additionally, because stdout and stderr are handled by the kubelet, you can use built-in tools like kubectl logs.
For example, a pod runs a single container, and the container writes to two different log files using two different formats. Here's a configuration file for the Pod:
apiVersion: v1
kind: Pod
metadata:
name: counter
spec:
containers:
- name: count
image: busybox:1.28
args:
- /bin/sh
- -c
- >
i=0;
while true;
do
echo "$i: $(date)" >> /var/log/1.log;
echo "$(date) INFO $i" >> /var/log/2.log;
i=$((i+1));
sleep 1;
done
volumeMounts:
- name: varlog
mountPath: /var/log
volumes:
- name: varlog
emptyDir: {}
It is not recommended to write log entries with different formats to the same log stream, even if you managed to redirect both components to the stdout stream of the container. Instead, you can create two sidecar containers. Each sidecar container could tail a particular log file from a shared volume and then redirect the logs to its own stdout stream.
Here's a configuration file for a pod that has two sidecar containers:
apiVersion: v1
kind: Pod
metadata:
name: counter
spec:
containers:
- name: count
image: busybox:1.28
args:
- /bin/sh
- -c
- >
i=0;
while true;
do
echo "$i: $(date)" >> /var/log/1.log;
echo "$(date) INFO $i" >> /var/log/2.log;
i=$((i+1));
sleep 1;
done
volumeMounts:
- name: varlog
mountPath: /var/log
- name: count-log-1
image: busybox:1.28
args: [/bin/sh, -c, 'tail -n+1 -f /var/log/1.log']
volumeMounts:
- name: varlog
mountPath: /var/log
- name: count-log-2
image: busybox:1.28
args: [/bin/sh, -c, 'tail -n+1 -f /var/log/2.log']
volumeMounts:
- name: varlog
mountPath: /var/log
volumes:
- name: varlog
emptyDir: {}
Now when you run this pod, you can access each log stream separately by running the following commands:
kubectl logs counter count-log-1
The output is:
0: Mon Jan 1 00:00:00 UTC 2001
1: Mon Jan 1 00:00:01 UTC 2001
2: Mon Jan 1 00:00:02 UTC 2001
...
kubectl logs counter count-log-2
The output is:
Mon Jan 1 00:00:00 UTC 2001 INFO 0
Mon Jan 1 00:00:01 UTC 2001 INFO 1
Mon Jan 1 00:00:02 UTC 2001 INFO 2
...
The node-level agent installed in your cluster picks up those log streams automatically without any further configuration. If you like, you can configure the agent to parse log lines depending on the source container.
Note that despite low CPU and memory usage (on the order of a couple of millicores for CPU and several megabytes for memory), writing logs to a file and then streaming them to stdout can double disk usage. If you have an application that writes to a single file, it's recommended to set /dev/stdout as the destination rather than implement the streaming sidecar container approach.
Sidecar containers can also be used to rotate log files that cannot be rotated by the application itself. An example of this approach is a small container running logrotate periodically. However, it's recommended to use stdout and stderr directly and leave rotation and retention policies to the kubelet.
Sidecar container with a logging agent
If the node-level logging agent is not flexible enough for your situation, you can create a sidecar container with a separate logging agent that you have configured specifically to run with your application.
Note: Using a logging agent in a sidecar container can lead to significant resource consumption. Moreover, you won't be able to access those logs using kubectl logs because they are not controlled by the kubelet.
Here are two configuration files that you can use to implement a sidecar container with a logging agent. The first file contains
a ConfigMap
to configure fluentd.
apiVersion: v1
kind: ConfigMap
metadata:
name: fluentd-config
data:
fluentd.conf: |
<source>
type tail
format none
path /var/log/1.log
pos_file /var/log/1.log.pos
tag count.format1
</source>
<source>
type tail
format none
path /var/log/2.log
pos_file /var/log/2.log.pos
tag count.format2
</source>
<match **>
type google_cloud
</match>
The second file describes a pod that has a sidecar container running fluentd. The pod mounts a volume where fluentd can pick up its configuration data.
apiVersion: v1
kind: Pod
metadata:
name: counter
spec:
containers:
- name: count
image: busybox:1.28
args:
- /bin/sh
- -c
- >
i=0;
while true;
do
echo "$i: $(date)" >> /var/log/1.log;
echo "$(date) INFO $i" >> /var/log/2.log;
i=$((i+1));
sleep 1;
done
volumeMounts:
- name: varlog
mountPath: /var/log
- name: count-agent
image: k8s.gcr.io/fluentd-gcp:1.30
env:
- name: FLUENTD_ARGS
value: -c /etc/fluentd-config/fluentd.conf
volumeMounts:
- name: varlog
mountPath: /var/log
- name: config-volume
mountPath: /etc/fluentd-config
volumes:
- name: varlog
emptyDir: {}
- name: config-volume
configMap:
name: fluentd-config
In the sample configurations, you can replace fluentd with any logging agent, reading from any source inside an application container.
Exposing logs directly from the application
Cluster-logging that exposes or pushes logs directly from every application is outside the scope of Kubernetes.
3.11.5 - Metrics For Kubernetes System Components
System component metrics can give a better look into what is happening inside them. Metrics are particularly useful for building dashboards and alerts.
Kubernetes components emit metrics in Prometheus format. This format is structured plain text, designed so that people and machines can both read it.
Metrics in Kubernetes
In most cases, metrics are available on the /metrics endpoint of the HTTP server. For components that don't expose an endpoint by default, it can be enabled using the --bind-address flag.
Examples of those components:
- kube-controller-manager
- kube-proxy
- kube-apiserver
- kube-scheduler
- kubelet
In a production environment you may want to configure Prometheus Server or some other metrics scraper to periodically gather these metrics and make them available in some kind of time series database.
Note that the kubelet also exposes metrics on the /metrics/cadvisor, /metrics/resource and /metrics/probes endpoints. Those metrics do not have the same lifecycle.
If your cluster uses RBAC, reading metrics requires authorization via a user, group or ServiceAccount with a ClusterRole that allows accessing /metrics.
For example:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: prometheus
rules:
- nonResourceURLs:
- "/metrics"
verbs:
- get
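To grant that ClusterRole to your metrics scraper, bind it to the identity the scraper runs as; the ServiceAccount name and namespace below are assumptions for illustration:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus        # hypothetical ServiceAccount used by the scraper
  namespace: monitoring   # hypothetical namespace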
Metric lifecycle
Alpha metric → Stable metric → Deprecated metric → Hidden metric → Deleted metric
Alpha metrics have no stability guarantees. These metrics can be modified or deleted at any time.
Stable metrics are guaranteed to not change. This means:
- A stable metric without a deprecated signature will not be deleted or renamed
- A stable metric's type will not be modified
Deprecated metrics are slated for deletion, but are still available for use. These metrics include an annotation about the version in which they became deprecated.
For example:
Before deprecation:
# HELP some_counter this counts things
# TYPE some_counter counter
some_counter 0
After deprecation:
# HELP some_counter (Deprecated since 1.15.0) this counts things
# TYPE some_counter counter
some_counter 0
Hidden metrics are no longer published for scraping, but are still available for use. To use a hidden metric, please refer to the Show hidden metrics section.
Deleted metrics are no longer published and cannot be used.
Show hidden metrics
As described above, admins can enable hidden metrics through a command-line flag on a specific binary. This is intended to be used as an escape hatch for admins if they missed the migration of metrics deprecated in the last release.
The flag show-hidden-metrics-for-version takes a version for which you want to show metrics deprecated in that release. The version is expressed as x.y, where x is the major version and y is the minor version. The patch version is not needed, even though a metric can be deprecated in a patch release, because the metrics deprecation policy runs against minor releases.
The flag can only take the previous minor version as its value. All metrics hidden in the previous minor release will be emitted if admins set that version in show-hidden-metrics-for-version. Older versions are not allowed, because this would violate the metrics deprecation policy.
Take metric A as an example, and assume that A is deprecated in release 1.n. According to the metrics deprecation policy, we can reach the following conclusions:
- In release 1.n, the metric is deprecated, and it can be emitted by default.
- In release 1.n+1, the metric is hidden by default and it can be emitted by the command line show-hidden-metrics-for-version=1.n.
- In release 1.n+2, the metric should be removed from the codebase. No escape hatch anymore.
If you're upgrading from release 1.12 to 1.13, but still depend on a metric A deprecated in 1.12, you should set hidden metrics via the command line --show-hidden-metrics=1.12, and remember to remove this metric dependency before upgrading to 1.14.
Disable accelerator metrics
The kubelet collects accelerator metrics through cAdvisor. To collect these metrics, for accelerators like NVIDIA GPUs, kubelet held an open handle on the driver. This meant that in order to perform infrastructure changes (for example, updating the driver), a cluster administrator needed to stop the kubelet agent.
The responsibility for collecting accelerator metrics now belongs to the vendor rather than the kubelet. Vendors must provide a container that collects metrics and exposes them to the metrics service (for example, Prometheus).
The DisableAcceleratorUsageMetrics
feature gate disables metrics collected by the kubelet, with a timeline for enabling this feature by default.
Component metrics
kube-controller-manager metrics
Controller manager metrics provide important insight into the performance and health of the controller manager. These metrics include common Go language runtime metrics such as go_routine count and controller specific metrics such as etcd request latencies or Cloudprovider (AWS, GCE, OpenStack) API latencies that can be used to gauge the health of a cluster.
Starting from Kubernetes 1.7, detailed Cloudprovider metrics are available for storage operations for GCE, AWS, vSphere and OpenStack. These metrics can be used to monitor the health of persistent volume operations.
For example, for GCE these metrics are called:
cloudprovider_gce_api_request_duration_seconds { request = "instance_list"}
cloudprovider_gce_api_request_duration_seconds { request = "disk_insert"}
cloudprovider_gce_api_request_duration_seconds { request = "disk_delete"}
cloudprovider_gce_api_request_duration_seconds { request = "attach_disk"}
cloudprovider_gce_api_request_duration_seconds { request = "detach_disk"}
cloudprovider_gce_api_request_duration_seconds { request = "list_disk"}
kube-scheduler metrics
Kubernetes v1.21 [beta]
The scheduler exposes optional metrics that report the requested resources and the desired limits of all running pods. These metrics can be used to build capacity planning dashboards, assess current or historical scheduling limits, quickly identify workloads that cannot schedule due to lack of resources, and compare actual usage to the pod's request.
The kube-scheduler identifies the resource requests and limits configured for each Pod; when either a request or limit is non-zero, the kube-scheduler reports a metric time series. The time series is labelled by:
- namespace
- pod name
- the node where the pod is scheduled or an empty string if not yet scheduled
- priority
- the assigned scheduler for that pod
- the name of the resource (for example, cpu)
- the unit of the resource if known (for example, cores)
Once a pod reaches completion (it has a restartPolicy of Never or OnFailure and is in the Succeeded or Failed pod phase, or it has been deleted and all containers have a terminated state), the series is no longer reported, since the scheduler is now free to schedule other pods to run. The two metrics are called kube_pod_resource_request and kube_pod_resource_limit.
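For illustration, a reported series might look like the following in Prometheus exposition format; the label values here are invented:
kube_pod_resource_request{namespace="default",pod="my-pod",node="node-1",scheduler="default-scheduler",priority="0",resource="cpu",unit="cores"} 0.5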
The metrics are exposed at the HTTP endpoint /metrics/resources and require the same authorization as the /metrics endpoint on the scheduler. You must use the --show-hidden-metrics-for-version=1.20 flag to expose these alpha stability metrics.
Disabling metrics
You can explicitly turn off metrics via the command line flag --disabled-metrics. This may be desired if, for example, a metric is causing a performance problem. The input is a list of disabled metrics (i.e. --disabled-metrics=metric1,metric2).
Metric cardinality enforcement
Metrics with unbounded dimensions could cause memory issues in the components they instrument. To limit resource use, you can use the --allow-label-value command line option to dynamically configure an allow-list of label values for a metric.
In the alpha stage, the flag can only take in a series of mappings as the metric label allow-list. Each mapping is of the format <metric_name>,<label_name>=<allowed_labels>, where <allowed_labels> is a comma-separated list of acceptable label values.
The overall format looks like: --allow-label-value <metric_name>,<label_name>='<allow_value1>, <allow_value2>...', <metric_name2>,<label_name>='<allow_value1>, <allow_value2>...', ...
Here is an example:
--allow-label-value number_count_metric,odd_number='1,3,5', number_count_metric,even_number='2,4,6', date_gauge_metric,weekend='Saturday,Sunday'
What's next
- Read about the Prometheus text format for metrics
- See the list of stable Kubernetes metrics
- Read about the Kubernetes deprecation policy
3.11.6 - System Logs
System component logs record events happening in a cluster, which can be very useful for debugging. You can configure log verbosity to see more or less detail. Logs can be as coarse-grained as showing errors within a component, or as fine-grained as showing step-by-step traces of events (like HTTP access logs, pod state changes, controller actions, or scheduler decisions).
Klog
klog is the Kubernetes logging library. klog generates log messages for the Kubernetes system components.
For more information about klog configuration, see the Command line tool reference.
Kubernetes is in the process of simplifying logging in its components. The following klog command line flags are deprecated starting with Kubernetes 1.23 and will be removed in a future release:
--add-dir-header
--alsologtostderr
--log-backtrace-at
--log-dir
--log-file
--log-file-max-size
--logtostderr
--one-output
--skip-headers
--skip-log-headers
--stderrthreshold
Output will always be written to stderr, regardless of the output format. Output redirection is expected to be handled by the component which invokes a Kubernetes component. This can be a POSIX shell or a tool like systemd.
In some cases, for example a distroless container or a Windows system service,
those options are not available. Then the
kube-log-runner
binary can be used as wrapper around a Kubernetes component to redirect
output. A prebuilt binary is included in several Kubernetes base images under
its traditional name as /go-runner
and as kube-log-runner
in server and
node release archives.
This table shows how kube-log-runner invocations correspond to shell redirection:

| Usage | POSIX shell (such as bash) | kube-log-runner <options> <cmd> |
|---|---|---|
| Merge stderr and stdout, write to stdout | 2>&1 | kube-log-runner (default behavior) |
| Redirect both into log file | 1>>/tmp/log 2>&1 | kube-log-runner -log-file=/tmp/log |
| Copy into log file and to stdout | 2>&1 \| tee -a /tmp/log | kube-log-runner -log-file=/tmp/log -also-stdout |
| Redirect only stdout into log file | >/tmp/log | kube-log-runner -log-file=/tmp/log -redirect-stderr=false |
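Putting the table's flags together, a concrete invocation might look like this; the log path and wrapped command are arbitrary examples:
kube-log-runner -log-file=/var/log/echo.log -also-stdout echo "hello world"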
Klog output
An example of the traditional klog native format:
I1025 00:15:15.525108 1 httplog.go:79] GET /api/v1/namespaces/kube-system/pods/metrics-server-v0.3.1-57c75779f-9p8wg: (1.512ms) 200 [pod_nanny/v0.0.0 (linux/amd64) kubernetes/$Format 10.56.1.19:51756]
The message string may contain line breaks:
I1025 00:15:15.525108 1 example.go:79] This is a message
which has a line break.
Structured Logging
Kubernetes v1.23 [beta]
Migration to structured log messages is an ongoing process. Not all log messages are structured in this version. When parsing log files, you must also handle unstructured log messages.
Log formatting and value serialization are subject to change.
Structured logging introduces a uniform structure in log messages allowing for programmatic extraction of information. You can store and process structured logs with less effort and cost. The code which generates a log message determines whether it uses the traditional unstructured klog output or structured logging.
The default formatting of structured log messages is as text, with a format that is backward compatible with traditional klog:
<klog header> "<message>" <key1>="<value1>" <key2>="<value2>" ...
Example:
I1025 00:15:15.525108 1 controller_utils.go:116] "Pod status updated" pod="kube-system/kubedns" status="ready"
Strings are quoted. Other values are formatted with %+v, which may cause log messages to continue on the next line depending on the data.
I1025 00:15:15.525108 1 example.go:116] "Example" data="This is text with a line break\nand \"quotation marks\"." someInt=1 someFloat=0.1 someStruct={StringField: First line,
second line.}
JSON log format
Kubernetes v1.19 [alpha]
JSON output does not support many standard klog flags. For list of unsupported klog flags, see the Command line tool reference.
Not all logs are guaranteed to be written in JSON format (for example, during process start). If you intend to parse logs, make sure you can handle log lines that are not JSON as well.
Field names and JSON serialization are subject to change.
The --logging-format=json
flag changes the format of logs from klog native format to JSON format.
Example of JSON log format (pretty printed):
{
"ts": 1580306777.04728,
"v": 4,
"msg": "Pod status updated",
"pod":{
"name": "nginx-1",
"namespace": "default"
},
"status": "ready"
}
Keys with special meaning:
- ts - timestamp as Unix time (required, float)
- v - verbosity (only for info and not for error messages, int)
- err - error string (optional, string)
- msg - message (required, string)
List of components currently supporting JSON format:
- kube-controller-manager
- kube-apiserver
- kube-scheduler
- kubelet
Log sanitization
Kubernetes v1.20 [alpha]
The --experimental-logging-sanitization flag enables the klog sanitization filter. If enabled, all log arguments are inspected for fields tagged as sensitive data (e.g. passwords, keys, tokens), and logging of these fields will be prevented.
List of components currently supporting log sanitization:
- kube-controller-manager
- kube-apiserver
- kube-scheduler
- kubelet
Log verbosity level
The -v flag controls log verbosity. Increasing the value increases the number of logged events. Decreasing the value decreases the number of logged events. Increasing verbosity settings logs increasingly less severe events. A verbosity setting of 0 logs only critical events.
Log location
There are two types of system components: those that run in a container and those that do not run in a container. For example:
- The Kubernetes scheduler and kube-proxy run in a container.
- The kubelet and container runtime do not run in containers.
On machines with systemd, the kubelet and container runtime write to journald.
Otherwise, they write to .log files in the /var/log directory. System components inside containers always write to .log files in the /var/log directory, bypassing the default logging mechanism.
Similar to the container logs, you should rotate system component logs in the /var/log directory. In Kubernetes clusters created by the kube-up.sh script, log rotation is configured by the logrotate tool. The logrotate tool rotates logs daily, or once the log size is greater than 100MB.
What's next
- Read about the Kubernetes Logging Architecture
- Read about Structured Logging
- Read about deprecation of klog flags
- Read about the Conventions for logging severity
3.11.7 - Traces For Kubernetes System Components
Kubernetes v1.22 [alpha]
System component traces record the latency of and relationships between operations in the cluster.
Kubernetes components emit traces using the OpenTelemetry Protocol with the gRPC exporter and can be collected and routed to tracing backends using an OpenTelemetry Collector.
Trace Collection
For a complete guide to collecting traces and using the collector, see Getting Started with the OpenTelemetry Collector. However, there are a few things to note that are specific to Kubernetes components.
By default, Kubernetes components export traces using the grpc exporter for OTLP on the IANA OpenTelemetry port, 4317. As an example, if the collector is running as a sidecar to a Kubernetes component, the following receiver configuration will collect spans and log them to standard output:
receivers:
otlp:
protocols:
grpc:
exporters:
# Replace this exporter with the exporter for your backend
logging:
logLevel: debug
service:
pipelines:
traces:
receivers: [otlp]
exporters: [logging]
Component traces
kube-apiserver traces
The kube-apiserver generates spans for incoming HTTP requests, and for outgoing requests to webhooks, etcd, and re-entrant requests. It propagates the W3C Trace Context with outgoing requests but does not make use of the trace context attached to incoming requests, as the kube-apiserver is often a public endpoint.
Enabling tracing in the kube-apiserver
To enable tracing, enable the APIServerTracing feature gate on the kube-apiserver. Also, provide the kube-apiserver with a tracing configuration file with --tracing-config-file=<path-to-config>. This is an example config that records spans for 1 in 10000 requests, and uses the default OpenTelemetry endpoint:
apiVersion: apiserver.config.k8s.io/v1alpha1
kind: TracingConfiguration
# default value
#endpoint: localhost:4317
samplingRatePerMillion: 100
For more information about the TracingConfiguration struct, see API server config API (v1alpha1).
Stability
Tracing instrumentation is still under active development, and may change in a variety of ways. This includes span names, attached attributes, instrumented endpoints, etc. Until this feature graduates to stable, there are no guarantees of backwards compatibility for tracing instrumentation.
What's next
3.11.8 - Proxies in Kubernetes
This page explains proxies used with Kubernetes.
Proxies
There are several different proxies you may encounter when using Kubernetes:
-
The kubectl proxy:
- runs on a user's desktop or in a pod
- proxies from a localhost address to the Kubernetes apiserver
- client to proxy uses HTTP
- proxy to apiserver uses HTTPS
- locates apiserver
- adds authentication headers
-
The apiserver proxy:
- is a bastion built into the apiserver
- connects a user outside of the cluster to cluster IPs which otherwise might not be reachable
- runs in the apiserver processes
- client to proxy uses HTTPS (or http if apiserver so configured)
- proxy to target may use HTTP or HTTPS as chosen by proxy using available information
- can be used to reach a Node, Pod, or Service
- does load balancing when used to reach a Service
-
The kube proxy:
- runs on each node
- proxies UDP, TCP and SCTP
- does not understand HTTP
- provides load balancing
- is only used to reach services
-
A Proxy/Load-balancer in front of apiserver(s):
- existence and implementation varies from cluster to cluster (e.g. nginx)
- sits between all clients and one or more apiservers
- acts as load balancer if there are several apiservers.
-
Cloud Load Balancers on external services:
- are provided by some cloud providers (e.g. AWS ELB, Google Cloud Load Balancer)
- are created automatically when the Kubernetes service has type
LoadBalancer
- usually supports UDP/TCP only
- SCTP support is up to the load balancer implementation of the cloud provider
- implementation varies by cloud provider.
Kubernetes users will typically not need to worry about anything other than the first two types. The cluster admin will typically ensure that the latter types are set up correctly.
Requesting redirects
Proxies have replaced redirect capabilities. Redirects have been deprecated.
3.11.9 - API Priority and Fairness
Kubernetes v1.20 [beta]
Controlling the behavior of the Kubernetes API server in an overload situation
is a key task for cluster administrators. The kube-apiserver has some controls available
(i.e. the --max-requests-inflight
and --max-mutating-requests-inflight
command-line flags) to limit the amount of outstanding work that will be
accepted, preventing a flood of inbound requests from overloading and
potentially crashing the API server, but these flags are not enough to ensure
that the most important requests get through in a period of high traffic.
The API Priority and Fairness feature (APF) is an alternative that improves upon aforementioned max-inflight limitations. APF classifies and isolates requests in a more fine-grained way. It also introduces a limited amount of queuing, so that no requests are rejected in cases of very brief bursts. Requests are dispatched from queues using a fair queuing technique so that, for example, a poorly-behaved controller need not starve others (even at the same priority level).
This feature is designed to work well with standard controllers, which use informers and react to failures of API requests with exponential back-off, and other clients that also work this way.
Note: Requests classified as "long-running", primarily watches, are not subject to the API Priority and Fairness filter. This is also true for the --max-requests-inflight flag without the API Priority and Fairness feature enabled.
Enabling/Disabling API Priority and Fairness
The API Priority and Fairness feature is controlled by a feature gate
and is enabled by default. See Feature
Gates
for a general explanation of feature gates and how to enable and
disable them. The name of the feature gate for APF is
"APIPriorityAndFairness". This feature also involves an API Group with: (a) a
v1alpha1
version, disabled by default, and (b) v1beta1
and
v1beta2
versions, enabled by default. You can disable the feature
gate and API group beta versions by adding the following
command-line flags to your kube-apiserver
invocation:
kube-apiserver \
--feature-gates=APIPriorityAndFairness=false \
--runtime-config=flowcontrol.apiserver.k8s.io/v1beta1=false,flowcontrol.apiserver.k8s.io/v1beta2=false \
# …and other flags as usual
Alternatively, you can enable the v1alpha1 version of the API group with --runtime-config=flowcontrol.apiserver.k8s.io/v1alpha1=true.
The command-line flag --enable-priority-and-fairness=false
will disable the
API Priority and Fairness feature, even if other flags have enabled it.
Concepts
There are several distinct features involved in the API Priority and Fairness feature. Incoming requests are classified by attributes of the request using FlowSchemas, and assigned to priority levels. Priority levels add a degree of isolation by maintaining separate concurrency limits, so that requests assigned to different priority levels cannot starve each other. Within a priority level, a fair-queuing algorithm prevents requests from different flows from starving each other, and allows for requests to be queued to prevent bursty traffic from causing failed requests when the average load is acceptably low.
Priority Levels
Without APF enabled, overall concurrency in the API server is limited by the
kube-apiserver
flags --max-requests-inflight
and
--max-mutating-requests-inflight
. With APF enabled, the concurrency limits
defined by these flags are summed and then the sum is divided up among a
configurable set of priority levels. Each incoming request is assigned to a
single priority level, and each priority level will only dispatch as many
concurrent requests as its configuration allows.
The default configuration, for example, includes separate priority levels for leader-election requests, requests from built-in controllers, and requests from Pods. This means that an ill-behaved Pod that floods the API server with requests cannot prevent leader election or actions by the built-in controllers from succeeding.
Queuing
Even within a priority level there may be a large number of distinct sources of traffic. In an overload situation, it is valuable to prevent one stream of requests from starving others (in particular, in the relatively common case of a single buggy client flooding the kube-apiserver with requests, that buggy client would ideally not have much measurable impact on other clients at all). This is handled by use of a fair-queuing algorithm to process requests that are assigned the same priority level. Each request is assigned to a flow, identified by the name of the matching FlowSchema plus a flow distinguisher (either the requesting user, the target resource's namespace, or nothing), and the system attempts to give approximately equal weight to requests in different flows of the same priority level. To enable distinct handling of distinct instances, controllers that have many instances should authenticate with distinct usernames.
After classifying a request into a flow, the API Priority and Fairness feature then may assign the request to a queue. This assignment uses a technique known as shuffle sharding, which makes relatively efficient use of queues to insulate low-intensity flows from high-intensity flows.
The details of the queuing algorithm are tunable for each priority level, and allow administrators to trade off memory use, fairness (the property that independent flows will all make progress when total traffic exceeds capacity), tolerance for bursty traffic, and the added latency induced by queuing.
Exempt requests
Some requests are considered sufficiently important that they are not subject to any of the limitations imposed by this feature. These exemptions prevent an improperly-configured flow control configuration from totally disabling an API server.
Resources
The flow control API involves two kinds of resources.
PriorityLevelConfigurations
define the available isolation classes, the share of the available concurrency
budget that each can handle, and allow for fine-tuning queuing behavior.
FlowSchemas
are used to classify individual inbound requests, matching each to a
single PriorityLevelConfiguration. There is also a v1alpha1
version
of the same API group, and it has the same Kinds with the same syntax and
semantics.
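You can inspect these configuration objects with standard kubectl commands, for example:
kubectl get prioritylevelconfigurations
kubectl get flowschemas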
PriorityLevelConfiguration
A PriorityLevelConfiguration represents a single isolation class. Each PriorityLevelConfiguration has an independent limit on the number of outstanding requests, and limitations on the number of queued requests.
Concurrency limits for PriorityLevelConfigurations are not specified in absolute
number of requests, but rather in "concurrency shares." The total concurrency
limit for the API Server is distributed among the existing
PriorityLevelConfigurations in proportion with these shares. This allows a
cluster administrator to scale up or down the total amount of traffic to a
server by restarting kube-apiserver
with a different value for
--max-requests-inflight
(or --max-mutating-requests-inflight
), and all
PriorityLevelConfigurations will see their maximum allowed concurrency go up (or
down) by the same fraction.
Note: With the Priority and Fairness feature enabled, the total concurrency limit for the server is set to the sum of --max-requests-inflight and --max-mutating-requests-inflight. There is no longer any distinction made between mutating and non-mutating requests; if you want to treat them separately for a given resource, make separate FlowSchemas that match the mutating and non-mutating verbs respectively.
When the volume of inbound requests assigned to a single
PriorityLevelConfiguration is more than its permitted concurrency level, the
type
field of its specification determines what will happen to extra requests.
A type of Reject
means that excess traffic will immediately be rejected with
an HTTP 429 (Too Many Requests) error. A type of Queue
means that requests
above the threshold will be queued, with the shuffle sharding and fair queuing techniques used
to balance progress between request flows.
The queuing configuration allows tuning the fair queuing algorithm for a priority level. Details of the algorithm can be read in the enhancement proposal, but in short:
- Increasing queues reduces the rate of collisions between different flows, at the cost of increased memory usage. A value of 1 here effectively disables the fair-queuing logic, but still allows requests to be queued.
- Increasing queueLengthLimit allows larger bursts of traffic to be sustained without dropping any requests, at the cost of increased latency and memory usage.
- Changing handSize allows you to adjust the probability of collisions between different flows and the overall concurrency available to a single flow in an overload situation. Note: A larger handSize makes it less likely for two individual flows to collide (and therefore for one to be able to starve the other), but more likely that a small number of flows can dominate the apiserver. A larger handSize also potentially increases the amount of latency that a single high-traffic flow can cause. The maximum number of queued requests possible from a single flow is handSize * queueLengthLimit.
Following is a table showing an interesting collection of shuffle sharding configurations, showing for each the probability that a given mouse (low-intensity flow) is squished by the elephants (high-intensity flows) for an illustrative collection of numbers of elephants. See https://play.golang.org/p/Gi0PLgVHiUg, which computes this table.

| HandSize | Queues | 1 elephant | 4 elephants | 16 elephants |
|---|---|---|---|---|
| 12 | 32 | 4.428838398950118e-09 | 0.11431348830099144 | 0.9935089607656024 |
| 10 | 32 | 1.550093439632541e-08 | 0.0626479840223545 | 0.9753101519027554 |
| 10 | 64 | 6.601827268370426e-12 | 0.00045571320990370776 | 0.49999929150089345 |
| 9 | 64 | 3.6310049976037345e-11 | 0.00045501212304112273 | 0.4282314876454858 |
| 8 | 64 | 2.25929199850899e-10 | 0.0004886697053040446 | 0.35935114681123076 |
| 8 | 128 | 6.994461389026097e-13 | 3.4055790161620863e-06 | 0.02746173137155063 |
| 7 | 128 | 1.0579122850901972e-11 | 6.960839379258192e-06 | 0.02406157386340147 |
| 7 | 256 | 7.597695465552631e-14 | 6.728547142019406e-08 | 0.0006709661542533682 |
| 6 | 256 | 2.7134626662687968e-12 | 2.9516464018476436e-07 | 0.0008895654642000348 |
| 6 | 512 | 4.116062922897309e-14 | 4.982983350480894e-09 | 2.26025764343413e-05 |
| 6 | 1024 | 6.337324016514285e-16 | 8.09060164312957e-11 | 4.517408062903668e-07 |
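As a sketch of where these tuning knobs live, here is a minimal PriorityLevelConfiguration using the v1beta2 API; the object name and the specific numbers are illustrative only:
apiVersion: flowcontrol.apiserver.k8s.io/v1beta2
kind: PriorityLevelConfiguration
metadata:
  name: example-priority-level   # hypothetical name
spec:
  type: Limited
  limited:
    assuredConcurrencyShares: 10   # share of the total concurrency budget
    limitResponse:
      type: Queue                  # queue excess requests instead of rejecting them
      queuing:
        queues: 64
        queueLengthLimit: 50
        handSize: 6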
FlowSchema
A FlowSchema matches some inbound requests and assigns them to a priority level. Every inbound request is tested against every FlowSchema in turn, starting with those with the numerically lowest matchingPrecedence (which we take to be the logically highest) and working onward. The first match wins.
Note: Only the first matching FlowSchema for a given request matters, so it is important to set distinct values of matchingPrecedence. If multiple FlowSchemas with equal matchingPrecedence match the same request, the one with the lexicographically smaller name will win, but it's better not to rely on this, and instead to ensure that no two FlowSchemas have the same matchingPrecedence.
A FlowSchema matches a given request if at least one of its rules matches. A rule matches if at least one of its subjects and at least one of its resourceRules or nonResourceRules (depending on whether the incoming request is for a resource or non-resource URL) matches the request.
For the name field in subjects, and the verbs, apiGroups, resources, namespaces, and nonResourceURLs fields of resource and non-resource rules, the wildcard * may be specified to match all values for the given field, effectively removing it from consideration.
A FlowSchema's distinguisherMethod.type determines how requests matching that schema will be separated into flows. It may be either ByUser, in which case one requesting user will not be able to starve other users of capacity; ByNamespace, in which case requests for resources in one namespace will not be able to starve requests for resources in other namespaces of capacity; or blank (or distinguisherMethod may be omitted entirely), in which case all requests matched by this FlowSchema will be considered part of a single flow. The correct choice for a given FlowSchema depends on the resource and your particular environment.
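Here is a minimal FlowSchema sketch using the v1beta2 API that routes requests from one ServiceAccount into the suggested workload-low priority level; the schema name and the ServiceAccount are invented for illustration:
apiVersion: flowcontrol.apiserver.k8s.io/v1beta2
kind: FlowSchema
metadata:
  name: example-flowschema       # hypothetical name
spec:
  matchingPrecedence: 1000
  priorityLevelConfiguration:
    name: workload-low           # one of the suggested priority levels
  distinguisherMethod:
    type: ByUser                 # each requesting user gets its own flow
  rules:
  - subjects:
    - kind: ServiceAccount
      serviceAccount:
        name: example-sa         # hypothetical ServiceAccount
        namespace: default
    resourceRules:
    - verbs: ["*"]
      apiGroups: ["*"]
      resources: ["*"]
      namespaces: ["*"]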
Defaults
Each kube-apiserver maintains two sorts of APF configuration objects: mandatory and suggested.
Mandatory Configuration Objects
The four mandatory configuration objects reflect fixed built-in guardrail behavior. This is behavior that the servers have before those objects exist, and when those objects exist their specs reflect this behavior. The four mandatory objects are as follows.
- The mandatory exempt priority level is used for requests that are not subject to flow control at all: they will always be dispatched immediately. The mandatory exempt FlowSchema classifies all requests from the system:masters group into this priority level. You may define other FlowSchemas that direct other requests to this priority level, if appropriate.
- The mandatory catch-all priority level is used in combination with the mandatory catch-all FlowSchema to make sure that every request gets some kind of classification. Typically you should not rely on this catch-all configuration, and should create your own catch-all FlowSchema and PriorityLevelConfiguration (or use the suggested global-default priority level that is installed by default) as appropriate. Because it is not expected to be used normally, the mandatory catch-all priority level has a very small concurrency share and does not queue requests.
Suggested Configuration Objects
The suggested FlowSchemas and PriorityLevelConfigurations constitute a reasonable default configuration. You can modify these and/or create additional configuration objects if you want. If your cluster is likely to experience heavy load then you should consider what configuration will work best.
The suggested configuration groups requests into six priority levels:
- The node-high priority level is for health updates from nodes.
- The system priority level is for non-health requests from the system:nodes group, i.e. Kubelets, which must be able to contact the API server in order for workloads to be able to schedule on them.
- The leader-election priority level is for leader election requests from built-in controllers (in particular, requests for endpoints, configmaps, or leases coming from the system:kube-controller-manager or system:kube-scheduler users and service accounts in the kube-system namespace). These are important to isolate from other traffic because failures in leader election cause their controllers to fail and restart, which in turn causes more expensive traffic as the new controllers sync their informers.
- The workload-high priority level is for other requests from built-in controllers.
- The workload-low priority level is for requests from any other service account, which will typically include all requests from controllers running in Pods.
- The global-default priority level handles all other traffic, e.g. interactive kubectl commands run by nonprivileged users.
The suggested FlowSchemas serve to steer requests into the above priority levels, and are not enumerated here.
Maintenance of the Mandatory and Suggested Configuration Objects
Each kube-apiserver
independently maintains the mandatory and
suggested configuration objects, using initial and periodic behavior.
Thus, in a situation with a mixture of servers of different versions
there may be thrashing as long as different servers have different
opinions of the proper content of these objects.
Each kube-apiserver
makes an initial maintenance pass over the
mandatory and suggested configuration objects, and after that does
periodic maintenance (once per minute) of those objects.
For the mandatory configuration objects, maintenance consists of ensuring that the object exists and, if it does, has the proper spec. The server refuses to allow a creation or update with a spec that is inconsistent with the server's guardrail behavior.
Maintenance of suggested configuration objects is designed to allow
their specs to be overridden. Deletion, on the other hand, is not
respected: maintenance will restore the object. If you do not want a
suggested configuration object then you need to keep it around but set
its spec to have minimal consequences. Maintenance of suggested
objects is also designed to support automatic migration when a new
version of the kube-apiserver
is rolled out, albeit potentially with
thrashing while there is a mixed population of servers.
Maintenance of a suggested configuration object consists of creating
it --- with the server's suggested spec --- if the object does not
exist. On the other hand, if the object already exists, maintenance
behavior depends on whether the kube-apiservers or the users control
the object. In the former case, the server ensures that the object's
spec is what the server suggests; in the latter case, the spec is left
alone.
The question of who controls the object is answered by first looking
for an annotation with key apf.kubernetes.io/autoupdate-spec. If
there is such an annotation and its value is true, then the
kube-apiservers control the object. If there is such an annotation
and its value is false, then the users control the object. If
neither of those conditions holds, then the metadata.generation of the
object is consulted. If that is 1, then the kube-apiservers control
the object. Otherwise the users control the object. These rules were
introduced in release 1.22, and their consideration of
metadata.generation is for the sake of migration from the simpler
earlier behavior. Users who wish to control a suggested configuration
object should set its apf.kubernetes.io/autoupdate-spec annotation
to false.
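For example, to take control of one of the suggested FlowSchemas (here the suggested service-accounts schema; substitute whichever object you want to own), you could run:
kubectl annotate flowschema service-accounts apf.kubernetes.io/autoupdate-spec=false --overwrite
After this, your edits to that object's spec will no longer be overwritten by the servers' periodic maintenance.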
Maintenance of a mandatory or suggested configuration object also
includes ensuring that it has an apf.kubernetes.io/autoupdate-spec
annotation that accurately reflects whether the kube-apiservers
control the object.
Maintenance also includes deleting objects that are neither mandatory
nor suggested but are annotated apf.kubernetes.io/autoupdate-spec=true.
Health check concurrency exemption
The suggested configuration gives no special treatment to the health
check requests on kube-apiservers from their local kubelets --- which
tend to use the secured port but supply no credentials. With the
suggested config, these requests get assigned to the global-default
FlowSchema and the corresponding global-default
priority level,
where other traffic can crowd them out.
If you add the following additional FlowSchema, this exempts those requests from rate limiting.
apiVersion: flowcontrol.apiserver.k8s.io/v1beta2
kind: FlowSchema
metadata:
  name: health-for-strangers
spec:
  matchingPrecedence: 1000
  priorityLevelConfiguration:
    name: exempt
  rules:
    - nonResourceRules:
        - nonResourceURLs:
            - "/healthz"
            - "/livez"
            - "/readyz"
          verbs:
            - "*"
      subjects:
        - kind: Group
          group:
            name: system:unauthenticated
Diagnostics
Every HTTP response from an API server with the priority and fairness feature
enabled has two extra headers: X-Kubernetes-PF-FlowSchema-UID and
X-Kubernetes-PF-PriorityLevel-UID, noting the flow schema that matched the
request and the priority level to which it was assigned, respectively. The
API objects' names are not included in these headers in case the requesting
user does not have permission to view them, so when debugging you can use a
command like
kubectl get flowschemas -o custom-columns="uid:{metadata.uid},name:{metadata.name}"
kubectl get prioritylevelconfigurations -o custom-columns="uid:{metadata.uid},name:{metadata.name}"
to get a mapping of UIDs to names for both FlowSchemas and PriorityLevelConfigurations.
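Another way to see these headers, assuming you have permission to make a request at all, is to raise kubectl's verbosity so that it prints HTTP response headers, for example:
kubectl get namespaces -v=8
At verbosity 8 and above, kubectl logs the response headers, which will include the two APF headers described above.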
Observability
Metrics
Note: In versions of Kubernetes before v1.20, the labels flow_schema and
priority_level were inconsistently named flowSchema and priorityLevel,
respectively. If you're running Kubernetes versions v1.19 and earlier, you
should refer to the documentation for your version.
When you enable the API Priority and Fairness feature, the kube-apiserver exports additional metrics. Monitoring these can help you determine whether your configuration is inappropriately throttling important traffic, or find poorly-behaved workloads that may be harming system health.
- apiserver_flowcontrol_rejected_requests_total is a counter vector (cumulative since server start) of requests that were rejected, broken down by the labels flow_schema (indicating the one that matched the request), priority_level (indicating the one to which the request was assigned), and reason. The reason label will have one of the following values: queue-full, indicating that too many requests were already queued; concurrency-limit, indicating that the PriorityLevelConfiguration is configured to reject rather than queue excess requests; or time-out, indicating that the request was still in the queue when its queuing time limit expired.
- apiserver_flowcontrol_dispatched_requests_total is a counter vector (cumulative since server start) of requests that began executing, broken down by the labels flow_schema (indicating the one that matched the request) and priority_level (indicating the one to which the request was assigned).
- apiserver_current_inqueue_requests is a gauge vector of recent high water marks of the number of queued requests, grouped by a label named request_kind whose value is mutating or readOnly. These high water marks describe the largest number seen in the one second window most recently completed. These complement the older apiserver_current_inflight_requests gauge vector that holds the last window's high water mark of number of requests actively being served.
- apiserver_flowcontrol_read_vs_write_request_count_samples is a histogram vector of observations of the then-current number of requests, broken down by the labels phase (which takes on the values waiting and executing) and request_kind (which takes on the values mutating and readOnly). The observations are made periodically at a high rate.
- apiserver_flowcontrol_read_vs_write_request_count_watermarks is a histogram vector of high or low water marks of the number of requests, broken down by the labels phase (which takes on the values waiting and executing) and request_kind (which takes on the values mutating and readOnly); the label mark takes on values high and low. The water marks are accumulated over windows bounded by the times when an observation was added to apiserver_flowcontrol_read_vs_write_request_count_samples. These water marks show the range of values that occurred between samples.
- apiserver_flowcontrol_current_inqueue_requests is a gauge vector holding the instantaneous number of queued (not executing) requests, broken down by the labels priority_level and flow_schema.
- apiserver_flowcontrol_current_executing_requests is a gauge vector holding the instantaneous number of executing (not waiting in a queue) requests, broken down by the labels priority_level and flow_schema.
- apiserver_flowcontrol_request_concurrency_in_use is a gauge vector holding the instantaneous number of occupied seats, broken down by the labels priority_level and flow_schema.
- apiserver_flowcontrol_priority_level_request_count_samples is a histogram vector of observations of the then-current number of requests, broken down by the labels phase (which takes on the values waiting and executing) and priority_level. Each histogram gets observations taken periodically, up through the last activity of the relevant sort. The observations are made at a high rate.
- apiserver_flowcontrol_priority_level_request_count_watermarks is a histogram vector of high or low water marks of the number of requests, broken down by the labels phase (which takes on the values waiting and executing) and priority_level; the label mark takes on values high and low. The water marks are accumulated over windows bounded by the times when an observation was added to apiserver_flowcontrol_priority_level_request_count_samples. These water marks show the range of values that occurred between samples.
- apiserver_flowcontrol_request_queue_length_after_enqueue is a histogram vector of queue lengths for the queues, broken down by the labels priority_level and flow_schema, as sampled by the enqueued requests. Each request that gets queued contributes one sample to its histogram, reporting the length of the queue immediately after the request was added. Note that this produces different statistics than an unbiased survey would.
  Note: An outlier value in a histogram here means it is likely that a single flow (i.e., requests by one user or for one namespace, depending on configuration) is flooding the API server, and being throttled. By contrast, if one priority level's histogram shows that all queues for that priority level are longer than those for other priority levels, it may be appropriate to increase that PriorityLevelConfiguration's concurrency shares.
- apiserver_flowcontrol_request_concurrency_limit is a gauge vector holding the computed concurrency limit (based on the API server's total concurrency limit and PriorityLevelConfigurations' concurrency shares), broken down by the label priority_level.
- apiserver_flowcontrol_request_wait_duration_seconds is a histogram vector of how long requests spent queued, broken down by the labels flow_schema (indicating which one matched the request), priority_level (indicating the one to which the request was assigned), and execute (indicating whether the request started executing).
  Note: Since each FlowSchema always assigns requests to a single PriorityLevelConfiguration, you can add the histograms for all the FlowSchemas for one priority level to get the effective histogram for requests assigned to that priority level.
- apiserver_flowcontrol_request_execution_seconds is a histogram vector of how long requests took to actually execute, broken down by the labels flow_schema (indicating which one matched the request) and priority_level (indicating the one to which the request was assigned).
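To spot-check any of these metrics without a full monitoring pipeline, you can scrape the API server's metrics endpoint directly; the grep pattern here is just an illustration:
kubectl get --raw /metrics | grep apiserver_flowcontrol_rejected_requests_total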
Debug endpoints
When you enable the API Priority and Fairness feature, the kube-apiserver
serves the following additional paths at its HTTP[S] ports.
- /debug/api_priority_and_fairness/dump_priority_levels - a listing of all the priority levels and the current state of each. You can fetch like this:
kubectl get --raw /debug/api_priority_and_fairness/dump_priority_levels
The output is similar to this:
PriorityLevelName, ActiveQueues, IsIdle, IsQuiescing, WaitingRequests, ExecutingRequests,
workload-low,      0,            true,   false,       0,               0,
global-default,    0,            true,   false,       0,               0,
exempt,            <none>,       <none>, <none>,      <none>,          <none>,
catch-all,         0,            true,   false,       0,               0,
system,            0,            true,   false,       0,               0,
leader-election,   0,            true,   false,       0,               0,
workload-high,     0,            true,   false,       0,               0,
- /debug/api_priority_and_fairness/dump_queues - a listing of all the queues and their current state. You can fetch like this:
kubectl get --raw /debug/api_priority_and_fairness/dump_queues
The output is similar to this:
PriorityLevelName, Index, PendingRequests, ExecutingRequests, VirtualStart,
workload-high,     0,     0,               0,                 0.0000,
workload-high,     1,     0,               0,                 0.0000,
workload-high,     2,     0,               0,                 0.0000,
...
leader-election,   14,    0,               0,                 0.0000,
leader-election,   15,    0,               0,                 0.0000,
- /debug/api_priority_and_fairness/dump_requests - a listing of all the requests that are currently waiting in a queue. You can fetch like this:
kubectl get --raw /debug/api_priority_and_fairness/dump_requests
The output is similar to this:
PriorityLevelName, FlowSchemaName, QueueIndex, RequestIndexInQueue, FlowDistingsher,       ArriveTime,
exempt,            <none>,         <none>,     <none>,              <none>,                <none>,
system,            system-nodes,   12,         0,                   system:node:127.0.0.1, 2020-07-23T15:26:57.179170694Z,
In addition to the queued requests, the output includes one phantom line for each priority level that is exempt from limitation.
You can get a more detailed listing with a command like this:
kubectl get --raw '/debug/api_priority_and_fairness/dump_requests?includeRequestDetails=1'
The output is similar to this:
PriorityLevelName, FlowSchemaName, QueueIndex, RequestIndexInQueue, FlowDistingsher,       ArriveTime,                     UserName,              Verb,   APIPath,
system,            system-nodes,   12,         0,                   system:node:127.0.0.1, 2020-07-23T15:31:03.583823404Z, system:node:127.0.0.1, create, /api/v1/namespaces/scaletest/configmaps,
system,            system-nodes,   12,         1,                   system:node:127.0.0.1, 2020-07-23T15:31:03.594555947Z, system:node:127.0.0.1, create, /api/v1/namespaces/scaletest/configmaps,
What's next
For background information on design details for API priority and fairness, see the enhancement proposal. You can make suggestions and feature requests via SIG API Machinery or the feature's slack channel.
3.11.10 - Installing Addons
Add-ons extend the functionality of Kubernetes.
This page lists some of the available add-ons and links to their respective installation instructions.
Networking and Network Policy
- ACI provides integrated container networking and network security with Cisco ACI.
- Antrea operates at Layer 3/4 to provide networking and security services for Kubernetes, leveraging Open vSwitch as the networking data plane.
- Calico is a networking and network policy provider. Calico supports a flexible set of networking options so you can choose the most efficient option for your situation, including non-overlay and overlay networks, with or without BGP. Calico uses the same engine to enforce network policy for hosts, pods, and (if using Istio & Envoy) applications at the service mesh layer.
- Canal unites Flannel and Calico, providing networking and network policy.
- Cilium is a L3 network and network policy plugin that can enforce HTTP/API/L7 policies transparently. Both routing and overlay/encapsulation mode are supported, and it can work on top of other CNI plugins.
- CNI-Genie enables Kubernetes to seamlessly connect to a choice of CNI plugins, such as Calico, Canal, Flannel, Romana, or Weave.
- Contrail, based on Tungsten Fabric, is an open source, multi-cloud network virtualization and policy management platform. Contrail and Tungsten Fabric are integrated with orchestration systems such as Kubernetes, OpenShift, OpenStack and Mesos, and provide isolation modes for virtual machines, containers/pods and bare metal workloads.
- Flannel is an overlay network provider that can be used with Kubernetes.
- Knitter is a plugin to support multiple network interfaces in a Kubernetes pod.
- Multus is a Multi plugin for multiple network support in Kubernetes to support all CNI plugins (e.g. Calico, Cilium, Contiv, Flannel), in addition to SRIOV, DPDK, OVS-DPDK and VPP based workloads in Kubernetes.
- OVN-Kubernetes is a networking provider for Kubernetes based on OVN (Open Virtual Network), a virtual networking implementation that came out of the Open vSwitch (OVS) project. OVN-Kubernetes provides an overlay based networking implementation for Kubernetes, including an OVS based implementation of load balancing and network policy.
- OVN4NFV-K8S-Plugin is an OVN-based CNI controller plugin that provides cloud-native Service Function Chaining (SFC), multiple OVN overlay networks, dynamic subnet creation, dynamic creation of virtual networks, VLAN provider networks, and direct provider networks; it is pluggable with other multi-network plugins and is ideal for edge-based cloud-native workloads in multi-cluster networking.
- NSX-T Container Plug-in (NCP) provides integration between VMware NSX-T and container orchestrators such as Kubernetes, as well as integration between NSX-T and container-based CaaS/PaaS platforms such as Pivotal Container Service (PKS) and OpenShift.
- Nuage is an SDN platform that provides policy-based networking between Kubernetes Pods and non-Kubernetes environments with visibility and security monitoring.
- Romana is a Layer 3 networking solution for pod networks that also supports the NetworkPolicy API. Kubeadm add-on installation details available here.
- Weave Net provides networking and network policy, will carry on working on both sides of a network partition, and does not require an external database.
Service Discovery
- CoreDNS is a flexible, extensible DNS server which can be installed as the in-cluster DNS for pods.
Visualization & Control
- Dashboard is a dashboard web interface for Kubernetes.
- Weave Scope is a tool for graphically visualizing your containers, pods, services etc. Use it in conjunction with a Weave Cloud account or host the UI yourself.
Infrastructure
- KubeVirt is an add-on to run virtual machines on Kubernetes. Usually run on bare-metal clusters.
- The node problem detector runs on Linux nodes and reports system issues as either Events or Node conditions.
Legacy Add-ons
There are several other add-ons documented in the deprecated cluster/addons directory.
Well-maintained ones should be linked to here. PRs welcome!
3.12 - Extending Kubernetes
Kubernetes is highly configurable and extensible. As a result, there is rarely a need to fork or submit patches to the Kubernetes project code.
This guide describes the options for customizing a Kubernetes cluster. It is aimed at cluster operators who want to understand how to adapt their Kubernetes cluster to the needs of their work environment. Developers who are prospective Platform Developers or Kubernetes Project Contributors will also find it useful as an introduction to what extension points and patterns exist, and their trade-offs and limitations.
Overview
Customization approaches can be broadly divided into configuration, which only involves changing flags, local configuration files, or API resources; and extensions, which involve running additional programs or services. This document is primarily about extensions.
Configuration
Configuration files and flags are documented in the Reference section of the online documentation, under each binary:
Flags and configuration files may not always be changeable in a hosted Kubernetes service or a distribution with managed installation. When they are changeable, they are usually only changeable by the cluster administrator. Also, they are subject to change in future Kubernetes versions, and setting them may require restarting processes. For those reasons, they should be used only when there are no other options.
Built-in Policy APIs, such as ResourceQuota, PodSecurityPolicies, NetworkPolicy and Role-based Access Control (RBAC), are built-in Kubernetes APIs. APIs are typically used with hosted Kubernetes services and with managed Kubernetes installations. They are declarative and use the same conventions as other Kubernetes resources like pods, so new cluster configuration can be repeatable and be managed the same way as applications. And, where they are stable, they enjoy a defined support policy like other Kubernetes APIs. For these reasons, they are preferred over configuration files and flags where suitable.
Extensions
Extensions are software components that extend and deeply integrate with Kubernetes. They adapt it to support new types and new kinds of hardware.
Many cluster administrators use a hosted or distribution instance of Kubernetes. These clusters come with extensions pre-installed. As a result, most Kubernetes users will not need to install extensions and even fewer users will need to author new ones.
Extension Patterns
Kubernetes is designed to be automated by writing client programs. Any program that reads and/or writes to the Kubernetes API can provide useful automation. Automation can run on the cluster or off it. By following the guidance in this doc you can write highly available and robust automation. Automation generally works with any Kubernetes cluster, including hosted clusters and managed installations.
There is a specific pattern for writing client programs that work well with
Kubernetes called the Controller pattern. Controllers typically read an
object's .spec, possibly do things, and then update the object's .status.
A controller is a client of Kubernetes. When Kubernetes is the client and calls out to a remote service, it is called a Webhook. The remote service is called a Webhook Backend. Like Controllers, Webhooks do add a point of failure.
In the webhook model, Kubernetes makes a network request to a remote service. In the Binary Plugin model, Kubernetes executes a binary (program). Binary plugins are used by the kubelet (e.g. Flex Volume Plugins and Network Plugins) and by kubectl.
Below is a diagram showing how the extension points interact with the Kubernetes control plane.
Extension Points
This diagram shows the extension points in a Kubernetes system.
- Users often interact with the Kubernetes API using kubectl. Kubectl plugins extend the kubectl binary. They only affect the individual user's local environment, and so cannot enforce site-wide policies.
- The apiserver handles all requests. Several types of extension points in the apiserver allow authenticating requests, or blocking them based on their content, editing content, and handling deletion. These are described in the API Access Extensions section.
- The apiserver serves various kinds of resources. Built-in resource kinds, like pods, are defined by the Kubernetes project and can't be changed. You can also add resources that you define, or that other projects have defined, called Custom Resources, as explained in the Custom Resources section. Custom Resources are often used with API Access Extensions.
- The Kubernetes scheduler decides which nodes to place pods on. There are several ways to extend scheduling. These are described in the Scheduler Extensions section.
- Much of the behavior of Kubernetes is implemented by programs called Controllers which are clients of the API-Server. Controllers are often used in conjunction with Custom Resources.
- The kubelet runs on servers, and helps pods appear like virtual servers with their own IPs on the cluster network. Network Plugins allow for different implementations of pod networking.
- The kubelet also mounts and unmounts volumes for containers. New types of storage can be supported via Storage Plugins.
If you are unsure where to start, this flowchart can help. Note that some solutions may involve several types of extensions.
API Extensions
User-Defined Types
Consider adding a Custom Resource to Kubernetes if you want to define new controllers, application configuration objects or other declarative APIs, and to manage them using Kubernetes tools, such as kubectl.
Do not use a Custom Resource as data storage for application, user, or monitoring data.
For more about Custom Resources, see the Custom Resources concept guide.
Combining New APIs with Automation
The combination of a custom resource API and a control loop is called the Operator pattern. The Operator pattern is used to manage specific, usually stateful, applications. These custom APIs and control loops can also be used to control other resources, such as storage or policies.
Changing Built-in Resources
When you extend the Kubernetes API by adding custom resources, the added resources always fall into a new API Group. You cannot replace or change existing API groups. Adding an API does not directly let you affect the behavior of existing APIs (e.g. Pods), but API Access Extensions do.
API Access Extensions
When a request reaches the Kubernetes API Server, it is first Authenticated, then Authorized, then subject to various types of Admission Control. See Controlling Access to the Kubernetes API for more on this flow.
Each of these steps offers extension points.
Kubernetes has several built-in authentication methods that it supports. It can also sit behind an authenticating proxy, and it can send a token from an Authorization header to a remote service for verification (a webhook). All of these methods are covered in the Authentication documentation.
Authentication
Authentication maps headers or certificates in all requests to a username for the client making the request.
Kubernetes provides several built-in authentication methods, and an Authentication webhook method if those don't meet your needs.
Authorization
Authorization determines whether specific users can read, write, and do other operations on API resources. It works at the level of whole resources -- it doesn't discriminate based on arbitrary object fields. If the built-in authorization options don't meet your needs, Authorization webhook allows calling out to user-provided code to make an authorization decision.
Dynamic Admission Control
After a request is authorized, if it is a write operation, it also goes through Admission Control steps. In addition to the built-in steps, there are several extensions:
- The Image Policy webhook restricts what images can be run in containers.
- To make arbitrary admission control decisions, a general Admission webhook can be used. Admission Webhooks can reject creations or updates.
Infrastructure Extensions
Storage Plugins
Flex Volumes allow users to mount volume types without built-in support by having the Kubelet call a Binary Plugin to mount the volume.
FlexVolume is deprecated since Kubernetes v1.23. The Out-of-tree CSI driver is the recommended way to write volume drivers in Kubernetes. See Kubernetes Volume Plugin FAQ for Storage Vendors for more information.
Device Plugins
Device plugins allow a node to discover new Node resources (in addition to the built-in ones like cpu and memory).
Network Plugins
Different networking fabrics can be supported via node-level Network Plugins.
Scheduler Extensions
The scheduler is a special type of controller that watches pods, and assigns pods to nodes. The default scheduler can be replaced entirely, while continuing to use other Kubernetes components, or multiple schedulers can run at the same time.
This is a significant undertaking, and almost all Kubernetes users find they do not need to modify the scheduler.
The scheduler also supports a webhook that permits a webhook backend (scheduler extension) to filter and prioritize the nodes chosen for a pod.
What's next
- Learn more about Custom Resources
- Learn about Dynamic admission control
- Learn more about Infrastructure extensions
- Learn about kubectl plugins
- Learn about the Operator pattern
3.12.1 - Extending the Kubernetes API
3.12.1.1 - Custom Resources
Custom resources are extensions of the Kubernetes API. This page discusses when to add a custom resource to your Kubernetes cluster and when to use a standalone service. It describes the two methods for adding custom resources and how to choose between them.
Custom resources
A resource is an endpoint in the Kubernetes API that stores a collection of API objects of a certain kind; for example, the built-in pods resource contains a collection of Pod objects.
A custom resource is an extension of the Kubernetes API that is not necessarily available in a default Kubernetes installation. It represents a customization of a particular Kubernetes installation. However, many core Kubernetes functions are now built using custom resources, making Kubernetes more modular.
Custom resources can appear and disappear in a running cluster through dynamic registration, and cluster admins can update custom resources independently of the cluster itself. Once a custom resource is installed, users can create and access its objects using kubectl, just as they do for built-in resources like Pods.
Custom controllers
On their own, custom resources let you store and retrieve structured data. When you combine a custom resource with a custom controller, custom resources provide a true declarative API.
The Kubernetes declarative API enforces a separation of responsibilities. You declare the desired state of your resource. The Kubernetes controller keeps the current state of Kubernetes objects in sync with your declared desired state. This is in contrast to an imperative API, where you instruct a server what to do.
You can deploy and update a custom controller on a running cluster, independently of the cluster's lifecycle. Custom controllers can work with any kind of resource, but they are especially effective when combined with custom resources. The Operator pattern combines custom resources and custom controllers. You can use custom controllers to encode domain knowledge for specific applications into an extension of the Kubernetes API.
Should I add a custom resource to my Kubernetes Cluster?
When creating a new API, consider whether to aggregate your API with the Kubernetes cluster APIs or let your API stand alone.
Consider API aggregation if: | Prefer a stand-alone API if: |
---|---|
Your API is Declarative. | Your API does not fit the Declarative model. |
You want your new types to be readable and writable using kubectl. | kubectl support is not required. |
You want to view your new types in a Kubernetes UI, such as dashboard, alongside built-in types. | Kubernetes UI support is not required. |
You are developing a new API. | You already have a program that serves your API and works well. |
You are willing to accept the format restriction that Kubernetes puts on REST resource paths, such as API Groups and Namespaces. (See the API Overview.) | You need to have specific REST paths to be compatible with an already defined REST API. |
Your resources are naturally scoped to a cluster or namespaces of a cluster. | Cluster or namespace scoped resources are a poor fit; you need control over the specifics of resource paths. |
You want to reuse Kubernetes API support features. | You don't need those features. |
Declarative APIs
In a Declarative API, typically:
- Your API consists of a relatively small number of relatively small objects (resources).
- The objects define configuration of applications or infrastructure.
- The objects are updated relatively infrequently.
- Humans often need to read and write the objects.
- The main operations on the objects are CRUD-y (creating, reading, updating and deleting).
- Transactions across objects are not required: the API represents a desired state, not an exact state.
Imperative APIs are not declarative. Signs that your API might not be declarative include:
- The client says "do this", and then gets a synchronous response back when it is done.
- The client says "do this", and then gets an operation ID back, and has to check a separate Operation object to determine completion of the request.
- You talk about Remote Procedure Calls (RPCs).
- Directly storing large amounts of data; for example, > a few kB per object, or > 1000s of objects.
- High bandwidth access (10s of requests per second sustained) needed.
- Store end-user data (such as images, PII, etc.) or other large-scale data processed by applications.
- The natural operations on the objects are not CRUD-y.
- The API is not easily modeled as objects.
- You chose to represent pending operations with an operation ID or an operation object.
Should I use a configMap or a custom resource?
Use a ConfigMap if any of the following apply:
- There is an existing, well-documented config file format, such as a mysql.cnf or pom.xml.
- You want to put the entire config file into one key of a configMap (see the sketch after this list).
- The main use of the config file is for a program running in a Pod on your cluster to consume the file to configure itself.
- Consumers of the file prefer to consume via file in a Pod or environment variable in a pod, rather than the Kubernetes API.
- You want to perform rolling updates via Deployment, etc., when the file is updated.
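As a minimal sketch of the whole-file-in-one-key pattern mentioned above (the ConfigMap name and the file contents are hypothetical):
apiVersion: v1
kind: ConfigMap
metadata:
  name: mysql-config        # hypothetical name
data:
  mysql.cnf: |              # the entire config file lives under one key
    [mysqld]
    max_connections=100
A Pod can then mount this ConfigMap as a volume so the program reads mysql.cnf as an ordinary file.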
Use a custom resource (CRD or Aggregated API) if most of the following apply:
- You want to use Kubernetes client libraries and CLIs to create and update the new resource.
- You want top-level support from kubectl; for example, kubectl get my-object object-name.
- You want to build new automation that watches for updates on the new object, and then CRUD other objects, or vice versa.
- You want to write automation that handles updates to the object.
- You want to use Kubernetes API conventions like .spec, .status, and .metadata.
- You want the object to be an abstraction over a collection of controlled resources, or a summarization of other resources.
Adding custom resources
Kubernetes provides two ways to add custom resources to your cluster:
- CRDs are simple and can be created without any programming.
- API Aggregation requires programming, but allows more control over API behaviors like how data is stored and conversion between API versions.
Kubernetes provides these two options to meet the needs of different users, so that neither ease of use nor flexibility is compromised.
Aggregated APIs are subordinate API servers that sit behind the primary API server, which acts as a proxy. This arrangement is called API Aggregation (AA). To users, the Kubernetes API appears extended.
CRDs allow users to create new types of resources without adding another API server. You do not need to understand API Aggregation to use CRDs.
Regardless of how they are installed, the new resources are referred to as Custom Resources to distinguish them from built-in Kubernetes resources (like pods).
CustomResourceDefinitions
The CustomResourceDefinition API resource allows you to define custom resources. Defining a CRD object creates a new custom resource with a name and schema that you specify. The Kubernetes API serves and handles the storage of your custom resource. The name of a CRD object must be a valid DNS subdomain name.
This frees you from writing your own API server to handle the custom resource, but the generic nature of the implementation means you have less flexibility than with API server aggregation.
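For example, here is a minimal sketch of a CRD; the example.com group and CronTab kind are illustrative, not anything built in:
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: crontabs.example.com        # must be <plural>.<group>, a valid DNS subdomain
spec:
  group: example.com                # hypothetical API group
  scope: Namespaced
  names:
    plural: crontabs
    singular: crontab
    kind: CronTab
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                cronSpec:
                  type: string
Once applied, the API server serves the new resource under /apis/example.com/v1/, and kubectl can work with CronTab objects like any built-in resource.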
Refer to the custom controller example for an example of how to register a new custom resource, work with instances of your new resource type, and use a controller to handle events.
API server aggregation
Usually, each resource in the Kubernetes API requires code that handles REST requests and manages persistent storage of objects. The main Kubernetes API server handles built-in resources like pods and services, and can also generically handle custom resources through CRDs.
The aggregation layer allows you to provide specialized implementations for your custom resources by writing and deploying your own API server. The main API server delegates requests to your API server for the custom resources that you handle, making them available to all of its clients.
Choosing a method for adding custom resources
CRDs are easier to use. Aggregated APIs are more flexible. Choose the method that best meets your needs.
Typically, CRDs are a good fit if:
- You have a handful of fields
- You are using the resource within your company, or as part of a small open-source project (as opposed to a commercial product)
Comparing ease of use
CRDs are easier to create than Aggregated APIs.
CRDs | Aggregated API |
---|---|
Do not require programming. Users can choose any language for a CRD controller. | Requires programming and building binary and image. |
No additional service to run; CRDs are handled by API server. | An additional service to create and that could fail. |
No ongoing support once the CRD is created. Any bug fixes are picked up as part of normal Kubernetes Master upgrades. | May need to periodically pickup bug fixes from upstream and rebuild and update the Aggregated API server. |
No need to handle multiple versions of your API; for example, when you control the client for this resource, you can upgrade it in sync with the API. | You need to handle multiple versions of your API; for example, when developing an extension to share with the world. |
Advanced features and flexibility
Aggregated APIs offer more advanced API features and customization of other features; for example, the storage layer.
Feature | Description | CRDs | Aggregated API |
---|---|---|---|
Validation | Help users prevent errors and allow you to evolve your API independently of your clients. These features are most useful when there are many clients who can't all update at the same time. | Yes. Most validation can be specified in the CRD using OpenAPI v3.0 validation. Any other validations supported by addition of a Validating Webhook. | Yes, arbitrary validation checks |
Defaulting | See above | Yes, either via OpenAPI v3.0 validation default keyword (GA in 1.17), or via a Mutating Webhook (though this will not be run when reading from etcd for old objects). | Yes |
Multi-versioning | Allows serving the same object through two API versions. Can help ease API changes like renaming fields. Less important if you control your client versions. | Yes | Yes |
Custom Storage | If you need storage with a different performance mode (for example, a time-series database instead of key-value store) or isolation for security (for example, encryption of sensitive information, etc.) | No | Yes |
Custom Business Logic | Perform arbitrary checks or actions when creating, reading, updating or deleting an object | Yes, using Webhooks. | Yes |
Scale Subresource | Allows systems like HorizontalPodAutoscaler and PodDisruptionBudget to interact with your new resource | Yes | Yes |
Status Subresource | Allows fine-grained access control where user writes the spec section and the controller writes the status section. Allows incrementing object Generation on custom resource data mutation (requires separate spec and status sections in the resource) | Yes | Yes |
Other Subresources | Add operations other than CRUD, such as "logs" or "exec". | No | Yes |
strategic-merge-patch | The new endpoints support PATCH with Content-Type: application/strategic-merge-patch+json. Useful for updating objects that may be modified both locally, and by the server. For more information, see "Update API Objects in Place Using kubectl patch" | No | Yes |
Protocol Buffers | The new resource supports clients that want to use Protocol Buffers | No | Yes |
OpenAPI Schema | Is there an OpenAPI (swagger) schema for the types that can be dynamically fetched from the server? Is the user protected from misspelling field names by ensuring only allowed fields are set? Are types enforced (in other words, don't put an int in a string field?) | Yes, based on the OpenAPI v3.0 validation schema (GA in 1.16). | Yes |
Common Features
When you create a custom resource, either via a CRD or an AA, you get many features for your API, compared to implementing it outside the Kubernetes platform:
Feature | What it does |
---|---|
CRUD | The new endpoints support CRUD basic operations via HTTP and kubectl |
Watch | The new endpoints support Kubernetes Watch operations via HTTP |
Discovery | Clients like kubectl and dashboard automatically offer list, display, and field edit operations on your resources |
json-patch | The new endpoints support PATCH with Content-Type: application/json-patch+json |
merge-patch | The new endpoints support PATCH with Content-Type: application/merge-patch+json |
HTTPS | The new endpoints use HTTPS |
Built-in Authentication | Access to the extension uses the core API server (aggregation layer) for authentication |
Built-in Authorization | Access to the extension can reuse the authorization used by the core API server; for example, RBAC. |
Finalizers | Block deletion of extension resources until external cleanup happens. |
Admission Webhooks | Set default values and validate extension resources during any create/update/delete operation. |
UI/CLI Display | Kubectl, dashboard can display extension resources. |
Unset versus Empty | Clients can distinguish unset fields from zero-valued fields. |
Client Libraries Generation | Kubernetes provides generic client libraries, as well as tools to generate type-specific client libraries. |
Labels and annotations | Common metadata across objects that tools know how to edit for core and custom resources. |
Preparing to install a custom resource
There are several points to be aware of before adding a custom resource to your cluster.
Third party code and new points of failure
While creating a CRD does not automatically add any new points of failure (for example, by causing third party code to run on your API server), packages (for example, Charts) or other installation bundles often include CRDs as well as a Deployment of third-party code that implements the business logic for a new custom resource.
Installing an Aggregated API server always involves running a new Deployment.
Storage
Custom resources consume storage space in the same way that ConfigMaps do. Creating too many custom resources may overload your API server's storage space.
Aggregated API servers may use the same storage as the main API server, in which case the same warning applies.
Authentication, authorization, and auditing
CRDs always use the same authentication, authorization, and audit logging as the built-in resources of your API server.
If you use RBAC for authorization, most RBAC roles will not grant access to the new resources (except the cluster-admin role or any role created with wildcard rules). You'll need to explicitly grant access to the new resources. CRDs and Aggregated APIs often come bundled with new role definitions for the types they add.
Aggregated API servers may or may not use the same authentication, authorization, and auditing as the primary API server.
Accessing a custom resource
Kubernetes client libraries can be used to access custom resources. Not all client libraries support custom resources. The Go and Python client libraries do.
When you add a custom resource, you can access it using:
- kubectl
- The kubernetes dynamic client.
- A REST client that you write.
- A client generated using Kubernetes client generation tools (generating one is an advanced undertaking, but some projects may provide a client along with the CRD or AA).
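For instance, assuming the hypothetical CronTab CRD sketched earlier is installed, the usual kubectl verbs work just as for built-in resources:
kubectl get crontabs
kubectl get crontab my-crontab -o yaml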
What's next
-
Learn how to Extend the Kubernetes API with the aggregation layer.
-
Learn how to Extend the Kubernetes API with CustomResourceDefinition.
3.12.1.2 - Kubernetes API Aggregation Layer
The aggregation layer allows Kubernetes to be extended with additional APIs, beyond what is offered by the core Kubernetes APIs. The additional APIs can either be ready-made solutions such as a metrics server, or APIs that you develop yourself.
The aggregation layer is different from Custom Resources, which are a way to make the kube-apiserver recognise new kinds of object.
Aggregation layer
The aggregation layer runs in-process with the kube-apiserver. Until an extension resource is registered, the aggregation layer will do nothing. To register an API, you add an APIService object, which "claims" the URL path in the Kubernetes API. At that point, the aggregation layer will proxy anything sent to that API path (e.g. /apis/myextension.mycompany.io/v1/…) to the registered APIService.
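A minimal sketch of such a registration, reusing the hypothetical myextension.mycompany.io group from above (the Service name and namespace are also assumptions):
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1.myextension.mycompany.io
spec:
  group: myextension.mycompany.io
  version: v1
  groupPriorityMinimum: 1000
  versionPriority: 15
  service:
    name: my-extension-apiserver    # hypothetical Service fronting the extension API server
    namespace: my-extension         # hypothetical namespace
  caBundle: <base64-encoded CA certificate used to verify the extension server>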
The most common way to implement the APIService is to run an extension API server in Pod(s) that run in your cluster. If you're using the extension API server to manage resources in your cluster, the extension API server (also written as "extension-apiserver") is typically paired with one or more controllers. The apiserver-builder library provides a skeleton for both extension API servers and the associated controller(s).
Response latency
Extension API servers should have low latency networking to and from the kube-apiserver. Discovery requests are required to round-trip from the kube-apiserver in five seconds or less.
If your extension API server cannot achieve that latency requirement, consider making changes that let you meet it.
What's next
- To get the aggregator working in your environment, configure the aggregation layer.
- Then, setup an extension api-server to work with the aggregation layer.
- Read about APIService in the API reference
Alternatively: learn how to extend the Kubernetes API using Custom Resource Definitions.
3.12.2 - Compute, Storage, and Networking Extensions
3.12.2.1 - Network Plugins
Network plugins in Kubernetes come in a few flavors:
- CNI plugins: adhere to the Container Network Interface (CNI) specification, designed for interoperability.
- Kubernetes follows the v0.4.0 release of the CNI specification.
- Kubenet plugin: implements basic cbr0 using the bridge and host-local CNI plugins
Installation
The kubelet has a single default network plugin, and a default network common to the entire cluster. It probes for plugins when it starts up, remembers what it finds, and executes the selected plugin at appropriate times in the pod lifecycle (this is only true for Docker, as CRI manages its own CNI plugins). There are two Kubelet command line parameters to keep in mind when using plugins:
- cni-bin-dir: Kubelet probes this directory for plugins on startup
- network-plugin: The network plugin to use from cni-bin-dir. It must match the name reported by a plugin probed from the plugin directory. For CNI plugins, this is cni.
Network Plugin Requirements
Besides providing the NetworkPlugin
interface to configure and clean up pod networking, the plugin may also need specific support for kube-proxy. The iptables proxy obviously depends on iptables, and the plugin may need to ensure that container traffic is made available to iptables. For example, if the plugin connects containers to a Linux bridge, the plugin must set the net/bridge/bridge-nf-call-iptables
sysctl to 1
to ensure that the iptables proxy functions correctly. If the plugin does not use a Linux bridge (but instead something like Open vSwitch or some other mechanism) it should ensure container traffic is appropriately routed for the proxy.
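For example, a bridge-based plugin could ensure this setting with a command like the following (assuming a Linux host where the br_netfilter module is loaded):
sysctl -w net.bridge.bridge-nf-call-iptables=1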
By default if no kubelet network plugin is specified, the noop
plugin is used, which sets net/bridge/bridge-nf-call-iptables=1
to ensure simple configurations (like Docker with a bridge) work correctly with the iptables proxy.
CNI
The CNI plugin is selected by passing Kubelet the --network-plugin=cni command-line option. Kubelet reads a file from --cni-conf-dir (default /etc/cni/net.d) and uses the CNI configuration from that file to set up each pod's network. The CNI configuration file must match the CNI specification, and any required CNI plugins referenced by the configuration must be present in --cni-bin-dir (default /opt/cni/bin).
If there are multiple CNI configuration files in the directory, the kubelet uses the configuration file that comes first by name in lexicographic order.
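For example, given the following hypothetical contents of /etc/cni/net.d, the kubelet would use 10-mynet.conflist because it sorts first:
ls /etc/cni/net.d
# 10-mynet.conflist    <- used by the kubelet (first in lexicographic order)
# 99-loopback.conf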
In addition to the CNI plugin specified by the configuration file, Kubernetes requires the standard CNI lo plugin, at minimum version 0.2.0.
Support hostPort
The CNI networking plugin supports hostPort. You can use the official portmap plugin offered by the CNI plugin team or use your own plugin with portMapping functionality.
If you want to enable hostPort support, you must specify portMappings capability in your cni-conf-dir.
For example:
{
  "name": "k8s-pod-network",
  "cniVersion": "0.3.0",
  "plugins": [
    {
      "type": "calico",
      "log_level": "info",
      "datastore_type": "kubernetes",
      "nodename": "127.0.0.1",
      "ipam": {
        "type": "host-local",
        "subnet": "usePodCidr"
      },
      "policy": {
        "type": "k8s"
      },
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/calico-kubeconfig"
      }
    },
    {
      "type": "portmap",
      "capabilities": {"portMappings": true}
    }
  ]
}
Support traffic shaping
Experimental Feature
The CNI networking plugin also supports pod ingress and egress traffic shaping. You can use the official bandwidth plugin offered by the CNI plugin team or use your own plugin with bandwidth control functionality.
If you want to enable traffic shaping support, you must add the bandwidth plugin to your CNI configuration file (default /etc/cni/net.d) and ensure that the binary is included in your CNI bin dir (default /opt/cni/bin).
{
  "name": "k8s-pod-network",
  "cniVersion": "0.3.0",
  "plugins": [
    {
      "type": "calico",
      "log_level": "info",
      "datastore_type": "kubernetes",
      "nodename": "127.0.0.1",
      "ipam": {
        "type": "host-local",
        "subnet": "usePodCidr"
      },
      "policy": {
        "type": "k8s"
      },
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/calico-kubeconfig"
      }
    },
    {
      "type": "bandwidth",
      "capabilities": {"bandwidth": true}
    }
  ]
}
Now you can add the kubernetes.io/ingress-bandwidth
and kubernetes.io/egress-bandwidth
annotations to your pod.
For example:
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/ingress-bandwidth: 1M
    kubernetes.io/egress-bandwidth: 1M
...
kubenet
Kubenet is a very basic, simple network plugin, on Linux only. It does not, of itself, implement more advanced features like cross-node networking or network policy. It is typically used together with a cloud provider that sets up routing rules for communication between nodes, or in single-node environments.
Kubenet creates a Linux bridge named cbr0
and creates a veth pair for each pod with the host end of each pair connected to cbr0
. The pod end of the pair is assigned an IP address allocated from a range assigned to the node either through configuration or by the controller-manager. cbr0
is assigned an MTU matching the smallest MTU of an enabled normal interface on the host.
The plugin requires a few things:
- The standard CNI bridge, lo and host-local plugins are required, at minimum version 0.2.0. Kubenet will first search for them in /opt/cni/bin. Specify cni-bin-dir to supply an additional search path. The first match found will take effect.
- Kubelet must be run with the --network-plugin=kubenet argument to enable the plugin.
- Kubelet should also be run with the --non-masquerade-cidr=<clusterCidr> argument to ensure traffic to IPs outside this range will use IP masquerade.
- The node must be assigned an IP subnet through either the --pod-cidr kubelet command-line option or the --allocate-node-cidrs=true --cluster-cidr=<cidr> controller-manager command-line options. (A combined example invocation follows this list.)
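For illustration, a kubelet invocation satisfying these requirements might look like the following; the CIDRs are placeholders for your cluster's actual ranges:
kubelet --network-plugin=kubenet \
  --non-masquerade-cidr=10.0.0.0/8 \
  --pod-cidr=10.180.1.0/24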
Customizing the MTU (with kubenet)
The MTU should always be configured correctly to get the best networking performance. Network plugins will usually try to infer a sensible MTU, but sometimes the logic will not result in an optimal MTU. For example, if the Docker bridge or another interface has a small MTU, kubenet will currently select that MTU. Or if you are using IPSEC encapsulation, the MTU must be reduced, and this calculation is out-of-scope for most network plugins.
Where needed, you can specify the MTU explicitly with the network-plugin-mtu
kubelet option. For example,
on AWS the eth0
MTU is typically 9001, so you might specify --network-plugin-mtu=9001
. If you're using IPSEC you
might reduce it to allow for encapsulation overhead; for example: --network-plugin-mtu=8873
.
This option is provided to the network-plugin; currently only kubenet supports network-plugin-mtu
.
Usage Summary
- --network-plugin=cni specifies that we use the cni network plugin with actual CNI plugin binaries located in --cni-bin-dir (default /opt/cni/bin) and CNI plugin configuration located in --cni-conf-dir (default /etc/cni/net.d).
- --network-plugin=kubenet specifies that we use the kubenet network plugin with CNI bridge, lo and host-local plugins placed in /opt/cni/bin or cni-bin-dir.
- --network-plugin-mtu=9001 specifies the MTU to use, currently only used by the kubenet network plugin.
What's next
3.12.2.2 - Device Plugins
Kubernetes v1.10 [beta]
Kubernetes provides a device plugin framework that you can use to advertise system hardware resources to the Kubelet.
Instead of customizing the code for Kubernetes itself, vendors can implement a device plugin that you deploy either manually or as a DaemonSet. The targeted devices include GPUs, high-performance NICs, FPGAs, InfiniBand adapters, and other similar computing resources that may require vendor specific initialization and setup.
Device plugin registration
The kubelet exports a Registration
gRPC service:
service Registration {
    rpc Register(RegisterRequest) returns (Empty) {}
}
A device plugin can register itself with the kubelet through this gRPC service. During the registration, the device plugin needs to send:
- The name of its Unix socket.
- The Device Plugin API version against which it was built.
- The ResourceName it wants to advertise. Here ResourceName needs to follow the extended resource naming scheme as vendor-domain/resourcetype. (For example, an NVIDIA GPU is advertised as nvidia.com/gpu.)
Following a successful registration, the device plugin sends the kubelet the
list of devices it manages, and the kubelet is then in charge of advertising those
resources to the API server as part of the kubelet node status update.
For example, after a device plugin registers hardware-vendor.example/foo
with the kubelet
and reports two healthy devices on a node, the node status is updated
to advertise that the node has 2 "Foo" devices installed and available.
Then, users can request devices as part of a Pod specification
(see container
).
Requesting extended resources is similar to how you manage requests and limits for
other resources, with the following differences:
- Extended resources are only supported as integer resources and cannot be overcommitted.
- Devices cannot be shared between containers.
Example
Suppose a Kubernetes cluster is running a device plugin that advertises resource hardware-vendor.example/foo
on certain nodes. Here is an example of a pod requesting this resource to run a demo workload:
---
apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
spec:
  containers:
    - name: demo-container-1
      image: k8s.gcr.io/pause:2.0
      resources:
        limits:
          hardware-vendor.example/foo: 2
#
# This Pod needs 2 of the hardware-vendor.example/foo devices
# and can only schedule onto a Node that's able to satisfy
# that need.
#
# If the Node has more than 2 of those devices available, the
# remainder would be available for other Pods to use.
Device plugin implementation
The general workflow of a device plugin includes the following steps:
- Initialization. During this phase, the device plugin performs vendor-specific initialization and setup to make sure the devices are in a ready state.
- The plugin starts a gRPC service, with a Unix socket under host path /var/lib/kubelet/device-plugins/, that implements the following interfaces:

  service DevicePlugin {
        // GetDevicePluginOptions returns options to be communicated with Device Manager.
        rpc GetDevicePluginOptions(Empty) returns (DevicePluginOptions) {}

        // ListAndWatch returns a stream of List of Devices
        // Whenever a Device state change or a Device disappears, ListAndWatch
        // returns the new list
        rpc ListAndWatch(Empty) returns (stream ListAndWatchResponse) {}

        // Allocate is called during container creation so that the Device
        // Plugin can run device specific operations and instruct Kubelet
        // of the steps to make the Device available in the container
        rpc Allocate(AllocateRequest) returns (AllocateResponse) {}

        // GetPreferredAllocation returns a preferred set of devices to allocate
        // from a list of available ones. The resulting preferred allocation is not
        // guaranteed to be the allocation ultimately performed by the
        // devicemanager. It is only designed to help the devicemanager make a more
        // informed allocation decision when possible.
        rpc GetPreferredAllocation(PreferredAllocationRequest) returns (PreferredAllocationResponse) {}

        // PreStartContainer is called, if indicated by Device Plugin during registration phase,
        // before each container start. Device plugin can run device specific operations
        // such as resetting the device before making devices available to the container.
        rpc PreStartContainer(PreStartContainerRequest) returns (PreStartContainerResponse) {}
  }

  Note: Plugins are not required to provide useful implementations for GetPreferredAllocation() or PreStartContainer(). Flags indicating which (if any) of these calls are available should be set in the DevicePluginOptions message sent back by a call to GetDevicePluginOptions(). The kubelet will always call GetDevicePluginOptions() to see which optional functions are available, before calling any of them directly.
- The plugin registers itself with the kubelet through the Unix socket at host path /var/lib/kubelet/device-plugins/kubelet.sock.
- After successfully registering itself, the device plugin runs in serving mode, during which it keeps monitoring device health and reports back to the kubelet upon any device state changes. It is also responsible for serving Allocate gRPC requests. During Allocate, the device plugin may do device-specific preparation; for example, GPU cleanup or QRNG initialization. If the operations succeed, the device plugin returns an AllocateResponse that contains container runtime configurations for accessing the allocated devices. The kubelet passes this information to the container runtime. (A minimal Go sketch of this service follows the list.)
Handling kubelet restarts
A device plugin is expected to detect kubelet restarts and re-register itself with the new
kubelet instance. In the current implementation, a new kubelet instance deletes all the existing Unix sockets
under /var/lib/kubelet/device-plugins
when it starts. A device plugin can monitor the deletion
of its Unix socket and re-register itself upon such an event.
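One common way to implement this is to watch the device-plugins directory with an inotify-based library and re-register when the plugin's own socket disappears. The following is a minimal sketch assuming the github.com/fsnotify/fsnotify library and the same v1beta1 bindings and illustrative names as the sketch above.

// Sketch: detect kubelet restarts by watching for deletion of the plugin's
// socket and re-registering. fsnotify, the socket name foo.sock, and the
// resource name are assumptions carried over from the sketch above.
package main

import (
	"context"

	"github.com/fsnotify/fsnotify"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

// register announces the plugin to the kubelet over kubelet.sock.
func register() error {
	conn, err := grpc.Dial("unix://"+pluginapi.KubeletSocket,
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		return err
	}
	defer conn.Close()
	client := pluginapi.NewRegistrationClient(conn)
	_, err = client.Register(context.Background(), &pluginapi.RegisterRequest{
		Version:      pluginapi.Version,
		Endpoint:     "foo.sock", // relative to /var/lib/kubelet/device-plugins/
		ResourceName: "hardware-vendor.example/foo",
	})
	return err
}

func main() {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		panic(err)
	}
	defer watcher.Close()
	if err := watcher.Add(pluginapi.DevicePluginPath); err != nil {
		panic(err)
	}
	for event := range watcher.Events {
		// A new kubelet instance deletes every socket in this directory on
		// startup; seeing our own socket disappear means we must re-register.
		if event.Op&fsnotify.Remove == fsnotify.Remove &&
			event.Name == pluginapi.DevicePluginPath+"foo.sock" {
			// A real plugin would also restart its gRPC server here.
			if err := register(); err != nil {
				panic(err)
			}
		}
	}
}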
Device plugin deployment
You can deploy a device plugin as a DaemonSet, as a package for your node's operating system, or manually.
The canonical directory /var/lib/kubelet/device-plugins
requires privileged access,
so a device plugin must run in a privileged security context.
If you're deploying a device plugin as a DaemonSet, /var/lib/kubelet/device-plugins
must be mounted as a Volume
in the plugin's
PodSpec.
If you choose the DaemonSet approach, you can rely on Kubernetes to place the device plugin's Pod onto Nodes, to restart the daemon Pod after failure, and to help automate upgrades.
API compatibility
Kubernetes device plugin support is in beta. The API may change before stabilization, in incompatible ways. As a project, Kubernetes recommends that device plugin developers:
- Watch for changes in future releases.
- Support multiple versions of the device plugin API for backward/forward compatibility.
If you enable the DevicePlugins feature and run device plugins on nodes that need to be upgraded to a Kubernetes release with a newer device plugin API version, upgrade your device plugins to support both versions before upgrading these nodes. Taking that approach will ensure the continuous functioning of the device allocations during the upgrade.
Monitoring device plugin resources
Kubernetes v1.15 [beta]
In order to monitor resources provided by device plugins, monitoring agents need to be able to
discover the set of devices that are in-use on the node and obtain metadata to describe which
container the metric should be associated with. Prometheus metrics
exposed by device monitoring agents should follow the
Kubernetes Instrumentation Guidelines,
identifying containers using pod, namespace, and container Prometheus labels.
The kubelet provides a gRPC service to enable discovery of in-use devices, and to provide metadata for these devices:
// PodResourcesLister is a service provided by the kubelet that provides information about the
// node resources consumed by pods and containers on the node
service PodResourcesLister {
rpc List(ListPodResourcesRequest) returns (ListPodResourcesResponse) {}
rpc GetAllocatableResources(AllocatableResourcesRequest) returns (AllocatableResourcesResponse) {}
}
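As a sketch of how a monitoring agent might consume this service, assuming the k8s.io/kubelet/pkg/apis/podresources/v1 Go bindings and the canonical socket path described later in this section:

// Sketch of a monitoring agent querying the kubelet's PodResourcesLister
// service. The bindings and socket path are assumptions stated above.
package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	podresourcesapi "k8s.io/kubelet/pkg/apis/podresources/v1"
)

func main() {
	conn, err := grpc.Dial("unix:///var/lib/kubelet/pod-resources/kubelet.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	client := podresourcesapi.NewPodResourcesListerClient(conn)
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	resp, err := client.List(ctx, &podresourcesapi.ListPodResourcesRequest{})
	if err != nil {
		panic(err)
	}
	for _, pod := range resp.PodResources {
		for _, c := range pod.Containers {
			// Each ContainerDevices entry names the resource and the device
			// IDs assigned to this container, which a monitoring agent can
			// use to build the pod/namespace/container metric labels.
			fmt.Println(pod.Namespace, pod.Name, c.Name, c.Devices)
		}
	}
}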
List gRPC endpoint
The List
endpoint provides information on resources of running pods, with details such as the
id of exclusively allocated CPUs, device id as it was reported by device plugins and id of
the NUMA node where these devices are allocated. Also, for NUMA-based machines, it contains the information about memory and hugepages reserved for a container.
// ListPodResourcesResponse is the response returned by List function
message ListPodResourcesResponse {
repeated PodResources pod_resources = 1;
}
// PodResources contains information about the node resources assigned to a pod
message PodResources {
string name = 1;
string namespace = 2;
repeated ContainerResources containers = 3;
}
// ContainerResources contains information about the resources assigned to a container
message ContainerResources {
string name = 1;
repeated ContainerDevices devices = 2;
repeated int64 cpu_ids = 3;
repeated ContainerMemory memory = 4;
}
// ContainerMemory contains information about memory and hugepages assigned to a container
message ContainerMemory {
string memory_type = 1;
uint64 size = 2;
TopologyInfo topology = 3;
}
// Topology describes hardware topology of the resource
message TopologyInfo {
repeated NUMANode nodes = 1;
}
// NUMA representation of NUMA node
message NUMANode {
int64 ID = 1;
}
// ContainerDevices contains information about the devices assigned to a container
message ContainerDevices {
string resource_name = 1;
repeated string device_ids = 2;
TopologyInfo topology = 3;
}
The cpu_ids in ContainerResources in the List endpoint correspond to exclusive CPUs allocated to a particular container. If the goal is to evaluate the CPUs that belong to the shared pool, the List endpoint needs to be used in conjunction with the GetAllocatableResources endpoint, as explained below:
- Call GetAllocatableResources to get a list of all the allocatable CPUs
- Call GetCpuIds on all ContainerResources in the system
- Subtract out all of the CPUs from the GetCpuIds calls from the GetAllocatableResources call (a sketch of this computation follows the list)
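Continuing the client sketch above, that subtraction might look like the following; allocatable and list are the responses obtained from GetAllocatableResources and List respectively (podresources v1 bindings assumed).

// Sketch of deriving the shared CPU pool, continuing the client above.
func sharedPoolCPUs(allocatable *podresourcesapi.AllocatableResourcesResponse,
	list *podresourcesapi.ListPodResourcesResponse) []int64 {
	// Start from every allocatable CPU on the node.
	shared := make(map[int64]bool, len(allocatable.CpuIds))
	for _, id := range allocatable.CpuIds {
		shared[id] = true
	}
	// Remove every CPU exclusively assigned to some container.
	for _, pod := range list.PodResources {
		for _, c := range pod.Containers {
			for _, id := range c.GetCpuIds() {
				delete(shared, id)
			}
		}
	}
	cpus := make([]int64, 0, len(shared))
	for id := range shared {
		cpus = append(cpus, id)
	}
	return cpus
}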
GetAllocatableResources gRPC endpoint
Kubernetes v1.23 [beta]
GetAllocatableResources provides information on the resources initially available on the worker node. It provides more information than the kubelet exports to the API server.
GetAllocatableResources
should only be used to evaluate allocatable
resources on a node. If the goal is to evaluate free/unallocated resources it should be used in
conjunction with the List() endpoint. The result obtained by GetAllocatableResources remains the same unless the underlying resources exposed to the kubelet change. This happens rarely, but when it does (for example: hotplug/hotunplug, device health changes), the client is expected to call the GetAllocatableResources endpoint again.
However, calling GetAllocatableResources is not sufficient in the case of a CPU and/or memory update; the kubelet needs to be restarted to reflect the correct resource capacity and allocatable.
// AllocatableResourcesResponse contains information about all the devices known by the kubelet
message AllocatableResourcesResponse {
repeated ContainerDevices devices = 1;
repeated int64 cpu_ids = 2;
repeated ContainerMemory memory = 3;
}
Starting from Kubernetes v1.23, GetAllocatableResources is enabled by default. You can disable it by turning off the KubeletPodResourcesGetAllocatable feature gate.
Prior to Kubernetes v1.23, to enable this feature the kubelet must be started with the following flag:
--feature-gates=KubeletPodResourcesGetAllocatable=true
ContainerDevices do expose the topology information declaring to which NUMA cells the device is affine. The NUMA cells are identified using an opaque integer ID, whose value is consistent with what device plugins report when they register themselves to the kubelet.
The gRPC service is served over a Unix socket at /var/lib/kubelet/pod-resources/kubelet.sock.
Monitoring agents for device plugin resources can be deployed as a daemon, or as a DaemonSet.
The canonical directory /var/lib/kubelet/pod-resources
requires privileged access, so monitoring
agents must run in a privileged security context. If a device monitoring agent is running as a
DaemonSet, /var/lib/kubelet/pod-resources
must be mounted as a
Volume in the device monitoring agent's
PodSpec.
Support for the PodResourcesLister service requires the KubeletPodResources feature gate to be enabled.
It is enabled by default starting with Kubernetes 1.15 and is v1 since Kubernetes 1.20.
Device plugin integration with the Topology Manager
Kubernetes v1.18 [beta]
The Topology Manager is a kubelet component that allows resources to be coordinated in a topology-aligned manner. In order to do this, the Device Plugin API was extended to include a TopologyInfo struct.
message TopologyInfo {
repeated NUMANode nodes = 1;
}
message NUMANode {
int64 ID = 1;
}
Device Plugins that wish to leverage the Topology Manager can send back a populated TopologyInfo struct as part of the device registration, along with the device IDs and the health of the device. The device manager will then use this information to consult with the Topology Manager and make resource assignment decisions.
TopologyInfo supports a nodes field that is either nil (the default) or a list of NUMA nodes. This lets the Device Plugin publish a device that can span NUMA nodes.
An example TopologyInfo
struct populated for a device by a Device Plugin:
pluginapi.Device{
    ID:     "25102017",
    Health: pluginapi.Healthy,
    Topology: &pluginapi.TopologyInfo{
        Nodes: []*pluginapi.NUMANode{
            {ID: 0},
        },
    },
}
Device plugin examples
Here are some examples of device plugin implementations:
- The AMD GPU device plugin
- The Intel device plugins for Intel GPU, FPGA, QAT, VPU, SGX, DSA, DLB and IAA devices
- The KubeVirt device plugins for hardware-assisted virtualization
- The NVIDIA GPU device plugin
- Requires nvidia-docker 2.0, which allows you to run GPU-enabled Docker containers.
- The NVIDIA GPU device plugin for Container-Optimized OS
- The RDMA device plugin
- The SocketCAN device plugin
- The Solarflare device plugin
- The SR-IOV Network device plugin
- The Xilinx FPGA device plugins for Xilinx FPGA devices
What's next
- Learn about scheduling GPU resources using device plugins
- Learn about advertising extended resources on a node
- Learn about the Topology Manager
- Read about using hardware acceleration for TLS ingress with Kubernetes
3.12.3 - Operator pattern
Operators are software extensions to Kubernetes that make use of custom resources to manage applications and their components. Operators follow Kubernetes principles, notably the control loop.
Motivation
The Operator pattern aims to capture the key aim of a human operator who is managing a service or set of services. Human operators who look after specific applications and services have deep knowledge of how the system ought to behave, how to deploy it, and how to react if there are problems.
People who run workloads on Kubernetes often like to use automation to take care of repeatable tasks. The Operator pattern captures how you can write code to automate a task beyond what Kubernetes itself provides.
Operators in Kubernetes
Kubernetes is designed for automation. Out of the box, you get lots of built-in automation from the core of Kubernetes. You can use Kubernetes to automate deploying and running workloads, and you can automate how Kubernetes does that.
Kubernetes' operator pattern concept lets you extend the cluster's behaviour without modifying the code of Kubernetes itself by linking controllers to one or more custom resources. Operators are clients of the Kubernetes API that act as controllers for a Custom Resource.
An example Operator
Some of the things that you can use an operator to automate include:
- deploying an application on demand
- taking and restoring backups of that application's state
- handling upgrades of the application code alongside related changes such as database schemas or extra configuration settings
- publishing a Service to applications that don't support Kubernetes APIs to discover them
- simulating failure in all or part of your cluster to test its resilience
- choosing a leader for a distributed application without an internal member election process
What might an Operator look like in more detail? Here's an example:
- A custom resource named SampleDB, that you can configure into the cluster.
- A Deployment that makes sure a Pod is running that contains the controller part of the operator.
- A container image of the operator code.
- Controller code that queries the control plane to find out what SampleDB resources are configured.
- The core of the Operator is code to tell the API server how to make reality match the configured resources (a minimal sketch of such a controller follows this list).
- If you add a new SampleDB, the operator sets up PersistentVolumeClaims to provide durable database storage, a StatefulSet to run SampleDB and a Job to handle initial configuration.
- If you delete it, the Operator takes a snapshot, then makes sure that the StatefulSet and Volumes are also removed.
- The operator also manages regular database backups. For each SampleDB resource, the operator determines when to create a Pod that can connect to the database and take backups. These Pods would rely on a ConfigMap and / or a Secret that has database connection details and credentials.
- Because the Operator aims to provide robust automation for the resource it manages, there would be additional supporting code. For this example, code checks to see if the database is running an old version and, if so, creates Job objects that upgrade it for you.
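As a sketch of what the controller part of such an operator might look like, here is the shape of the reconcile loop written with sigs.k8s.io/controller-runtime, the library underlying kubebuilder. SampleDB is the hypothetical custom resource from the example; manager and scheme wiring is omitted.

// Sketch of the control loop for the hypothetical SampleDB operator, using
// sigs.k8s.io/controller-runtime. Manager setup and registration of the
// SampleDB API type in the scheme are omitted.
package main

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

type SampleDBReconciler struct {
	client.Client // used to read and write Kubernetes objects
}

// Reconcile is called whenever a watched SampleDB (or an object it owns)
// changes; it compares the desired state recorded in the resource with
// reality and converges the two.
func (r *SampleDBReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// 1. Fetch the SampleDB object named by req.NamespacedName.
	// 2. Create or update the PersistentVolumeClaims, StatefulSet, and
	//    initialization Job described in the example above.
	// 3. Schedule backup Pods, check for old database versions, etc.
	return ctrl.Result{}, nil
}

func main() {
	// ctrl.NewManager, the builder wiring that watches SampleDB objects,
	// and mgr.Start are omitted from this sketch.
}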
Deploying Operators
The most common way to deploy an Operator is to add the Custom Resource Definition and its associated Controller to your cluster. The Controller will normally run outside of the control plane, much as you would run any containerized application. For example, you can run the controller in your cluster as a Deployment.
Using an Operator
Once you have an Operator deployed, you'd use it by adding, modifying or deleting the kind of resource that the Operator uses. Following the above example, you would set up a Deployment for the Operator itself, and then:
kubectl get SampleDB # find configured databases
kubectl edit SampleDB/example-database # manually change some settings
…and that's it! The Operator will take care of applying the changes as well as keeping the existing service in good shape.
Writing your own Operator
If there isn't an Operator in the ecosystem that implements the behavior you want, you can code your own.
You can also implement an Operator (that is, a Controller) using any language / runtime that can act as a client for the Kubernetes API.
Following are a few libraries and tools you can use to write your own cloud native Operator.
- Charmed Operator Framework
- Kopf (Kubernetes Operator Pythonic Framework)
- kubebuilder
- KubeOps (.NET operator SDK)
- KUDO (Kubernetes Universal Declarative Operator)
- Metacontroller along with WebHooks that you implement yourself
- Operator Framework
- shell-operator
What's next
- Read the CNCF Operator White Paper.
- Learn more about Custom Resources
- Find ready-made operators on OperatorHub.io to suit your use case
- Publish your operator for other people to use
- Read CoreOS' original article that introduced the Operator pattern (this is an archived version of the original article).
- Read an article from Google Cloud about best practices for building Operators
3.12.4 - Service Catalog
Service Catalog is an extension API that enables applications running in Kubernetes clusters to easily use external managed software offerings, such as a datastore service offered by a cloud provider.
It provides a way to list, provision, and bind with external Managed Services from Service Brokers without needing detailed knowledge about how those services are created or managed.
A service broker, as defined by the Open service broker API spec, is an endpoint for a set of managed services offered and maintained by a third-party, which could be a cloud provider such as AWS, GCP, or Azure. Some examples of managed services are Microsoft Azure Cloud Queue, Amazon Simple Queue Service, and Google Cloud Pub/Sub, but they can be any software offering that can be used by an application.
Using Service Catalog, a cluster operator can browse the list of managed services offered by a service broker, provision an instance of a managed service, and bind with it to make it available to an application in the Kubernetes cluster.
Example use case
An application developer wants to use message queuing as part of their application running in a Kubernetes cluster. However, they do not want to deal with the overhead of setting such a service up and administering it themselves. Fortunately, there is a cloud provider that offers message queuing as a managed service through its service broker.
A cluster operator can set up Service Catalog and use it to communicate with the cloud provider's service broker to provision an instance of the message queuing service and make it available to the application within the Kubernetes cluster. The application developer therefore does not need to be concerned with the implementation details or management of the message queue. The application can access the message queue as a service.
Architecture
Service Catalog uses the Open service broker API to communicate with service brokers, acting as an intermediary for the Kubernetes API Server to negotiate the initial provisioning and retrieve the credentials necessary for the application to use a managed service.
It is implemented using a CRD-based architecture.
API Resources
Service Catalog installs the servicecatalog.k8s.io
API and provides the following Kubernetes resources:
- ClusterServiceBroker: An in-cluster representation of a service broker, encapsulating its server connection details. These are created and managed by cluster operators who wish to use that broker server to make new types of managed services available within their cluster.
- ClusterServiceClass: A managed service offered by a particular service broker. When a new ClusterServiceBroker resource is added to the cluster, the Service Catalog controller connects to the service broker to obtain a list of available managed services. It then creates a new ClusterServiceClass resource corresponding to each managed service.
- ClusterServicePlan: A specific offering of a managed service. For example, a managed service may have different plans available, such as a free tier or paid tier, or it may have different configuration options, such as using SSD storage or having more resources. Similar to ClusterServiceClass, when a new ClusterServiceBroker is added to the cluster, Service Catalog creates a new ClusterServicePlan resource corresponding to each Service Plan available for each managed service.
- ServiceInstance: A provisioned instance of a ClusterServiceClass. These are created by cluster operators to make a specific instance of a managed service available for use by one or more in-cluster applications. When a new ServiceInstance resource is created, the Service Catalog controller connects to the appropriate service broker and instructs it to provision the service instance.
- ServiceBinding: Access credentials to a ServiceInstance. These are created by cluster operators who want their applications to make use of a ServiceInstance. Upon creation, the Service Catalog controller creates a Kubernetes Secret containing connection details and credentials for the Service Instance, which can be mounted into Pods.
Authentication
Service Catalog supports these methods of authentication:
- Basic (username/password)
- OAuth 2.0 Bearer Token
Usage
A cluster operator can use Service Catalog API Resources to provision managed services and make them available within a Kubernetes cluster. The steps involved are:
- Listing the managed services and Service Plans available from a service broker.
- Provisioning a new instance of the managed service.
- Binding to the managed service, which returns the connection credentials.
- Mapping the connection credentials into the application.
Listing managed services and Service Plans
First, a cluster operator must create a ClusterServiceBroker
resource within the servicecatalog.k8s.io
group. This resource contains the URL and connection details necessary to access a service broker endpoint.
This is an example of a ClusterServiceBroker
resource:
apiVersion: servicecatalog.k8s.io/v1beta1
kind: ClusterServiceBroker
metadata:
  name: cloud-broker
spec:
  # Points to the endpoint of a service broker. (This example is not a working URL.)
  url: https://servicebroker.somecloudprovider.com/v1alpha1/projects/service-catalog/brokers/default
  #####
  # Additional values can be added here, which may be used to communicate
  # with the service broker, such as bearer token info or a caBundle for TLS.
  #####
The following is a sequence diagram illustrating the steps involved in listing managed services and Plans available from a service broker:
- Once the ClusterServiceBroker resource is added to Service Catalog, it triggers a call to the external service broker for a list of available services.
- The service broker returns a list of available managed services and a list of Service Plans, which are cached locally as ClusterServiceClass and ClusterServicePlan resources respectively.
- A cluster operator can then get the list of available managed services using the following command:
kubectl get clusterserviceclasses -o=custom-columns=SERVICE\ NAME:.metadata.name,EXTERNAL\ NAME:.spec.externalName
It should output a list of service names with a format similar to:
SERVICE NAME                           EXTERNAL NAME
4f6e6cf6-ffdd-425f-a2c7-3c9258ad2468   cloud-provider-service
...                                    ...
They can also view the Service Plans available using the following command:
kubectl get clusterserviceplans -o=custom-columns=PLAN\ NAME:.metadata.name,EXTERNAL\ NAME:.spec.externalName
It should output a list of plan names with a format similar to:
PLAN NAME                              EXTERNAL NAME
86064792-7ea2-467b-af93-ac9694d96d52   service-plan-name
...                                    ...
Provisioning a new instance
A cluster operator can initiate the provisioning of a new instance by creating a ServiceInstance
resource.
This is an example of a ServiceInstance
resource:
apiVersion: servicecatalog.k8s.io/v1beta1
kind: ServiceInstance
metadata:
  name: cloud-queue-instance
  namespace: cloud-apps
spec:
  # References one of the previously returned services
  clusterServiceClassExternalName: cloud-provider-service
  clusterServicePlanExternalName: service-plan-name
  #####
  # Additional parameters can be added here,
  # which may be used by the service broker.
  #####
The following sequence diagram illustrates the steps involved in provisioning a new instance of a managed service:
- When the ServiceInstance resource is created, Service Catalog initiates a call to the external service broker to provision an instance of the service.
- The service broker creates a new instance of the managed service and returns an HTTP response.
- A cluster operator can then check the status of the instance to see if it is ready.
Binding to a managed service
After a new instance has been provisioned, a cluster operator must bind to the managed service to get the connection credentials and service account details necessary for the application to use the service. This is done by creating a ServiceBinding
resource.
The following is an example of a ServiceBinding
resource:
apiVersion: servicecatalog.k8s.io/v1beta1
kind: ServiceBinding
metadata:
  name: cloud-queue-binding
  namespace: cloud-apps
spec:
  instanceRef:
    name: cloud-queue-instance
  #####
  # Additional information can be added here, such as a secretName or
  # service account parameters, which may be used by the service broker.
  #####
The following sequence diagram illustrates the steps involved in binding to a managed service instance:
- After the ServiceBinding is created, Service Catalog makes a call to the external service broker requesting the information necessary to bind with the service instance.
- The service broker enables the application permissions/roles for the appropriate service account.
- The service broker returns the information necessary to connect and access the managed service instance. This is provider and service-specific, so the information returned may differ between Service Providers and their managed services.
Mapping the connection credentials
After binding, the final step involves mapping the connection credentials and service-specific information into the application. These pieces of information are stored in secrets that the application in the cluster can access and use to connect directly with the managed service.
Pod configuration file
One method to perform this mapping is to use a declarative Pod configuration.
The following example describes how to map service account credentials into the application. A key called sa-key
is stored in a volume named provider-cloud-key
, and the application mounts this volume at /var/secrets/provider/key.json
. The environment variable PROVIDER_APPLICATION_CREDENTIALS
is mapped from the value of the mounted file.
...
spec:
  volumes:
    - name: provider-cloud-key
      secret:
        secretName: sa-key
  containers:
    ...
      volumeMounts:
        - name: provider-cloud-key
          mountPath: /var/secrets/provider
      env:
        - name: PROVIDER_APPLICATION_CREDENTIALS
          value: "/var/secrets/provider/key.json"
The following example describes how to map secret values into application environment variables. In this example, the messaging queue topic name is mapped from a secret named provider-queue-credentials
with a key named topic
to the environment variable TOPIC
.
...
env:
  - name: "TOPIC"
    valueFrom:
      secretKeyRef:
        name: provider-queue-credentials
        key: topic
What's next
- If you are familiar with Helm Charts, install Service Catalog using Helm into your Kubernetes cluster. Alternatively, you can install Service Catalog using the SC tool.
- View sample service brokers.
- Explore the kubernetes-sigs/service-catalog project.
4 - Tasks
This section of the Kubernetes documentation contains pages that show how to do individual tasks. A task page shows how to do a single thing, typically by giving a short sequence of steps.
If you would like to write a task page, see Creating a Documentation Pull Request.
4.1 - Install Tools
kubectl
The Kubernetes command-line tool, kubectl, allows
you to run commands against Kubernetes clusters.
You can use kubectl to deploy applications, inspect and manage cluster resources,
and view logs. For more information including a complete list of kubectl operations, see the
kubectl
reference documentation.
kubectl is installable on a variety of Linux platforms, macOS and Windows. Find your preferred operating system below.
kind
kind
lets you run Kubernetes on
your local computer. This tool requires that you have
Docker installed and configured.
The kind Quick Start page shows you what you need to do to get up and running with kind.
minikube
Like kind
, minikube
is a tool that lets you run Kubernetes
locally. minikube
runs a single-node Kubernetes cluster on your personal
computer (including Windows, macOS and Linux PCs) so that you can try out
Kubernetes, or for daily development work.
You can follow the official Get Started! guide if your focus is on getting the tool installed.
View minikube Get Started! Guide
Once you have minikube
working, you can use it to
run a sample application.
kubeadm
You can use the kubeadm tool to create and manage Kubernetes clusters. It performs the actions necessary to get a minimum viable, secure cluster up and running in a user friendly way.
Installing kubeadm shows you how to install kubeadm. Once installed, you can use it to create a cluster.
4.1.1 - Install and Set Up kubectl on Linux
Before you begin
You must use a kubectl version that is within one minor version difference of your cluster. For example, a v1.23 client can communicate with v1.22, v1.23, and v1.24 control planes. Using the latest compatible version of kubectl helps avoid unforeseen issues.
Install kubectl on Linux
The following methods exist for installing kubectl on Linux:
- Install kubectl binary with curl on Linux
- Install using native package management
- Install using other package management
Install kubectl binary with curl on Linux
-
Download the latest release with the command:
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
Note: To download a specific version, replace the $(curl -L -s https://dl.k8s.io/release/stable.txt) portion of the command with the specific version. For example, to download version v1.23.0 on Linux, type:
curl -LO https://dl.k8s.io/release/v1.23.0/bin/linux/amd64/kubectl
-
Validate the binary (optional)
Download the kubectl checksum file:
curl -LO "https://dl.k8s.io/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl.sha256"
Validate the kubectl binary against the checksum file:
echo "$(cat kubectl.sha256) kubectl" | sha256sum --check
If valid, the output is:
kubectl: OK
If the check fails,
sha256
exits with nonzero status and prints output similar to:
kubectl: FAILED
sha256sum: WARNING: 1 computed checksum did NOT match
Note: Download the same version of the binary and checksum.
-
Install kubectl
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
Note: If you do not have root access on the target system, you can still install kubectl to the ~/.local/bin directory:
chmod +x kubectl
mkdir -p ~/.local/bin
mv ./kubectl ~/.local/bin/kubectl
# and then append (or prepend) ~/.local/bin to $PATH
-
Test to ensure the version you installed is up-to-date:
kubectl version --client
Or use this for a detailed view of the version:
kubectl version --client --output=yaml
Install using native package management
-
Update the apt package index and install packages needed to use the Kubernetes apt repository:
sudo apt-get update
sudo apt-get install -y apt-transport-https ca-certificates curl
-
Download the Google Cloud public signing key:
sudo curl -fsSLo /usr/share/keyrings/kubernetes-archive-keyring.gpg https://packages.cloud.google.com/apt/doc/apt-key.gpg
-
Add the Kubernetes apt repository:
echo "deb [signed-by=/usr/share/keyrings/kubernetes-archive-keyring.gpg] https://apt.kubernetes.io/ kubernetes-xenial main" | sudo tee /etc/apt/sources.list.d/kubernetes.list
-
Update the apt package index with the new repository and install kubectl:
sudo apt-get update
sudo apt-get install -y kubectl
cat <<EOF | sudo tee /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://packages.cloud.google.com/yum/repos/kubernetes-el7-x86_64
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://packages.cloud.google.com/yum/doc/yum-key.gpg https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg
EOF
sudo yum install -y kubectl
Install using other package management
If you are on Ubuntu or another Linux distribution that supports the snap package manager, kubectl is available as a snap application.
snap install kubectl --classic
kubectl version --client
If you are on Linux and using the Homebrew package manager, kubectl is available for installation.
brew install kubectl
kubectl version --client
Verify kubectl configuration
In order for kubectl to find and access a Kubernetes cluster, it needs a
kubeconfig file,
which is created automatically when you create a cluster using
kube-up.sh
or successfully deploy a Minikube cluster.
By default, kubectl configuration is located at ~/.kube/config
.
Check that kubectl is properly configured by getting the cluster state:
kubectl cluster-info
If you see a URL response, kubectl is correctly configured to access your cluster.
If you see a message similar to the following, kubectl is not configured correctly or is not able to connect to a Kubernetes cluster.
The connection to the server <server-name:port> was refused - did you specify the right host or port?
For example, if you are intending to run a Kubernetes cluster on your laptop (locally), you will need a tool like Minikube to be installed first and then re-run the commands stated above.
If kubectl cluster-info returns the URL response but you can't access your cluster, check whether it is configured properly by using:
kubectl cluster-info dump
Optional kubectl configurations and plugins
Enable shell autocompletion
kubectl provides autocompletion support for Bash, Zsh, Fish, and PowerShell, which can save you a lot of typing.
Below are the procedures to set up autocompletion for Bash, Fish, and Zsh.
Introduction
The kubectl completion script for Bash can be generated with the command kubectl completion bash
. Sourcing the completion script in your shell enables kubectl autocompletion.
However, the completion script depends on bash-completion, which means that you have to install this software first (you can test if you have bash-completion already installed by running type _init_completion
).
Install bash-completion
bash-completion is provided by many package managers (see here). You can install it with apt-get install bash-completion
or yum install bash-completion
, etc.
The above commands create /usr/share/bash-completion/bash_completion
, which is the main script of bash-completion. Depending on your package manager, you have to manually source this file in your ~/.bashrc
file.
To find out, reload your shell and run type _init_completion
. If the command succeeds, you're already set, otherwise add the following to your ~/.bashrc
file:
source /usr/share/bash-completion/bash_completion
Reload your shell and verify that bash-completion is correctly installed by typing type _init_completion
.
Enable kubectl autocompletion
Bash
You now need to ensure that the kubectl completion script gets sourced in all your shell sessions. There are two ways in which you can do this:
- Source the completion script in your ~/.bashrc file:
  echo 'source <(kubectl completion bash)' >>~/.bashrc
- Add the completion script to the /etc/bash_completion.d directory:
  kubectl completion bash | sudo tee /etc/bash_completion.d/kubectl > /dev/null
If you have an alias for kubectl, you can extend shell completion to work with that alias:
echo 'alias k=kubectl' >>~/.bashrc
echo 'complete -F __start_kubectl k' >>~/.bashrc
Note: bash-completion sources all completion scripts in /etc/bash_completion.d.
Both approaches are equivalent. After reloading your shell, kubectl autocompletion should be working.
The kubectl completion script for Fish can be generated with the command kubectl completion fish
. Sourcing the completion script in your shell enables kubectl autocompletion.
To do so in all your shell sessions, add the following line to your ~/.config/fish/config.fish
file:
kubectl completion fish | source
After reloading your shell, kubectl autocompletion should be working.
The kubectl completion script for Zsh can be generated with the command kubectl completion zsh
. Sourcing the completion script in your shell enables kubectl autocompletion.
To do so in all your shell sessions, add the following to your ~/.zshrc
file:
source <(kubectl completion zsh)
If you have an alias for kubectl, kubectl autocompletion will automatically work with it.
After reloading your shell, kubectl autocompletion should be working.
If you get an error like 2: command not found: compdef
, then add the following to the beginning of your ~/.zshrc
file:
autoload -Uz compinit
compinit
Install kubectl convert
plugin
A plugin for Kubernetes command-line tool kubectl
, which allows you to convert manifests between different API
versions. This can be particularly helpful to migrate manifests to a non-deprecated API version with a newer Kubernetes release.
For more info, visit migrate to non deprecated apis
-
Download the latest release with the command:
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl-convert"
-
Validate the binary (optional)
Download the kubectl-convert checksum file:
curl -LO "https://dl.k8s.io/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl-convert.sha256"
Validate the kubectl-convert binary against the checksum file:
echo "$(cat kubectl-convert.sha256) kubectl-convert" | sha256sum --check
If valid, the output is:
kubectl-convert: OK
If the check fails,
sha256
exits with nonzero status and prints output similar to:
kubectl-convert: FAILED
sha256sum: WARNING: 1 computed checksum did NOT match
Note: Download the same version of the binary and checksum.
-
Install kubectl-convert
sudo install -o root -g root -m 0755 kubectl-convert /usr/local/bin/kubectl-convert
-
Verify plugin is successfully installed
kubectl convert --help
If you do not see an error, it means the plugin is successfully installed.
What's next
- Install Minikube
- See the getting started guides for more about creating clusters.
- Learn how to launch and expose your application.
- If you need access to a cluster you didn't create, see the Sharing Cluster Access document.
- Read the kubectl reference docs
4.1.2 - Install and Set Up kubectl on macOS
Before you begin
You must use a kubectl version that is within one minor version difference of your cluster. For example, a v1.23 client can communicate with v1.22, v1.23, and v1.24 control planes. Using the latest compatible version of kubectl helps avoid unforeseen issues.
Install kubectl on macOS
The following methods exist for installing kubectl on macOS:
- Install kubectl binary with curl on macOS
- Install with Homebrew on macOS
- Install with Macports on macOS
Install kubectl binary with curl on macOS
-
Download the latest release:
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/darwin/amd64/kubectl"
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/darwin/arm64/kubectl"
Note: To download a specific version, replace the $(curl -L -s https://dl.k8s.io/release/stable.txt) portion of the command with the specific version. For example, to download version v1.23.0 on Intel macOS, type:
curl -LO "https://dl.k8s.io/release/v1.23.0/bin/darwin/amd64/kubectl"
And for macOS on Apple Silicon, type:
curl -LO "https://dl.k8s.io/release/v1.23.0/bin/darwin/arm64/kubectl"
-
Validate the binary (optional)
Download the kubectl checksum file:
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/darwin/amd64/kubectl.sha256"
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/darwin/arm64/kubectl.sha256"
Validate the kubectl binary against the checksum file:
echo "$(cat kubectl.sha256) kubectl" | shasum -a 256 --check
If valid, the output is:
kubectl: OK
If the check fails,
shasum
exits with nonzero status and prints output similar to:
kubectl: FAILED
shasum: WARNING: 1 computed checksum did NOT match
Note: Download the same version of the binary and checksum.
-
Make the kubectl binary executable.
chmod +x ./kubectl
-
Move the kubectl binary to a file location on your system PATH.
sudo mv ./kubectl /usr/local/bin/kubectl
sudo chown root: /usr/local/bin/kubectl
Note: Make sure /usr/local/bin is in your PATH environment variable.
-
Test to ensure the version you installed is up-to-date:
kubectl version --client
Or use this for a detailed view of the version:
kubectl version --client --output=yaml
Install with Homebrew on macOS
If you are on macOS and using the Homebrew package manager, you can install kubectl with Homebrew.
-
Run the installation command:
brew install kubectl
or
brew install kubernetes-cli
-
Test to ensure the version you installed is up-to-date:
kubectl version --client
Install with Macports on macOS
If you are on macOS and using the Macports package manager, you can install kubectl with Macports.
-
Run the installation command:
sudo port selfupdate
sudo port install kubectl
-
Test to ensure the version you installed is up-to-date:
kubectl version --client
Verify kubectl configuration
In order for kubectl to find and access a Kubernetes cluster, it needs a
kubeconfig file,
which is created automatically when you create a cluster using
kube-up.sh
or successfully deploy a Minikube cluster.
By default, kubectl configuration is located at ~/.kube/config
.
Check that kubectl is properly configured by getting the cluster state:
kubectl cluster-info
If you see a URL response, kubectl is correctly configured to access your cluster.
If you see a message similar to the following, kubectl is not configured correctly or is not able to connect to a Kubernetes cluster.
The connection to the server <server-name:port> was refused - did you specify the right host or port?
For example, if you are intending to run a Kubernetes cluster on your laptop (locally), you will need a tool like Minikube to be installed first and then re-run the commands stated above.
If kubectl cluster-info returns the url response but you can't access your cluster, to check whether it is configured properly, use:
kubectl cluster-info dump
Optional kubectl configurations and plugins
Enable shell autocompletion
kubectl provides autocompletion support for Bash, Zsh, Fish, and PowerShell which can save you a lot of typing.
Below are the procedures to set up autocompletion for Bash, Fish, and Zsh.
Introduction
The kubectl completion script for Bash can be generated with kubectl completion bash
. Sourcing this script in your shell enables kubectl completion.
However, the kubectl completion script depends on bash-completion, which you thus have to install first.
Upgrade Bash
The instructions here assume you use Bash 4.1+. You can check your Bash's version by running:
echo $BASH_VERSION
If it is too old, you can install/upgrade it using Homebrew:
brew install bash
Reload your shell and verify that the desired version is being used:
echo $BASH_VERSION $SHELL
Homebrew usually installs it at /usr/local/bin/bash
.
Install bash-completion
You can test if you have bash-completion v2 already installed with type _init_completion
. If not, you can install it with Homebrew:
brew install bash-completion@2
As stated in the output of this command, add the following to your ~/.bash_profile
file:
export BASH_COMPLETION_COMPAT_DIR="/usr/local/etc/bash_completion.d"
[[ -r "/usr/local/etc/profile.d/bash_completion.sh" ]] && . "/usr/local/etc/profile.d/bash_completion.sh"
Reload your shell and verify that bash-completion v2 is correctly installed with type _init_completion
.
Enable kubectl autocompletion
You now have to ensure that the kubectl completion script gets sourced in all your shell sessions. There are multiple ways to achieve this:
-
Source the completion script in your
~/.bash_profile
file:
echo 'source <(kubectl completion bash)' >>~/.bash_profile
-
Add the completion script to the
/usr/local/etc/bash_completion.d
directory:
kubectl completion bash >/usr/local/etc/bash_completion.d/kubectl
-
If you have an alias for kubectl, you can extend shell completion to work with that alias:
echo 'alias k=kubectl' >>~/.bash_profile
echo 'complete -F __start_kubectl k' >>~/.bash_profile
-
If you installed kubectl with Homebrew (as explained here), then the kubectl completion script should already be in
/usr/local/etc/bash_completion.d/kubectl
. In that case, you don't need to do anything.
Note: The Homebrew installation of bash-completion v2 sources all the files in the BASH_COMPLETION_COMPAT_DIR directory; that's why the latter two methods work.
In any case, after reloading your shell, kubectl completion should be working.
The kubectl completion script for Fish can be generated with the command kubectl completion fish
. Sourcing the completion script in your shell enables kubectl autocompletion.
To do so in all your shell sessions, add the following line to your ~/.config/fish/config.fish
file:
kubectl completion fish | source
After reloading your shell, kubectl autocompletion should be working.
The kubectl completion script for Zsh can be generated with the command kubectl completion zsh
. Sourcing the completion script in your shell enables kubectl autocompletion.
To do so in all your shell sessions, add the following to your ~/.zshrc
file:
source <(kubectl completion zsh)
If you have an alias for kubectl, kubectl autocompletion will automatically work with it.
After reloading your shell, kubectl autocompletion should be working.
If you get an error like 2: command not found: compdef
, then add the following to the beginning of your ~/.zshrc
file:
autoload -Uz compinit
compinit
Install kubectl convert
plugin
A plugin for Kubernetes command-line tool kubectl
, which allows you to convert manifests between different API
versions. This can be particularly helpful to migrate manifests to a non-deprecated API version with a newer Kubernetes release.
For more info, visit migrate to non deprecated apis
-
Download the latest release with the command:
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/darwin/amd64/kubectl-convert"
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/darwin/arm64/kubectl-convert"
-
Validate the binary (optional)
Download the kubectl-convert checksum file:
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/darwin/amd64/kubectl-convert.sha256"
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/darwin/arm64/kubectl-convert.sha256"
Validate the kubectl-convert binary against the checksum file:
echo "$(cat kubectl-convert.sha256) kubectl-convert" | shasum -a 256 --check
If valid, the output is:
kubectl-convert: OK
If the check fails,
shasum
exits with nonzero status and prints output similar to:
kubectl-convert: FAILED
shasum: WARNING: 1 computed checksum did NOT match
Note: Download the same version of the binary and checksum.
-
Make kubectl-convert binary executable
chmod +x ./kubectl-convert
-
Move the kubectl-convert binary to a file location on your system PATH.
sudo mv ./kubectl-convert /usr/local/bin/kubectl-convert
sudo chown root: /usr/local/bin/kubectl-convert
Note: Make sure /usr/local/bin is in your PATH environment variable.
-
Verify plugin is successfully installed
kubectl convert --help
If you do not see an error, it means the plugin is successfully installed.
What's next
- Install Minikube
- See the getting started guides for more about creating clusters.
- Learn how to launch and expose your application.
- If you need access to a cluster you didn't create, see the Sharing Cluster Access document.
- Read the kubectl reference docs
4.1.3 - Install and Set Up kubectl on Windows
Before you begin
You must use a kubectl version that is within one minor version difference of your cluster. For example, a v1.23 client can communicate with v1.22, v1.23, and v1.24 control planes. Using the latest compatible version of kubectl helps avoid unforeseen issues.
Install kubectl on Windows
The following methods exist for installing kubectl on Windows:
Install kubectl binary with curl on Windows
-
Download the latest release v1.23.0.
Or if you have
curl
installed, use this command:curl -LO "https://dl.k8s.io/release/v1.23.0/bin/windows/amd64/kubectl.exe"
Note: To find out the latest stable version (for example, for scripting), take a look at https://dl.k8s.io/release/stable.txt. -
Validate the binary (optional)
Download the
kubectl
checksum file:
curl -LO "https://dl.k8s.io/v1.23.0/bin/windows/amd64/kubectl.exe.sha256"
Validate the
kubectl
binary against the checksum file:
-
Using Command Prompt to manually compare CertUtil's output to the checksum file downloaded:
CertUtil -hashfile kubectl.exe SHA256
type kubectl.exe.sha256
-
Using PowerShell to automate the verification using the -eq operator to get a True or False result:
$($(CertUtil -hashfile .\kubectl.exe SHA256)[1] -replace " ", "") -eq $(type .\kubectl.exe.sha256)
-
-
Append or prepend the kubectl binary folder to your PATH environment variable.
-
Test to ensure the version of
kubectl
is the same as downloaded:kubectl version --client
Or use this for a detailed view of the version:
kubectl version --client --output=yaml
Note: Docker Desktop for Windows adds its own version of kubectl to PATH. If you have installed Docker Desktop before, you may need to place your PATH entry before the one added by the Docker Desktop installer or remove the Docker Desktop's kubectl.
Install on Windows using Chocolatey or Scoop
-
To install kubectl on Windows you can use either Chocolatey package manager or Scoop command-line installer.
choco install kubernetes-cli
scoop install kubectl
-
Test to ensure the version you installed is up-to-date:
kubectl version --client
-
Navigate to your home directory:
# If you're using cmd.exe, run: cd %USERPROFILE%
cd ~
-
Create the
.kube
directory:
mkdir .kube
-
Change to the
.kube
directory you just created:
cd .kube
-
Configure kubectl to use a remote Kubernetes cluster:
New-Item config -type file
Verify kubectl configuration
In order for kubectl to find and access a Kubernetes cluster, it needs a
kubeconfig file,
which is created automatically when you create a cluster using
kube-up.sh
or successfully deploy a Minikube cluster.
By default, kubectl configuration is located at ~/.kube/config
.
Check that kubectl is properly configured by getting the cluster state:
kubectl cluster-info
If you see a URL response, kubectl is correctly configured to access your cluster.
If you see a message similar to the following, kubectl is not configured correctly or is not able to connect to a Kubernetes cluster.
The connection to the server <server-name:port> was refused - did you specify the right host or port?
For example, if you are intending to run a Kubernetes cluster on your laptop (locally), you will need a tool like Minikube to be installed first and then re-run the commands stated above.
If kubectl cluster-info returns the URL response but you can't access your cluster, check whether it is configured properly by using:
kubectl cluster-info dump
Optional kubectl configurations and plugins
Enable shell autocompletion
kubectl provides autocompletion support for Bash, Zsh, Fish, and PowerShell, which can save you a lot of typing.
Below are the procedures to set up autocompletion for PowerShell.
The kubectl completion script for PowerShell can be generated with the command kubectl completion powershell
.
To do so in all your shell sessions, add the following line to your $PROFILE
file:
kubectl completion powershell | Out-String | Invoke-Expression
This command will regenerate the auto-completion script on every PowerShell start up. You can also add the generated script directly to your $PROFILE
file.
To add the generated script to your $PROFILE
file, run the following line in your PowerShell prompt:
kubectl completion powershell >> $PROFILE
After reloading your shell, kubectl autocompletion should be working.
Install kubectl convert
plugin
A plugin for Kubernetes command-line tool kubectl
, which allows you to convert manifests between different API
versions. This can be particularly helpful to migrate manifests to a non-deprecated API version with a newer Kubernetes release.
For more info, visit migrate to non deprecated apis
-
Download the latest release with the command:
curl -LO "https://dl.k8s.io/release/v1.23.0/bin/windows/amd64/kubectl-convert.exe"
-
Validate the binary (optional)
Download the
kubectl-convert
checksum file:
curl -LO "https://dl.k8s.io/v1.23.0/bin/windows/amd64/kubectl-convert.exe.sha256"
Validate the
kubectl-convert
binary against the checksum file:
-
Using Command Prompt to manually compare CertUtil's output to the checksum file downloaded:
CertUtil -hashfile kubectl-convert.exe SHA256
type kubectl-convert.exe.sha256
-
Using PowerShell to automate the verification using the -eq operator to get a True or False result:
$($(CertUtil -hashfile .\kubectl-convert.exe SHA256)[1] -replace " ", "") -eq $(type .\kubectl-convert.exe.sha256)
-
-
Append or prepend the kubectl-convert binary folder to your PATH environment variable.
-
Verify plugin is successfully installed
kubectl convert --help
If you do not see an error, it means the plugin is successfully installed.
What's next
- Install Minikube
- See the getting started guides for more about creating clusters.
- Learn how to launch and expose your application.
- If you need access to a cluster you didn't create, see the Sharing Cluster Access document.
- Read the kubectl reference docs
4.1.4 - Tools Included
4.1.4.1 - bash auto-completion on Linux
Introduction
The kubectl completion script for Bash can be generated with the command kubectl completion bash
. Sourcing the completion script in your shell enables kubectl autocompletion.
However, the completion script depends on bash-completion, which means that you have to install this software first (you can test if you have bash-completion already installed by running type _init_completion
).
Install bash-completion
bash-completion is provided by many package managers (see here). You can install it with apt-get install bash-completion
or yum install bash-completion
, etc.
The above commands create /usr/share/bash-completion/bash_completion
, which is the main script of bash-completion. Depending on your package manager, you have to manually source this file in your ~/.bashrc
file.
To find out, reload your shell and run type _init_completion
. If the command succeeds, you're already set, otherwise add the following to your ~/.bashrc
file:
source /usr/share/bash-completion/bash_completion
Reload your shell and verify that bash-completion is correctly installed by typing type _init_completion
.
Enable kubectl autocompletion
Bash
You now need to ensure that the kubectl completion script gets sourced in all your shell sessions. There are two ways in which you can do this:
- Source the completion script in your ~/.bashrc file:
  echo 'source <(kubectl completion bash)' >>~/.bashrc
- Add the completion script to the /etc/bash_completion.d directory:
  kubectl completion bash | sudo tee /etc/bash_completion.d/kubectl > /dev/null
If you have an alias for kubectl, you can extend shell completion to work with that alias:
echo 'alias k=kubectl' >>~/.bashrc
echo 'complete -F __start_kubectl k' >>~/.bashrc
Note: bash-completion sources all completion scripts in /etc/bash_completion.d.
Both approaches are equivalent. After reloading your shell, kubectl autocompletion should be working.
4.1.4.2 - bash auto-completion on macOS
Introduction
The kubectl completion script for Bash can be generated with kubectl completion bash
. Sourcing this script in your shell enables kubectl completion.
However, the kubectl completion script depends on bash-completion, which you thus have to install first.
Upgrade Bash
The instructions here assume you use Bash 4.1+. You can check your Bash's version by running:
echo $BASH_VERSION
If it is too old, you can install/upgrade it using Homebrew:
brew install bash
Reload your shell and verify that the desired version is being used:
echo $BASH_VERSION $SHELL
Homebrew usually installs it at /usr/local/bin/bash
.
Install bash-completion
You can test if you have bash-completion v2 already installed with type _init_completion
. If not, you can install it with Homebrew:
brew install bash-completion@2
As stated in the output of this command, add the following to your ~/.bash_profile
file:
export BASH_COMPLETION_COMPAT_DIR="/usr/local/etc/bash_completion.d"
[[ -r "/usr/local/etc/profile.d/bash_completion.sh" ]] && . "/usr/local/etc/profile.d/bash_completion.sh"
Reload your shell and verify that bash-completion v2 is correctly installed with type _init_completion
.
Enable kubectl autocompletion
You now have to ensure that the kubectl completion script gets sourced in all your shell sessions. There are multiple ways to achieve this:
-
Source the completion script in your
~/.bash_profile
file:
echo 'source <(kubectl completion bash)' >>~/.bash_profile
-
Add the completion script to the
/usr/local/etc/bash_completion.d
directory:
kubectl completion bash >/usr/local/etc/bash_completion.d/kubectl
-
If you have an alias for kubectl, you can extend shell completion to work with that alias:
echo 'alias k=kubectl' >>~/.bash_profile
echo 'complete -F __start_kubectl k' >>~/.bash_profile
-
If you installed kubectl with Homebrew (as explained here), then the kubectl completion script should already be in
/usr/local/etc/bash_completion.d/kubectl
. In that case, you don't need to do anything.
Note: The Homebrew installation of bash-completion v2 sources all the files in the BASH_COMPLETION_COMPAT_DIR directory; that's why the latter two methods work.
In any case, after reloading your shell, kubectl completion should be working.
4.1.4.3 - fish auto-completion
The kubectl completion script for Fish can be generated with the command kubectl completion fish
. Sourcing the completion script in your shell enables kubectl autocompletion.
To do so in all your shell sessions, add the following line to your ~/.config/fish/config.fish
file:
kubectl completion fish | source
After reloading your shell, kubectl autocompletion should be working.
4.1.4.4 - kubectl-convert overview
A plugin for Kubernetes command-line tool kubectl
, which allows you to convert manifests between different API
versions. This can be particularly helpful to migrate manifests to a non-deprecated API version with a newer Kubernetes release.
For more info, visit migrate to non deprecated apis
4.1.4.5 - PowerShell auto-completion
The kubectl completion script for PowerShell can be generated with the command kubectl completion powershell.
To do so in all your shell sessions, add the following line to your $PROFILE file:
kubectl completion powershell | Out-String | Invoke-Expression
This command will regenerate the auto-completion script on every PowerShell startup. You can also add the generated script directly to your $PROFILE file.
To add the generated script to your $PROFILE file, run the following line in your PowerShell prompt:
kubectl completion powershell >> $PROFILE
After reloading your shell, kubectl autocompletion should be working.
4.1.4.6 - verify kubectl install
In order for kubectl to find and access a Kubernetes cluster, it needs a kubeconfig file, which is created automatically when you create a cluster using kube-up.sh or successfully deploy a Minikube cluster.
By default, kubectl configuration is located at ~/.kube/config.
Check that kubectl is properly configured by getting the cluster state:
kubectl cluster-info
If you see a URL response, kubectl is correctly configured to access your cluster.
If you see a message similar to the following, kubectl is not configured correctly or is not able to connect to a Kubernetes cluster.
The connection to the server <server-name:port> was refused - did you specify the right host or port?
For example, if you are intending to run a Kubernetes cluster on your laptop (locally), you will need a tool like Minikube to be installed first and then re-run the commands stated above.
If kubectl cluster-info returns the URL response but you can't access your cluster, check whether it is configured properly by using:
kubectl cluster-info dump
4.1.4.7 - What's next?
- Install Minikube
- See the getting started guides for more about creating clusters.
- Learn how to launch and expose your application.
- If you need access to a cluster you didn't create, see the Sharing Cluster Access document.
- Read the kubectl reference docs
4.1.4.8 - zsh auto-completion
The kubectl completion script for Zsh can be generated with the command kubectl completion zsh. Sourcing the completion script in your shell enables kubectl autocompletion.
To do so in all your shell sessions, add the following to your ~/.zshrc file:
source <(kubectl completion zsh)
If you have an alias for kubectl, kubectl autocompletion will automatically work with it.
After reloading your shell, kubectl autocompletion should be working.
If you get an error like 2: command not found: compdef, then add the following to the beginning of your ~/.zshrc file:
autoload -Uz compinit
compinit
4.2 - Administer a Cluster
4.2.1 - Administration with kubeadm
4.2.1.1 - Certificate Management with kubeadm
Kubernetes v1.15 [stable]
Client certificates generated by kubeadm expire after 1 year. This page explains how to manage certificate renewals with kubeadm. It also covers other tasks related to kubeadm certificate management.
Before you begin
You should be familiar with PKI certificates and requirements in Kubernetes.
Using custom certificates
By default, kubeadm generates all the certificates needed for a cluster to run. You can override this behavior by providing your own certificates.
To do so, you must place them in whatever directory is specified by the --cert-dir flag or the certificatesDir field of kubeadm's ClusterConfiguration.
By default this is /etc/kubernetes/pki.
If a given certificate and private key pair exists before running kubeadm init, kubeadm does not overwrite them. This means you can, for example, copy an existing CA into /etc/kubernetes/pki/ca.crt and /etc/kubernetes/pki/ca.key, and kubeadm will use this CA for signing the rest of the certificates.
External CA mode
It is also possible to provide only the ca.crt file and not the ca.key file (this is only available for the root CA file, not other cert pairs). If all other certificates and kubeconfig files are in place, kubeadm recognizes this condition and activates the "External CA" mode. kubeadm will proceed without the CA key on disk.
Instead, run the controller-manager standalone with --controllers=csrsigner and point to the CA certificate and key.
PKI certificates and requirements includes guidance on setting up a cluster to use an external CA.
Check certificate expiration
You can use the check-expiration subcommand to check when certificates expire:
kubeadm certs check-expiration
The output is similar to this:
CERTIFICATE EXPIRES RESIDUAL TIME CERTIFICATE AUTHORITY EXTERNALLY MANAGED
admin.conf Dec 30, 2020 23:36 UTC 364d no
apiserver Dec 30, 2020 23:36 UTC 364d ca no
apiserver-etcd-client Dec 30, 2020 23:36 UTC 364d etcd-ca no
apiserver-kubelet-client Dec 30, 2020 23:36 UTC 364d ca no
controller-manager.conf Dec 30, 2020 23:36 UTC 364d no
etcd-healthcheck-client Dec 30, 2020 23:36 UTC 364d etcd-ca no
etcd-peer Dec 30, 2020 23:36 UTC 364d etcd-ca no
etcd-server Dec 30, 2020 23:36 UTC 364d etcd-ca no
front-proxy-client Dec 30, 2020 23:36 UTC 364d front-proxy-ca no
scheduler.conf Dec 30, 2020 23:36 UTC 364d no
CERTIFICATE AUTHORITY EXPIRES RESIDUAL TIME EXTERNALLY MANAGED
ca Dec 28, 2029 23:36 UTC 9y no
etcd-ca Dec 28, 2029 23:36 UTC 9y no
front-proxy-ca Dec 28, 2029 23:36 UTC 9y no
The command shows expiration/residual time for the client certificates in the /etc/kubernetes/pki folder and for the client certificate embedded in the KUBECONFIG files used by kubeadm (admin.conf, controller-manager.conf and scheduler.conf).
Additionally, kubeadm informs the user if the certificate is externally managed; in this case, the user should take care of managing certificate renewal manually or by using other tools. kubeadm cannot manage certificates signed by an external CA.
kubelet.conf is not included in the list above because kubeadm configures the kubelet for automatic certificate renewal with rotatable certificates under /var/lib/kubelet/pki. To repair an expired kubelet client certificate see Kubelet client certificate rotation fails.
On nodes created with kubeadm init, prior to kubeadm version 1.17, there is a bug where you manually have to modify the contents of kubelet.conf. After kubeadm init finishes, you should update kubelet.conf to point to the rotated kubelet client certificates, by replacing client-certificate-data and client-key-data with:
client-certificate: /var/lib/kubelet/pki/kubelet-client-current.pem
client-key: /var/lib/kubelet/pki/kubelet-client-current.pem
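If you prefer to script this edit, a minimal sketch (assuming the default kubeadm file locations shown above; back up the file first) could look like:
# back up kubelet.conf, then point it at the rotated client certificates
sudo cp /etc/kubernetes/kubelet.conf /etc/kubernetes/kubelet.conf.bak
sudo sed -i \
  -e 's|client-certificate-data:.*|client-certificate: /var/lib/kubelet/pki/kubelet-client-current.pem|' \
  -e 's|client-key-data:.*|client-key: /var/lib/kubelet/pki/kubelet-client-current.pem|' \
  /etc/kubernetes/kubelet.conf
sudo systemctl restart kubelet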
Automatic certificate renewal
kubeadm renews all the certificates during control plane upgrade.
This feature is designed for addressing the simplest use cases; if you don't have specific requirements on certificate renewal and perform Kubernetes version upgrades regularly (less than 1 year in between each upgrade), kubeadm will take care of keeping your cluster up to date and reasonably secure.
If you have more complex requirements for certificate renewal, you can opt out from the default behavior by passing --certificate-renewal=false
to kubeadm upgrade apply
or to kubeadm upgrade node
.
Note: prior to kubeadm version 1.17, --certificate-renewal was false by default for the kubeadm upgrade node command. In that case, you should explicitly set --certificate-renewal=true.
Manual certificate renewal
You can renew your certificates manually at any time with the kubeadm certs renew command.
This command performs the renewal using the CA (or front-proxy CA) certificate and key stored in /etc/kubernetes/pki.
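For example, to renew every certificate that kubeadm manages on this node:
kubeadm certs renew all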
After running the command you should restart the control plane Pods. This is required since
dynamic certificate reload is currently not supported for all components and certificates.
Static Pods are managed by the local kubelet and not by the API server, thus kubectl cannot be used to delete and restart them. To restart a static Pod you can temporarily remove its manifest file from /etc/kubernetes/manifests/ and wait for 20 seconds (see the fileCheckFrequency value in the KubeletConfiguration struct). The kubelet will terminate the Pod if it's no longer in the manifest directory. You can then move the file back and after another fileCheckFrequency period, the kubelet will recreate the Pod and the certificate renewal for the component can complete.
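As a concrete sketch for one component (the manifest file name below is the kubeadm default for the API server), the restart could look like:
# move the manifest out of the watched directory, wait past fileCheckFrequency, then move it back
sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
sleep 20
sudo mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/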
certs renew uses the existing certificates as the authoritative source for attributes (Common Name, Organization, SAN, etc.) instead of the kubeadm-config ConfigMap. It is strongly recommended to keep them both in sync.
kubeadm certs renew provides the following options:
The Kubernetes certificates normally reach their expiration date after one year.
- --csr-only can be used to renew certificates with an external CA by generating certificate signing requests (without actually renewing certificates in place); see the next paragraph for more information.
- It's also possible to renew a single certificate instead of all.
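For example, renewing only the API server serving certificate:
kubeadm certs renew apiserver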
Renew certificates with the Kubernetes certificates API
This section provides more details about how to execute manual certificate renewal using the Kubernetes certificates API.
Set up a signer
The Kubernetes Certificate Authority does not work out of the box. You can configure an external signer such as cert-manager, or you can use the built-in signer.
The built-in signer is part of kube-controller-manager
.
To activate the built-in signer, you must pass the --cluster-signing-cert-file
and --cluster-signing-key-file
flags.
If you're creating a new cluster, you can use a kubeadm configuration file:
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
controllerManager:
  extraArgs:
    cluster-signing-cert-file: /etc/kubernetes/pki/ca.crt
    cluster-signing-key-file: /etc/kubernetes/pki/ca.key
Create certificate signing requests (CSR)
See Create CertificateSigningRequest for creating CSRs with the Kubernetes API.
Renew certificates with external CA
This section provides more details about how to execute manual certificate renewal using an external CA.
To better integrate with external CAs, kubeadm can also produce certificate signing requests (CSRs). A CSR represents a request to a CA for a signed certificate for a client. In kubeadm terms, any certificate that would normally be signed by an on-disk CA can be produced as a CSR instead. A CA, however, cannot be produced as a CSR.
Create certificate signing requests (CSR)
You can create certificate signing requests with kubeadm certs renew --csr-only.
Both the CSR and the accompanying private key are given in the output. You can pass in a directory with --csr-dir to output the CSRs to the specified location. If --csr-dir is not specified, the default certificate directory (/etc/kubernetes/pki) is used.
Certificates can be renewed with kubeadm certs renew --csr-only. As with kubeadm init, an output directory can be specified with the --csr-dir flag.
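For instance, the following sketch writes the CSR and key for just the API server certificate to a scratch directory (the directory path is illustrative):
kubeadm certs renew apiserver --csr-only --csr-dir=/tmp/kubeadm-csrs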
A CSR contains a certificate's name, domains, and IPs, but it does not specify usages. It is the responsibility of the CA to specify the correct cert usages when issuing a certificate.
- In openssl this is done with the openssl ca command.
- In cfssl you specify usages in the config file.
After a certificate is signed using your preferred method, the certificate and the private key must be copied to the PKI directory (by default /etc/kubernetes/pki).
Certificate authority (CA) rotation
Kubeadm does not support rotation or replacement of CA certificates out of the box.
For more information about manual rotation or replacement of CA, see manual rotation of CA certificates.
Enabling signed kubelet serving certificates
By default the kubelet serving certificate deployed by kubeadm is self-signed. This means a connection from external services like the metrics-server to a kubelet cannot be secured with TLS.
To configure the kubelets in a new kubeadm cluster to obtain properly signed serving certificates you must pass the following minimal configuration to kubeadm init:
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
serverTLSBootstrap: true
If you have already created the cluster you must adapt it by doing the following:
- Find and edit the kubelet-config-1.23 ConfigMap in the kube-system namespace. In that ConfigMap, the kubelet key has a KubeletConfiguration document as its value. Edit the KubeletConfiguration document to set serverTLSBootstrap: true.
- On each node, add the serverTLSBootstrap: true field in /var/lib/kubelet/config.yaml and restart the kubelet with systemctl restart kubelet.
The field serverTLSBootstrap: true will enable the bootstrap of kubelet serving certificates by requesting them from the certificates.k8s.io API. One known limitation is that the CSRs (Certificate Signing Requests) for these certificates cannot be automatically approved by the default signer in the kube-controller-manager - kubernetes.io/kubelet-serving. This will require action from the user or a third-party controller.
These CSRs can be viewed using:
kubectl get csr
NAME AGE SIGNERNAME REQUESTOR CONDITION
csr-9wvgt 112s kubernetes.io/kubelet-serving system:node:worker-1 Pending
csr-lz97v 1m58s kubernetes.io/kubelet-serving system:node:control-plane-1 Pending
To approve them you can do the following:
kubectl certificate approve <CSR-name>
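If there are many pending requests, one way to narrow the list to kubelet serving CSRs before approving them is a field selector (a sketch; always verify each request before approval, for the reasons discussed below):
kubectl get csr --field-selector spec.signerName=kubernetes.io/kubelet-serving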
By default, these serving certificates will expire after one year. Kubeadm sets the KubeletConfiguration field rotateCertificates to true, which means that close to expiration a new set of CSRs for the serving certificates will be created and must be approved to complete the rotation. To understand more see Certificate Rotation.
If you are looking for a solution for automatic approval of these CSRs it is recommended that you contact your cloud provider and ask if they have a CSR signer that verifies the node identity with an out of band mechanism.
Third-party custom controllers can also be used.
Such a controller is not a secure mechanism unless it not only verifies the CommonName in the CSR but also verifies the requested IPs and domain names. This would prevent a malicious actor that has access to a kubelet client certificate from creating CSRs requesting serving certificates for any IP or domain name.
Generating kubeconfig files for additional users
During cluster creation, kubeadm signs the certificate in the admin.conf to have Subject: O = system:masters, CN = kubernetes-admin.
system:masters is a break-glass, super user group that bypasses the authorization layer (e.g. RBAC). Sharing the admin.conf with additional users is not recommended!
Instead, you can use the kubeadm kubeconfig user command to generate kubeconfig files for additional users. The command accepts a mixture of command line flags and kubeadm configuration options. The generated kubeconfig will be written to stdout and can be piped to a file using kubeadm kubeconfig user ... > somefile.conf.
Example configuration file that can be used with --config:
# example.yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
# Will be used as the target "cluster" in the kubeconfig
clusterName: "kubernetes"
# Will be used as the "server" (IP or DNS name) of this cluster in the kubeconfig
controlPlaneEndpoint: "some-dns-address:6443"
# The cluster CA key and certificate will be loaded from this local directory
certificatesDir: "/etc/kubernetes/pki"
Make sure that these settings match the desired target cluster settings. To see the settings of an existing cluster use:
kubectl get cm kubeadm-config -n kube-system -o=jsonpath="{.data.ClusterConfiguration}"
The following example will generate a kubeconfig file with credentials valid for 24 hours for a new user johndoe that is part of the appdevs group:
kubeadm kubeconfig user --config example.yaml --org appdevs --client-name johndoe --validity-period 24h
The following example will generate a kubeconfig file with administrator credentials valid for 1 week:
kubeadm kubeconfig user --config example.yaml --client-name admin --validity-period 168h
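To actually use such a file, you could redirect it and point kubectl at it (a sketch; the output file name is illustrative, and the new user still needs RBAC bindings before most requests will succeed):
kubeadm kubeconfig user --config example.yaml --org appdevs --client-name johndoe --validity-period 24h > johndoe.conf
kubectl --kubeconfig johndoe.conf get pods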
4.2.1.2 - Configuring a cgroup driver
This page explains how to configure the kubelet cgroup driver to match the container runtime cgroup driver for kubeadm clusters.
Before you begin
You should be familiar with the Kubernetes container runtime requirements.
Configuring the container runtime cgroup driver
The Container runtimes page explains that the systemd driver is recommended for kubeadm based setups instead of the cgroupfs driver, because kubeadm manages the kubelet as a systemd service.
The page also provides details on how to set up a number of different container runtimes with the systemd driver by default.
Configuring the kubelet cgroup driver
kubeadm allows you to pass a KubeletConfiguration structure during kubeadm init. This KubeletConfiguration can include the cgroupDriver field which controls the cgroup driver of the kubelet.
Note: in v1.22, if the user is not setting the cgroupDriver field under KubeletConfiguration, kubeadm will default it to systemd.
A minimal example of configuring the field explicitly:
# kubeadm-config.yaml
kind: ClusterConfiguration
apiVersion: kubeadm.k8s.io/v1beta3
kubernetesVersion: v1.21.0
---
kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
cgroupDriver: systemd
Such a configuration file can then be passed to the kubeadm command:
kubeadm init --config kubeadm-config.yaml
Kubeadm uses the same KubeletConfiguration for all nodes in the cluster. The KubeletConfiguration is stored in a ConfigMap object under the kube-system namespace.
Executing the subcommands init, join and upgrade would result in kubeadm writing the KubeletConfiguration as a file under /var/lib/kubelet/config.yaml and passing it to the local node kubelet.
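A quick way to confirm which driver a given node's kubelet ended up with (assuming the kubeadm-managed file location above) is:
grep cgroupDriver /var/lib/kubelet/config.yaml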
Using the cgroupfs driver
As this guide explains, using the cgroupfs driver with kubeadm is not recommended.
To continue using cgroupfs and to prevent kubeadm upgrade from modifying the KubeletConfiguration cgroup driver on existing setups, you must be explicit about its value. This applies to a case where you do not wish future versions of kubeadm to apply the systemd driver by default.
See the below section on "Modify the kubelet ConfigMap" for details on how to be explicit about the value.
If you wish to configure a container runtime to use the cgroupfs driver, you must refer to the documentation of the container runtime of your choice.
Migrating to the systemd driver
To change the cgroup driver of an existing kubeadm cluster to systemd in-place, a similar procedure to a kubelet upgrade is required. This must include both steps outlined below.
Note: alternatively, it is possible to replace the old nodes in the cluster with new ones that use the systemd driver. This requires executing only the first step below before joining the new nodes and ensuring the workloads can safely move to the new nodes before deleting the old nodes.
Modify the kubelet ConfigMap
- Find the kubelet ConfigMap name using kubectl get cm -n kube-system | grep kubelet-config.
- Call kubectl edit cm kubelet-config-x.yy -n kube-system (replace x.yy with the Kubernetes version).
- Either modify the existing cgroupDriver value or add a new field that looks like this:
  cgroupDriver: systemd
  This field must be present under the kubelet: section of the ConfigMap.
Update the cgroup driver on all nodes
For each node in the cluster:
- Drain the node using kubectl drain <node-name> --ignore-daemonsets
- Stop the kubelet using systemctl stop kubelet
- Stop the container runtime
- Modify the container runtime cgroup driver to systemd
- Set cgroupDriver: systemd in /var/lib/kubelet/config.yaml
- Start the container runtime
- Start the kubelet using systemctl start kubelet
- Uncordon the node using kubectl uncordon <node-name>
Execute these steps on nodes one at a time to ensure workloads have sufficient time to schedule on different nodes.
Once the process is complete ensure that all nodes and workloads are healthy.
4.2.1.3 - Upgrading kubeadm clusters
This page explains how to upgrade a Kubernetes cluster created with kubeadm from version 1.22.x to version 1.23.x, and from version 1.23.x to 1.23.y (where y > x). Skipping MINOR versions when upgrading is unsupported.
To see information about upgrading clusters created using older versions of kubeadm, please refer to following pages instead:
- Upgrading a kubeadm cluster from 1.21 to 1.22
- Upgrading a kubeadm cluster from 1.20 to 1.21
- Upgrading a kubeadm cluster from 1.19 to 1.20
- Upgrading a kubeadm cluster from 1.18 to 1.19
The upgrade workflow at high level is the following:
- Upgrade a primary control plane node.
- Upgrade additional control plane nodes.
- Upgrade worker nodes.
Before you begin
- Make sure you read the release notes carefully.
- The cluster should use a static control plane and etcd pods or external etcd.
- Make sure to back up any important components, such as app-level state stored in a database. kubeadm upgrade does not touch your workloads, only components internal to Kubernetes, but backups are always a best practice.
- Swap must be disabled.
Additional information
- The instructions below outline when to drain each node during the upgrade process. If you are performing a minor version upgrade for any kubelet, you must first drain the node (or nodes) that you are upgrading. In the case of control plane nodes, they could be running CoreDNS Pods or other critical workloads. For more information see Draining nodes.
- All containers are restarted after upgrade, because the container spec hash value is changed.
- To verify that the kubelet service has successfully restarted after the kubelet has been upgraded, you can execute systemctl status kubelet or view the service logs with journalctl -xeu kubelet.
Determine which version to upgrade to
Find the latest patch release for Kubernetes 1.23 using the OS package manager:
apt update
apt-cache madison kubeadm
# find the latest 1.23 version in the list
# it should look like 1.23.x-00, where x is the latest patch
yum list --showduplicates kubeadm --disableexcludes=kubernetes
# find the latest 1.23 version in the list
# it should look like 1.23.x-0, where x is the latest patch
Upgrading control plane nodes
The upgrade procedure on control plane nodes should be executed one node at a time.
Pick a control plane node that you wish to upgrade first. It must have the /etc/kubernetes/admin.conf file.
Call "kubeadm upgrade"
For the first control plane node
- Upgrade kubeadm:
# replace x in 1.23.x-00 with the latest patch version
apt-mark unhold kubeadm && \
apt-get update && apt-get install -y kubeadm=1.23.x-00 && \
apt-mark hold kubeadm
# replace x in 1.23.x-0 with the latest patch version
yum install -y kubeadm-1.23.x-0 --disableexcludes=kubernetes
- Verify that the download works and has the expected version:
  kubeadm version
- Verify the upgrade plan:
  kubeadm upgrade plan
This command checks that your cluster can be upgraded, and fetches the versions you can upgrade to. It also shows a table with the component config version states.
kubeadm upgrade also automatically renews the certificates that it manages on this node. To opt out of certificate renewal the flag --certificate-renewal=false can be used. For more information see the certificate management guide.
If kubeadm upgrade plan shows any component configs that require manual upgrade, users must provide a config file with replacement configs to kubeadm upgrade apply via the --config command line flag. Failing to do so will cause kubeadm upgrade apply to exit with an error and not perform an upgrade.
- Choose a version to upgrade to, and run the appropriate command. For example:
  # replace x with the patch version you picked for this upgrade
  sudo kubeadm upgrade apply v1.23.x
  Once the command finishes you should see:
  [upgrade/successful] SUCCESS! Your cluster was upgraded to "v1.23.x". Enjoy!
  [upgrade/kubelet] Now that your control plane is upgraded, please proceed with upgrading your kubelets if you haven't already done so.
- Manually upgrade your CNI provider plugin.
  Your Container Network Interface (CNI) provider may have its own upgrade instructions to follow. Check the addons page to find your CNI provider and see whether additional upgrade steps are required.
  This step is not required on additional control plane nodes if the CNI provider runs as a DaemonSet.
For the other control plane nodes
Same as the first control plane node but use:
sudo kubeadm upgrade node
instead of:
sudo kubeadm upgrade apply
Also, calling kubeadm upgrade plan and upgrading the CNI provider plugin is no longer needed.
Drain the node
- Prepare the node for maintenance by marking it unschedulable and evicting the workloads:
  # replace <node-to-drain> with the name of your node you are draining
  kubectl drain <node-to-drain> --ignore-daemonsets
Upgrade kubelet and kubectl
- Upgrade the kubelet and kubectl:
# replace x in 1.23.x-00 with the latest patch version
apt-mark unhold kubelet kubectl && \
apt-get update && apt-get install -y kubelet=1.23.x-00 kubectl=1.23.x-00 && \
apt-mark hold kubelet kubectl
# replace x in 1.23.x-0 with the latest patch version
yum install -y kubelet-1.23.x-0 kubectl-1.23.x-0 --disableexcludes=kubernetes
- Restart the kubelet:
  sudo systemctl daemon-reload
  sudo systemctl restart kubelet
Uncordon the node
- Bring the node back online by marking it schedulable:
  # replace <node-to-drain> with the name of your node
  kubectl uncordon <node-to-drain>
Upgrade worker nodes
The upgrade procedure on worker nodes should be executed one node at a time or few nodes at a time, without compromising the minimum required capacity for running your workloads.
Upgrade kubeadm
- Upgrade kubeadm:
# replace x in 1.23.x-00 with the latest patch version
apt-mark unhold kubeadm && \
apt-get update && apt-get install -y kubeadm=1.23.x-00 && \
apt-mark hold kubeadm
# replace x in 1.23.x-0 with the latest patch version
yum install -y kubeadm-1.23.x-0 --disableexcludes=kubernetes
Call "kubeadm upgrade"
- For worker nodes this upgrades the local kubelet configuration:
  sudo kubeadm upgrade node
Drain the node
- Prepare the node for maintenance by marking it unschedulable and evicting the workloads:
  # replace <node-to-drain> with the name of your node you are draining
  kubectl drain <node-to-drain> --ignore-daemonsets
Upgrade kubelet and kubectl
- Upgrade the kubelet and kubectl:
# replace x in 1.23.x-00 with the latest patch version
apt-mark unhold kubelet kubectl && \
apt-get update && apt-get install -y kubelet=1.23.x-00 kubectl=1.23.x-00 && \
apt-mark hold kubelet kubectl
# replace x in 1.23.x-0 with the latest patch version
yum install -y kubelet-1.23.x-0 kubectl-1.23.x-0 --disableexcludes=kubernetes
- Restart the kubelet:
  sudo systemctl daemon-reload
  sudo systemctl restart kubelet
Uncordon the node
- Bring the node back online by marking it schedulable:
  # replace <node-to-drain> with the name of your node
  kubectl uncordon <node-to-drain>
Verify the status of the cluster
After the kubelet is upgraded on all nodes verify that all nodes are available again by running the following command from anywhere kubectl can access the cluster:
kubectl get nodes
The STATUS column should show Ready for all your nodes, and the version number should be updated.
Recovering from a failure state
If kubeadm upgrade fails and does not roll back, for example because of an unexpected shutdown during execution, you can run kubeadm upgrade again. This command is idempotent and eventually makes sure that the actual state is the desired state you declare.
To recover from a bad state, you can also run kubeadm upgrade apply --force without changing the version that your cluster is running.
During upgrade kubeadm writes the following backup folders under /etc/kubernetes/tmp:
- kubeadm-backup-etcd-<date>-<time>
- kubeadm-backup-manifests-<date>-<time>
kubeadm-backup-etcd contains a backup of the local etcd member data for this control plane Node. In case of an etcd upgrade failure and if the automatic rollback does not work, the contents of this folder can be manually restored in /var/lib/etcd. In case external etcd is used this backup folder will be empty.
kubeadm-backup-manifests contains a backup of the static Pod manifest files for this control plane Node. In case of an upgrade failure and if the automatic rollback does not work, the contents of this folder can be manually restored in /etc/kubernetes/manifests. If for some reason there is no difference between a pre-upgrade and post-upgrade manifest file for a certain component, a backup file for it will not be written.
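As an illustrative sketch (the folder suffix is the actual date and time on your system), restoring the manifest backup by hand could look like:
# replace <date>-<time> with the real suffix of your backup folder
sudo cp /etc/kubernetes/tmp/kubeadm-backup-manifests-<date>-<time>/* /etc/kubernetes/manifests/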
How it works
kubeadm upgrade apply does the following:
- Checks that your cluster is in an upgradeable state:
  - The API server is reachable
  - All nodes are in the Ready state
  - The control plane is healthy
- Enforces the version skew policies.
- Makes sure the control plane images are available or available to pull to the machine.
- Generates replacements and/or uses user-supplied overwrites if component configs require version upgrades.
- Upgrades the control plane components or rolls back if any of them fails to come up.
- Applies the new CoreDNS and kube-proxy manifests and makes sure that all necessary RBAC rules are created.
- Creates new certificate and key files of the API server and backs up old files if they're about to expire in 180 days.
kubeadm upgrade node does the following on additional control plane nodes:
- Fetches the kubeadm ClusterConfiguration from the cluster.
- Optionally backs up the kube-apiserver certificate.
- Upgrades the static Pod manifests for the control plane components.
- Upgrades the kubelet configuration for this node.
kubeadm upgrade node does the following on worker nodes:
- Fetches the kubeadm ClusterConfiguration from the cluster.
- Upgrades the kubelet configuration for this node.
4.2.1.4 - Adding Windows nodes
Kubernetes v1.18 [beta]
You can use Kubernetes to run a mixture of Linux and Windows nodes, so you can mix Pods that run on Linux with Pods that run on Windows. This page shows how to register Windows nodes to your cluster.
Before you begin
Your Kubernetes server must be at or later than version 1.17. To check the version, enter kubectl version.
- Obtain a Windows Server 2019 license (or higher) in order to configure the Windows node that hosts Windows containers. If you are using VXLAN/Overlay networking you must also have KB4489899 installed.
- A Linux-based Kubernetes kubeadm cluster in which you have access to the control plane (see Creating a single control-plane cluster with kubeadm).
Objectives
- Register a Windows node to the cluster
- Configure networking so Pods and Services on Linux and Windows can communicate with each other
Getting Started: Adding a Windows Node to Your Cluster
Networking Configuration
Once you have a Linux-based Kubernetes control-plane node you are ready to choose a networking solution. This guide illustrates using Flannel in VXLAN mode for simplicity.
Configuring Flannel
- Prepare Kubernetes control plane for Flannel
  Some minor preparation is recommended on the Kubernetes control plane in our cluster. It is recommended to enable bridged IPv4 traffic to iptables chains when using Flannel. The following command must be run on all Linux nodes:
  sudo sysctl net.bridge.bridge-nf-call-iptables=1
- Download & configure Flannel for Linux
  Download the most recent Flannel manifest:
  wget https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
  Modify the net-conf.json section of the flannel manifest in order to set the VNI to 4096 and the Port to 4789. It should look as follows:
  net-conf.json: |
    {
      "Network": "10.244.0.0/16",
      "Backend": {
        "Type": "vxlan",
        "VNI": 4096,
        "Port": 4789
      }
    }
Note: The VNI must be set to 4096 and the port to 4789 for Flannel on Linux to interoperate with Flannel on Windows. See the VXLAN documentation for an explanation of these fields.
Note: To use L2Bridge/Host-gateway mode instead, change the value of Type to "host-gw" and omit VNI and Port.
- Apply the Flannel manifest and validate
  Let's apply the Flannel configuration:
  kubectl apply -f kube-flannel.yml
  After a few minutes, you should see all the pods as running if the Flannel pod network was deployed.
  kubectl get pods -n kube-system
  The output should include the Linux flannel DaemonSet as running:
  NAMESPACE     NAME                    READY  STATUS   RESTARTS  AGE
  ...
  kube-system   kube-flannel-ds-54954   1/1    Running  0         1m
- Add Windows Flannel and kube-proxy DaemonSets
  Now you can add Windows-compatible versions of Flannel and kube-proxy. In order to ensure that you get a compatible version of kube-proxy, you'll need to substitute the tag of the image. The following example shows usage for Kubernetes v1.23.0, but you should adjust the version for your own deployment.
  curl -L https://github.com/kubernetes-sigs/sig-windows-tools/releases/latest/download/kube-proxy.yml | sed 's/VERSION/v1.23.0/g' | kubectl apply -f -
  kubectl apply -f https://github.com/kubernetes-sigs/sig-windows-tools/releases/latest/download/flannel-overlay.yml
  Note: If you're using host-gateway, use https://github.com/kubernetes-sigs/sig-windows-tools/releases/latest/download/flannel-host-gw.yml instead.
  Note: If you're using a different interface rather than Ethernet (for example "Ethernet0 2") on the Windows nodes, you have to modify the line:
  wins cli process run --path /k/flannel/setup.exe --args "--mode=overlay --interface=Ethernet"
  in the flannel-host-gw.yml or flannel-overlay.yml file and specify your interface accordingly.
  # Example
  curl -L https://github.com/kubernetes-sigs/sig-windows-tools/releases/latest/download/flannel-overlay.yml | sed 's/Ethernet/Ethernet0 2/g' | kubectl apply -f -
Joining a Windows worker node
Install Docker EE
Install the Containers feature:
Install-WindowsFeature -Name containers
Install Docker. Instructions to do so are available at Install Docker Engine - Enterprise on Windows Servers.
Install wins, kubelet, and kubeadm
curl.exe -LO https://raw.githubusercontent.com/kubernetes-sigs/sig-windows-tools/master/kubeadm/scripts/PrepareNode.ps1
.\PrepareNode.ps1 -KubernetesVersion v1.23.0
Run kubeadm to join the node
Use the command that was given to you when you ran kubeadm init on a control plane host. If you no longer have this command, or the token has expired, you can run kubeadm token create --print-join-command (on a control plane host) to generate a new token and join command.
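The join command printed by kubeadm init or kubeadm token create --print-join-command has the following general shape; the endpoint, token, and hash are placeholders:
kubeadm join <control-plane-endpoint>:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>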
Install containerD
curl.exe -LO https://github.com/kubernetes-sigs/sig-windows-tools/releases/latest/download/Install-Containerd.ps1
.\Install-Containerd.ps1
To install a specific version of containerD specify the version with -ContainerDVersion.
# Example
.\Install-Containerd.ps1 -ContainerDVersion 1.4.1
If you're using a different interface rather than Ethernet (for example "Ethernet0 2") on the Windows nodes, specify the name with -netAdapterName.
# Example
.\Install-Containerd.ps1 -netAdapterName "Ethernet0 2"
Install wins, kubelet, and kubeadm
curl.exe -LO https://raw.githubusercontent.com/kubernetes-sigs/sig-windows-tools/master/kubeadm/scripts/PrepareNode.ps1
.\PrepareNode.ps1 -KubernetesVersion v1.23.0 -ContainerRuntime containerD
Run kubeadm to join the node
Use the command that was given to you when you ran kubeadm init on a control plane host. If you no longer have this command, or the token has expired, you can run kubeadm token create --print-join-command (on a control plane host) to generate a new token and join command.
Note: if you're using containerd, append --cri-socket "npipe:////./pipe/containerd-containerd" to the kubeadm join call.
Verifying your installation
You should now be able to view the Windows node in your cluster by running:
kubectl get nodes -o wide
If your new node is in the NotReady state it is likely because the flannel image is still downloading. You can check the progress as before by checking on the flannel pods in the kube-system namespace:
kubectl -n kube-system get pods -l app=flannel
Once the flannel Pod is running, your node should enter the Ready state and then be available to handle workloads.
What's next
4.2.1.5 - Upgrading Windows nodes
Kubernetes v1.18 [beta]
This page explains how to upgrade a Windows node created with kubeadm.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
Your Kubernetes server must be at or later than version 1.17. To check the version, enter kubectl version.
- Familiarize yourself with the process for upgrading the rest of your kubeadm cluster. You will want to upgrade the control plane nodes before upgrading your Windows nodes.
Upgrading worker nodes
Upgrade kubeadm
- From the Windows node, upgrade kubeadm:
  # replace v1.23.0 with your desired version
  curl.exe -Lo C:\k\kubeadm.exe https://dl.k8s.io/v1.23.0/bin/windows/amd64/kubeadm.exe
Drain the node
- From a machine with access to the Kubernetes API, prepare the node for maintenance by marking it unschedulable and evicting the workloads:
  # replace <node-to-drain> with the name of your node you are draining
  kubectl drain <node-to-drain> --ignore-daemonsets
  You should see output similar to this:
  node/ip-172-31-85-18 cordoned
  node/ip-172-31-85-18 drained
Upgrade the kubelet configuration
- From the Windows node, call the following command to sync new kubelet configuration:
  kubeadm upgrade node
Upgrade kubelet
- From the Windows node, upgrade and restart the kubelet:
  stop-service kubelet
  curl.exe -Lo C:\k\kubelet.exe https://dl.k8s.io/v1.23.0/bin/windows/amd64/kubelet.exe
  restart-service kubelet
Uncordon the node
- From a machine with access to the Kubernetes API, bring the node back online by marking it schedulable:
  # replace <node-to-drain> with the name of your node
  kubectl uncordon <node-to-drain>
Upgrade kube-proxy
- From a machine with access to the Kubernetes API, run the following, again replacing v1.23.0 with your desired version:
  curl -L https://github.com/kubernetes-sigs/sig-windows-tools/releases/latest/download/kube-proxy.yml | sed 's/VERSION/v1.23.0/g' | kubectl apply -f -
4.2.2 - Migrating from dockershim
This section presents information you need to know when migrating from dockershim to other container runtimes.
Since the announcement of dockershim deprecation in Kubernetes 1.20, there have been questions on how this will affect various workloads and Kubernetes installations. Our Dockershim Removal FAQ is there to help you understand the problem better.
It is recommended to migrate from dockershim to alternative container runtimes. Check out the container runtimes section to learn about your options. Make sure to report any issues you encounter with the migration, so that the issues can be fixed in a timely manner and your cluster is ready for dockershim removal.
4.2.2.1 - Changing the Container Runtime on a Node from Docker Engine to containerd
This task outlines the steps needed to update your container runtime to containerd from Docker. It is applicable for cluster operators running Kubernetes 1.23 or earlier. This also covers an example scenario for migrating from dockershim to containerd; alternative container runtimes can be picked from this page.
Before you begin
Install containerd. For more information see containerd's installation documentation, and for the specific prerequisites follow this guide.
Drain the node
# replace <node-to-drain> with the name of your node you are draining
kubectl drain <node-to-drain> --ignore-daemonsets
Stop the Docker daemon
systemctl stop kubelet
systemctl disable docker.service --now
Install Containerd
This page contains detailed steps to install containerd.
- Install the containerd.io package from the official Docker repositories. Instructions for setting up the Docker repository for your respective Linux distribution and installing the containerd.io package can be found at Install Docker Engine.
- Configure containerd:
  sudo mkdir -p /etc/containerd
  containerd config default | sudo tee /etc/containerd/config.toml
- Restart containerd:
  sudo systemctl restart containerd
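If you also want containerd to use the systemd cgroup driver, the common pairing for kubeadm clusters (see Configuring a cgroup driver), one sketch of the relevant stanza in /etc/containerd/config.toml is:
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true
Restart containerd again after changing this setting.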
Start a PowerShell session, set $Version to the desired version (for example: $Version="1.4.3"), and then run the following commands:
- Download containerd:
  curl.exe -L https://github.com/containerd/containerd/releases/download/v$Version/containerd-$Version-windows-amd64.tar.gz -o containerd-windows-amd64.tar.gz
  tar.exe xvf .\containerd-windows-amd64.tar.gz
- Extract and configure:
  Copy-Item -Path ".\bin\" -Destination "$Env:ProgramFiles\containerd" -Recurse -Force
  cd $Env:ProgramFiles\containerd\
  .\containerd.exe config default | Out-File config.toml -Encoding ascii
  # Review the configuration. Depending on setup you may want to adjust:
  # - the sandbox_image (Kubernetes pause image)
  # - cni bin_dir and conf_dir locations
  Get-Content config.toml
  # (Optional - but highly recommended) Exclude containerd from Windows Defender Scans
  Add-MpPreference -ExclusionProcess "$Env:ProgramFiles\containerd\containerd.exe"
- Start containerd:
  .\containerd.exe --register-service
  Start-Service containerd
Configure the kubelet to use containerd as its container runtime
Edit the file /var/lib/kubelet/kubeadm-flags.env and add the containerd runtime to the flags: --container-runtime=remote and --container-runtime-endpoint=unix:///run/containerd/containerd.sock.
The kubeadm
tool stores the CRI socket for each host as an annotation in the Node object for that host.
To change it you must do the following:
Execute kubectl edit no <NODE-NAME>
on a machine that has the kubeadm /etc/kubernetes/admin.conf
file.
This will start a text editor where you can edit the Node object.
To choose a text editor you can set the KUBE_EDITOR
environment variable.
-
Change the value of
kubeadm.alpha.kubernetes.io/cri-socket
from/var/run/dockershim.sock
to the CRI socket path of your choice (for exampleunix:///run/containerd/containerd.sock
).Note that new CRI socket paths must be prefixed with
unix://
ideally. -
Save the changes in the text editor, which will update the Node object.
Restart the kubelet
systemctl start kubelet
Verify that the node is healthy
Run kubectl get nodes -o wide and verify that containerd appears as the runtime for the node you just changed.
Remove Docker Engine
Finally, if everything goes well, remove Docker:
sudo yum remove docker-ce docker-ce-cli
sudo apt-get purge docker-ce docker-ce-cli
sudo dnf remove docker-ce docker-ce-cli
4.2.2.2 - Find Out What Container Runtime is Used on a Node
This page outlines steps to find out what container runtime the nodes in your cluster use.
Depending on the way you run your cluster, the container runtime for the nodes may have been pre-configured or you need to configure it. If you're using a managed Kubernetes service, there might be vendor-specific ways to check what container runtime is configured for the nodes. The method described on this page should work whenever the execution of kubectl is allowed.
Before you begin
Install and configure kubectl. See the Install Tools section for details.
Find out the container runtime used on a Node
Use kubectl to fetch and show node information:
kubectl get nodes -o wide
The output is similar to the following. The column CONTAINER-RUNTIME outputs the runtime and its version.
# For dockershim
NAME STATUS VERSION CONTAINER-RUNTIME
node-1 Ready v1.16.15 docker://19.3.1
node-2 Ready v1.16.15 docker://19.3.1
node-3 Ready v1.16.15 docker://19.3.1
# For containerd
NAME STATUS VERSION CONTAINER-RUNTIME
node-1 Ready v1.19.6 containerd://1.4.1
node-2 Ready v1.19.6 containerd://1.4.1
node-3 Ready v1.19.6 containerd://1.4.1
Find out more information about container runtimes on Container Runtimes page.
4.2.2.3 - Check whether Dockershim deprecation affects you
The dockershim component of Kubernetes allows you to use Docker as Kubernetes's container runtime. Kubernetes' built-in dockershim component was deprecated in release v1.20.
This page explains how your cluster could be using Docker as a container runtime, provides details on the role that dockershim plays when in use, and shows steps you can take to check whether any workloads could be affected by dockershim deprecation.
Finding whether your app has dependencies on Docker
If you are using Docker for building your application containers, you can still run these containers on any container runtime. This use of Docker does not count as a dependency on Docker as a container runtime.
When an alternative container runtime is used, executing Docker commands may either not work or yield unexpected output. This is how you can find whether you have a dependency on Docker:
- Make sure no privileged Pods execute Docker commands (like docker ps), restart the Docker service (commands such as systemctl restart docker.service), or modify Docker-specific files such as /etc/docker/daemon.json.
- Check for any private registries or image mirror settings in the Docker configuration file (like /etc/docker/daemon.json). Those typically need to be reconfigured for another container runtime.
- Check that scripts and apps running on nodes outside of your Kubernetes infrastructure do not execute Docker commands. It might be:
  - SSH to nodes to troubleshoot;
  - Node startup scripts;
  - Monitoring and security agents installed on nodes directly.
- Third-party tools that perform the above-mentioned privileged operations. See Migrating telemetry and security agents from dockershim for more information.
- Make sure there are no indirect dependencies on dockershim behavior. This is an edge case and unlikely to affect your application. Some tooling may be configured to react to Docker-specific behaviors, for example, raise an alert on specific metrics or search for a specific log message as part of troubleshooting instructions. If you have such tooling configured, test the behavior on a test cluster before migration.
Dependency on Docker explained
A container runtime is software that can execute the containers that make up a Kubernetes pod. Kubernetes is responsible for orchestration and scheduling of Pods; on each node, the kubelet uses the container runtime interface as an abstraction so that you can use any compatible container runtime.
In its earliest releases, Kubernetes offered compatibility with one container runtime: Docker.
Later in the Kubernetes project's history, cluster operators wanted to adopt additional container runtimes.
The CRI was designed to allow this kind of flexibility - and the kubelet began supporting CRI. However, because Docker existed before the CRI specification was invented, the Kubernetes project created an adapter component, dockershim. The dockershim adapter allows the kubelet to interact with Docker as if Docker were a CRI compatible runtime.
You can read about it in Kubernetes Containerd integration goes GA blog post.
Switching to Containerd as a container runtime eliminates the middleman. All the same containers can be run by container runtimes like Containerd as before. But now, since containers are scheduled directly with the container runtime, they are not visible to Docker. So any Docker tooling or fancy UI you might have used before to check on these containers is no longer available.
You cannot get container information using docker ps or docker inspect commands. As you cannot list containers, you cannot get logs, stop containers, or execute something inside a container using docker exec.
You can still pull images or build them using the docker build command. But images built or pulled by Docker will not be visible to the container runtime and Kubernetes. They need to be pushed to a registry to allow them to be used by Kubernetes.
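For example, a typical build-and-push flow that keeps images usable by Kubernetes looks like this (the registry host and tag are illustrative):
docker build -t registry.example.com/my-app:v1 .
docker push registry.example.com/my-app:v1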
4.2.2.4 - Migrating telemetry and security agents from dockershim
Kubernetes' support for direct integration with Docker Engine is deprecated, and will be removed. Most apps do not have a direct dependency on the runtime hosting containers. However, there are still a lot of telemetry and monitoring agents that have a dependency on Docker to collect container metadata, logs and metrics. This document aggregates information on how to detect these dependencies and links about how to migrate these agents to use generic tools or alternative runtimes.
Telemetry and security agents
Within a Kubernetes cluster there are a few different ways to run telemetry or security agents. Some agents have a direct dependency on Docker Engine when they run as DaemonSets or directly on nodes.
Why do some telemetry agents communicate with Docker Engine?
Historically, Kubernetes was written to work specifically with Docker Engine. Kubernetes took care of networking and scheduling, relying on Docker Engine for launching and running containers (within Pods) on a node. Some information that is relevant to telemetry, such as a pod name, is only available from Kubernetes components. Other data, such as container metrics, is not the responsibility of the container runtime. Early telemetry agents needed to query the container runtime and Kubernetes to report an accurate picture. Over time, Kubernetes gained the ability to support multiple runtimes, and now supports any runtime that is compatible with the container runtime interface.
Some telemetry agents rely specifically on Docker Engine tooling. For example, an agent might run a command such as docker ps or docker top to list containers and processes, or docker logs to receive streamed logs. If nodes in your existing cluster use Docker Engine, and you switch to a different container runtime, these commands will not work any longer.
Identify DaemonSets that depend on Docker Engine
If a pod wants to make calls to the dockerd running on the node, the pod must either:
- mount the filesystem containing the Docker daemon's privileged socket, as a volume; or
- mount the specific path of the Docker daemon's privileged socket directly, also as a volume.
For example: on COS images, Docker exposes its Unix domain socket at /var/run/docker.sock. This means that the pod spec will include a hostPath volume mount of /var/run/docker.sock.
Here's a sample shell script to find Pods that have a mount directly mapping the Docker socket. This script outputs the namespace and name of the pod. You can remove the grep '/var/run/docker.sock' to review other mounts.
kubectl get pods --all-namespaces \
-o=jsonpath='{range .items[*]}{"\n"}{.metadata.namespace}{":\t"}{.metadata.name}{":\t"}{range .spec.volumes[*]}{.hostPath.path}{", "}{end}{end}' \
| sort \
| grep '/var/run/docker.sock'
Note: there are alternative ways for a pod to reach the Docker socket; for instance, the parent directory /var/run may be mounted instead of the full path (like in this example). The script above only detects the most common uses.
Detecting Docker dependency from node agents
If your cluster nodes are customized and install additional security and telemetry agents on the node, make sure to check with the vendor of each agent whether it has a dependency on Docker.
Telemetry and security agent vendors
We keep the work-in-progress version of migration instructions for various telemetry and security agent vendors in a Google doc. Please contact the vendor to get up-to-date instructions for migrating from dockershim.
4.2.3 - Certificates
When using client certificate authentication, you can generate certificates manually through easyrsa, openssl, or cfssl.
easyrsa
easyrsa can manually generate certificates for your cluster.
- Download, unpack, and initialize the patched version of easyrsa3.
  curl -LO https://storage.googleapis.com/kubernetes-release/easy-rsa/easy-rsa.tar.gz
  tar xzf easy-rsa.tar.gz
  cd easy-rsa-master/easyrsa3
  ./easyrsa init-pki
- Generate a new certificate authority (CA). --batch sets automatic mode; --req-cn specifies the Common Name (CN) for the CA's new root certificate.
  ./easyrsa --batch "--req-cn=${MASTER_IP}@`date +%s`" build-ca nopass
- Generate server certificate and key. The argument --subject-alt-name sets the possible IPs and DNS names the API server will be accessed with. The MASTER_CLUSTER_IP is usually the first IP from the service CIDR that is specified as the --service-cluster-ip-range argument for both the API server and the controller manager component. The argument --days is used to set the number of days after which the certificate expires. The sample below also assumes that you are using cluster.local as the default DNS domain name.
  ./easyrsa --subject-alt-name="IP:${MASTER_IP},"\
  "IP:${MASTER_CLUSTER_IP},"\
  "DNS:kubernetes,"\
  "DNS:kubernetes.default,"\
  "DNS:kubernetes.default.svc,"\
  "DNS:kubernetes.default.svc.cluster,"\
  "DNS:kubernetes.default.svc.cluster.local" \
  --days=10000 \
  build-server-full server nopass
- Copy pki/ca.crt, pki/issued/server.crt, and pki/private/server.key to your directory.
- Fill in and add the following parameters into the API server start parameters:
  --client-ca-file=/yourdirectory/ca.crt
  --tls-cert-file=/yourdirectory/server.crt
  --tls-private-key-file=/yourdirectory/server.key
openssl
openssl can manually generate certificates for your cluster.
- Generate a 2048-bit ca.key:
  openssl genrsa -out ca.key 2048
- Based on the ca.key, generate a ca.crt (use -days to set the certificate effective time):
  openssl req -x509 -new -nodes -key ca.key -subj "/CN=${MASTER_IP}" -days 10000 -out ca.crt
- Generate a 2048-bit server.key:
  openssl genrsa -out server.key 2048
- Create a config file for generating a Certificate Signing Request (CSR). Be sure to substitute the values marked with angle brackets (e.g. <MASTER_IP>) with real values before saving this to a file (e.g. csr.conf). Note that the value for MASTER_CLUSTER_IP is the service cluster IP for the API server as described in the previous subsection. The sample below also assumes that you are using cluster.local as the default DNS domain name.
  [ req ]
  default_bits = 2048
  prompt = no
  default_md = sha256
  req_extensions = req_ext
  distinguished_name = dn
  [ dn ]
  C = <country>
  ST = <state>
  L = <city>
  O = <organization>
  OU = <organization unit>
  CN = <MASTER_IP>
  [ req_ext ]
  subjectAltName = @alt_names
  [ alt_names ]
  DNS.1 = kubernetes
  DNS.2 = kubernetes.default
  DNS.3 = kubernetes.default.svc
  DNS.4 = kubernetes.default.svc.cluster
  DNS.5 = kubernetes.default.svc.cluster.local
  IP.1 = <MASTER_IP>
  IP.2 = <MASTER_CLUSTER_IP>
  [ v3_ext ]
  authorityKeyIdentifier=keyid,issuer:always
  basicConstraints=CA:FALSE
  keyUsage=keyEncipherment,dataEncipherment
  extendedKeyUsage=serverAuth,clientAuth
  subjectAltName=@alt_names
- Generate the certificate signing request based on the config file:
  openssl req -new -key server.key -out server.csr -config csr.conf
- Generate the server certificate using the ca.key, ca.crt and server.csr:
  openssl x509 -req -in server.csr -CA ca.crt -CAkey ca.key \
  -CAcreateserial -out server.crt -days 10000 \
  -extensions v3_ext -extfile csr.conf
- View the certificate signing request:
  openssl req -noout -text -in ./server.csr
- View the certificate:
  openssl x509 -noout -text -in ./server.crt
Finally, add the same parameters into the API server start parameters.
cfssl
cfssl is another tool for certificate generation.
-
Download, unpack and prepare the command line tools as shown below. Note that you may need to adapt the sample commands based on the hardware architecture and cfssl version you are using.
curl -L https://github.com/cloudflare/cfssl/releases/download/v1.5.0/cfssl_1.5.0_linux_amd64 -o cfssl chmod +x cfssl curl -L https://github.com/cloudflare/cfssl/releases/download/v1.5.0/cfssljson_1.5.0_linux_amd64 -o cfssljson chmod +x cfssljson curl -L https://github.com/cloudflare/cfssl/releases/download/v1.5.0/cfssl-certinfo_1.5.0_linux_amd64 -o cfssl-certinfo chmod +x cfssl-certinfo
-
Create a directory to hold the artifacts and initialize cfssl:
mkdir cert cd cert ../cfssl print-defaults config > config.json ../cfssl print-defaults csr > csr.json
-
Create a JSON config file for generating the CA file, for example,
ca-config.json
:{ "signing": { "default": { "expiry": "8760h" }, "profiles": { "kubernetes": { "usages": [ "signing", "key encipherment", "server auth", "client auth" ], "expiry": "8760h" } } } }
-
Create a JSON config file for CA certificate signing request (CSR), for example,
ca-csr.json
. Be sure to replace the values marked with angle brackets with real values you want to use.{ "CN": "kubernetes", "key": { "algo": "rsa", "size": 2048 }, "names":[{ "C": "<country>", "ST": "<state>", "L": "<city>", "O": "<organization>", "OU": "<organization unit>" }] }
-
Generate CA key (
ca-key.pem
) and certificate (ca.pem
):../cfssl gencert -initca ca-csr.json | ../cfssljson -bare ca
-
Create a JSON config file for generating keys and certificates for the API server, for example,
server-csr.json
. Be sure to replace the values in angle brackets with real values you want to use. TheMASTER_CLUSTER_IP
is the service cluster IP for the API server as described in previous subsection. The sample below also assumes that you are usingcluster.local
as the default DNS domain name.{ "CN": "kubernetes", "hosts": [ "127.0.0.1", "<MASTER_IP>", "<MASTER_CLUSTER_IP>", "kubernetes", "kubernetes.default", "kubernetes.default.svc", "kubernetes.default.svc.cluster", "kubernetes.default.svc.cluster.local" ], "key": { "algo": "rsa", "size": 2048 }, "names": [{ "C": "<country>", "ST": "<state>", "L": "<city>", "O": "<organization>", "OU": "<organization unit>" }] }
-
Generate the key and certificate for the API server, which are by default saved into the files server-key.pem and server.pem respectively:

../cfssl gencert -ca=ca.pem -ca-key=ca-key.pem \
  --config=ca-config.json -profile=kubernetes \
  server-csr.json | ../cfssljson -bare server
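As an optional sanity check, you can inspect the generated certificate with the cfssl-certinfo tool you downloaded earlier (this assumes you are still inside the cert directory):
../cfssl-certinfo -cert server.pem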
Distributing Self-Signed CA Certificate
A client node may refuse to recognize a self-signed CA certificate as valid. For a non-production deployment, or for a deployment that runs behind a company firewall, you can distribute a self-signed CA certificate to all clients and refresh the local list of valid certificates.
On each client, perform the following operations:
sudo cp ca.crt /usr/local/share/ca-certificates/kubernetes.crt
sudo update-ca-certificates
Updating certificates in /etc/ssl/certs...
1 added, 0 removed; done.
Running hooks in /etc/ca-certificates/update.d....
done.
Certificates API
You can use the certificates.k8s.io API to provision x509 certificates to use for authentication as documented here.
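As an illustrative sketch of that API (the CSR object name is hypothetical, and the signer shown here is the one for client certificates; adapt both to your use case), you could submit a previously generated CSR like this:
cat <<EOF | kubectl apply -f -
apiVersion: certificates.k8s.io/v1
kind: CertificateSigningRequest
metadata:
  name: my-csr   # hypothetical name
spec:
  request: $(base64 < server.csr | tr -d '\n')
  signerName: kubernetes.io/kube-apiserver-client
  usages:
  - client auth
EOF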
4.2.4 - Manage Memory, CPU, and API Resources
4.2.4.1 - Configure Default Memory Requests and Limits for a Namespace
This page shows how to configure default memory requests and limits for a namespace.
A Kubernetes cluster can be divided into namespaces. If you create a Pod in a namespace that has a default memory limit, and any container in that Pod does not specify its own memory limit, then the control plane assigns the default memory limit to that container.
Kubernetes assigns a default memory request under certain conditions that are explained later in this topic.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
You must have access to create namespaces in your cluster.
Each node in your cluster must have at least 2 GiB of memory.
Create a namespace
Create a namespace so that the resources you create in this exercise are isolated from the rest of your cluster.
kubectl create namespace default-mem-example
Create a LimitRange and a Pod
Here's a manifest for an example LimitRange. The manifest specifies a default memory request and a default memory limit.
apiVersion: v1
kind: LimitRange
metadata:
name: mem-limit-range
spec:
limits:
- default:
memory: 512Mi
defaultRequest:
memory: 256Mi
type: Container
Create the LimitRange in the default-mem-example namespace:
kubectl apply -f https://k8s.io/examples/admin/resource/memory-defaults.yaml --namespace=default-mem-example
Now if you create a Pod in the default-mem-example namespace, and any container within that Pod does not specify its own values for memory request and memory limit, then the control plane applies default values: a memory request of 256MiB and a memory limit of 512MiB.
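As an optional check (not part of the steps in this task), you can confirm the defaults that the LimitRange records:
kubectl describe limitrange mem-limit-range --namespace=default-mem-example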
Here's an example manifest for a Pod that has one container. The container does not specify a memory request and limit.
apiVersion: v1
kind: Pod
metadata:
name: default-mem-demo
spec:
containers:
- name: default-mem-demo-ctr
image: nginx
Create the Pod.
kubectl apply -f https://k8s.io/examples/admin/resource/memory-defaults-pod.yaml --namespace=default-mem-example
View detailed information about the Pod:
kubectl get pod default-mem-demo --output=yaml --namespace=default-mem-example
The output shows that the Pod's container has a memory request of 256 MiB and a memory limit of 512 MiB. These are the default values specified by the LimitRange.
containers:
- image: nginx
imagePullPolicy: Always
name: default-mem-demo-ctr
resources:
limits:
memory: 512Mi
requests:
memory: 256Mi
Delete your Pod:
kubectl delete pod default-mem-demo --namespace=default-mem-example
What if you specify a container's limit, but not its request?
Here's a manifest for a Pod that has one container. The container specifies a memory limit, but not a request:
apiVersion: v1
kind: Pod
metadata:
name: default-mem-demo-2
spec:
containers:
- name: default-mem-demo-2-ctr
image: nginx
resources:
limits:
memory: "1Gi"
Create the Pod:
kubectl apply -f https://k8s.io/examples/admin/resource/memory-defaults-pod-2.yaml --namespace=default-mem-example
View detailed information about the Pod:
kubectl get pod default-mem-demo-2 --output=yaml --namespace=default-mem-example
The output shows that the container's memory request is set to match its memory limit. Notice that the container was not assigned the default memory request value of 256Mi.
resources:
limits:
memory: 1Gi
requests:
memory: 1Gi
What if you specify a container's request, but not its limit?
Here's a manifest for a Pod that has one container. The container specifies a memory request, but not a limit:
apiVersion: v1
kind: Pod
metadata:
name: default-mem-demo-3
spec:
containers:
- name: default-mem-demo-3-ctr
image: nginx
resources:
requests:
memory: "128Mi"
Create the Pod:
kubectl apply -f https://k8s.io/examples/admin/resource/memory-defaults-pod-3.yaml --namespace=default-mem-example
View the Pod's specification:
kubectl get pod default-mem-demo-3 --output=yaml --namespace=default-mem-example
The output shows that the container's memory request is set to the value specified in the container's manifest. The container is limited to use no more than 512MiB of memory, which matches the default memory limit for the namespace.
resources:
limits:
memory: 512Mi
requests:
memory: 128Mi
Motivation for default memory limits and requests
If your namespace has a memory resource quota configured, it is helpful to have a default value in place for memory limit. Here are two of the restrictions that a resource quota imposes on a namespace:
- For every Pod that runs in the namespace, the Pod and each of its containers must have a memory limit. (If you specify a memory limit for every container in a Pod, Kubernetes can infer the Pod-level memory limit by adding up the limits for its containers).
- Memory limits apply a resource reservation on the node where the Pod in question is scheduled. The total amount of memory reserved for all Pods in the namespace must not exceed a specified limit.
- The total amount of memory actually used by all Pods in the namespace must also not exceed a specified limit.
When you add a LimitRange:
If any container in a Pod in that namespace does not specify its own memory limit, the control plane applies the default memory limit to that container, and the Pod can be allowed to run in a namespace that is restricted by a memory ResourceQuota.
Clean up
Delete your namespace:
kubectl delete namespace default-mem-example
What's next
For cluster administrators
-
Configure Minimum and Maximum Memory Constraints for a Namespace
-
Configure Minimum and Maximum CPU Constraints for a Namespace
For app developers
4.2.4.2 - Configure Default CPU Requests and Limits for a Namespace
This page shows how to configure default CPU requests and limits for a namespace.
A Kubernetes cluster can be divided into namespaces. If you create a Pod within a namespace that has a default CPU limit, and any container in that Pod does not specify its own CPU limit, then the control plane assigns the default CPU limit to that container.
Kubernetes assigns a default CPU request, but only under certain conditions that are explained later in this page.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
You must have access to create namespaces in your cluster.
If you're not already familiar with what Kubernetes means by 1.0 CPU, read meaning of CPU.
Create a namespace
Create a namespace so that the resources you create in this exercise are isolated from the rest of your cluster.
kubectl create namespace default-cpu-example
Create a LimitRange and a Pod
Here's a manifest for an example LimitRange. The manifest specifies a default CPU request and a default CPU limit.
apiVersion: v1
kind: LimitRange
metadata:
name: cpu-limit-range
spec:
limits:
- default:
cpu: 1
defaultRequest:
cpu: 0.5
type: Container
Create the LimitRange in the default-cpu-example namespace:
kubectl apply -f https://k8s.io/examples/admin/resource/cpu-defaults.yaml --namespace=default-cpu-example
Now if you create a Pod in the default-cpu-example namespace, and any container in that Pod does not specify its own values for CPU request and CPU limit, then the control plane applies default values: a CPU request of 0.5 and a default CPU limit of 1.
Here's a manifest for a Pod that has one container. The container does not specify a CPU request and limit.
apiVersion: v1
kind: Pod
metadata:
name: default-cpu-demo
spec:
containers:
- name: default-cpu-demo-ctr
image: nginx
Create the Pod.
kubectl apply -f https://k8s.io/examples/admin/resource/cpu-defaults-pod.yaml --namespace=default-cpu-example
View the Pod's specification:
kubectl get pod default-cpu-demo --output=yaml --namespace=default-cpu-example
The output shows that the Pod's only container has a CPU request of 500m cpu (which you can read as “500 millicpu”), and a CPU limit of 1 cpu.
These are the default values specified by the LimitRange.
containers:
- image: nginx
imagePullPolicy: Always
name: default-cpu-demo-ctr
resources:
limits:
cpu: "1"
requests:
cpu: 500m
What if you specify a container's limit, but not its request?
Here's a manifest for a Pod that has one container. The container specifies a CPU limit, but not a request:
apiVersion: v1
kind: Pod
metadata:
name: default-cpu-demo-2
spec:
containers:
- name: default-cpu-demo-2-ctr
image: nginx
resources:
limits:
cpu: "1"
Create the Pod:
kubectl apply -f https://k8s.io/examples/admin/resource/cpu-defaults-pod-2.yaml --namespace=default-cpu-example
View the specification of the Pod that you created:
kubectl get pod default-cpu-demo-2 --output=yaml --namespace=default-cpu-example
The output shows that the container's CPU request is set to match its CPU limit. Notice that the container was not assigned the default CPU request value of 0.5 cpu:
resources:
limits:
cpu: "1"
requests:
cpu: "1"
What if you specify a container's request, but not its limit?
Here's an example manifest for a Pod that has one container. The container specifies a CPU request, but not a limit:
apiVersion: v1
kind: Pod
metadata:
name: default-cpu-demo-3
spec:
containers:
- name: default-cpu-demo-3-ctr
image: nginx
resources:
requests:
cpu: "0.75"
Create the Pod:
kubectl apply -f https://k8s.io/examples/admin/resource/cpu-defaults-pod-3.yaml --namespace=default-cpu-example
View the specification of the Pod that you created:
kubectl get pod default-cpu-demo-3 --output=yaml --namespace=default-cpu-example
The output shows that the container's CPU request is set to the value you specified at the time you created the Pod (in other words: it matches the manifest). However, the same container's CPU limit is set to 1 cpu, which is the default CPU limit for that namespace.
resources:
limits:
cpu: "1"
requests:
cpu: 750m
Motivation for default CPU limits and requests
If your namespace has a CPU resource quota configured, it is helpful to have a default value in place for CPU limit. Here are two of the restrictions that a CPU resource quota imposes on a namespace:
- For every Pod that runs in the namespace, each of its containers must have a CPU limit.
- CPU limits apply a resource reservation on the node where the Pod in question is scheduled. The total amount of CPU that is reserved for use by all Pods in the namespace must not exceed a specified limit.
When you add a LimitRange:
If any container in a Pod in that namespace does not specify its own CPU limit, the control plane applies the default CPU limit to that container, and the Pod can be allowed to run in a namespace that is restricted by a CPU ResourceQuota.
Clean up
Delete your namespace:
kubectl delete namespace default-cpu-example
What's next
For cluster administrators
-
Configure Default Memory Requests and Limits for a Namespace
-
Configure Minimum and Maximum Memory Constraints for a Namespace
-
Configure Minimum and Maximum CPU Constraints for a Namespace
For app developers
4.2.4.3 - Configure Minimum and Maximum Memory Constraints for a Namespace
This page shows how to set minimum and maximum values for memory used by containers running in a namespace. You specify minimum and maximum memory values in a LimitRange object. If a Pod does not meet the constraints imposed by the LimitRange, it cannot be created in the namespace.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
You must have access to create namespaces in your cluster.
Each node in your cluster must have at least 1 GiB of memory available for Pods.
Create a namespace
Create a namespace so that the resources you create in this exercise are isolated from the rest of your cluster.
kubectl create namespace constraints-mem-example
Create a LimitRange and a Pod
Here's an example manifest for a LimitRange:
apiVersion: v1
kind: LimitRange
metadata:
name: mem-min-max-demo-lr
spec:
limits:
- max:
memory: 1Gi
min:
memory: 500Mi
type: Container
Create the LimitRange:
kubectl apply -f https://k8s.io/examples/admin/resource/memory-constraints.yaml --namespace=constraints-mem-example
View detailed information about the LimitRange:
kubectl get limitrange mem-min-max-demo-lr --namespace=constraints-mem-example --output=yaml
The output shows the minimum and maximum memory constraints as expected. But notice that even though you didn't specify default values in the configuration file for the LimitRange, they were created automatically.
limits:
- default:
memory: 1Gi
defaultRequest:
memory: 1Gi
max:
memory: 1Gi
min:
memory: 500Mi
type: Container
Now whenever you define a Pod within the constraints-mem-example namespace, Kubernetes performs these steps:
-
If any container in that Pod does not specify its own memory request and limit, assign the default memory request and limit to that container.
-
Verify that every container in that Pod requests at least 500 MiB of memory.
-
Verify that every container in that Pod requests no more than 1024 MiB (1 GiB) of memory.
Here's a manifest for a Pod that has one container. Within the Pod spec, the sole container specifies a memory request of 600 MiB and a memory limit of 800 MiB. These satisfy the minimum and maximum memory constraints imposed by the LimitRange.
apiVersion: v1
kind: Pod
metadata:
name: constraints-mem-demo
spec:
containers:
- name: constraints-mem-demo-ctr
image: nginx
resources:
limits:
memory: "800Mi"
requests:
memory: "600Mi"
Create the Pod:
kubectl apply -f https://k8s.io/examples/admin/resource/memory-constraints-pod.yaml --namespace=constraints-mem-example
Verify that the Pod is running and that its container is healthy:
kubectl get pod constraints-mem-demo --namespace=constraints-mem-example
View detailed information about the Pod:
kubectl get pod constraints-mem-demo --output=yaml --namespace=constraints-mem-example
The output shows that the container within that Pod has a memory request of 600 MiB and a memory limit of 800 MiB. These satisfy the constraints imposed by the LimitRange for this namespace:
resources:
limits:
memory: 800Mi
requests:
memory: 600Mi
Delete your Pod:
kubectl delete pod constraints-mem-demo --namespace=constraints-mem-example
Attempt to create a Pod that exceeds the maximum memory constraint
Here's a manifest for a Pod that has one container. The container specifies a memory request of 800 MiB and a memory limit of 1.5 GiB.
apiVersion: v1
kind: Pod
metadata:
name: constraints-mem-demo-2
spec:
containers:
- name: constraints-mem-demo-2-ctr
image: nginx
resources:
limits:
memory: "1.5Gi"
requests:
memory: "800Mi"
Attempt to create the Pod:
kubectl apply -f https://k8s.io/examples/admin/resource/memory-constraints-pod-2.yaml --namespace=constraints-mem-example
The output shows that the Pod does not get created, because it defines a container that requests more memory than is allowed:
Error from server (Forbidden): error when creating "examples/admin/resource/memory-constraints-pod-2.yaml":
pods "constraints-mem-demo-2" is forbidden: maximum memory usage per Container is 1Gi, but limit is 1536Mi.
Attempt to create a Pod that does not meet the minimum memory request
Here's a manifest for a Pod that has one container. That container specifies a memory request of 100 MiB and a memory limit of 800 MiB.
apiVersion: v1
kind: Pod
metadata:
name: constraints-mem-demo-3
spec:
containers:
- name: constraints-mem-demo-3-ctr
image: nginx
resources:
limits:
memory: "800Mi"
requests:
memory: "100Mi"
Attempt to create the Pod:
kubectl apply -f https://k8s.io/examples/admin/resource/memory-constraints-pod-3.yaml --namespace=constraints-mem-example
The output shows that the Pod does not get created, because it defines a container that requests less memory than the enforced minimum:
Error from server (Forbidden): error when creating "examples/admin/resource/memory-constraints-pod-3.yaml":
pods "constraints-mem-demo-3" is forbidden: minimum memory usage per Container is 500Mi, but request is 100Mi.
Create a Pod that does not specify any memory request or limit
Here's a manifest for a Pod that has one container. The container does not specify a memory request, and it does not specify a memory limit.
apiVersion: v1
kind: Pod
metadata:
name: constraints-mem-demo-4
spec:
containers:
- name: constraints-mem-demo-4-ctr
image: nginx
Create the Pod:
kubectl apply -f https://k8s.io/examples/admin/resource/memory-constraints-pod-4.yaml --namespace=constraints-mem-example
View detailed information about the Pod:
kubectl get pod constraints-mem-demo-4 --namespace=constraints-mem-example --output=yaml
The output shows that the Pod's only container has a memory request of 1 GiB and a memory limit of 1 GiB. How did that container get those values?
resources:
limits:
memory: 1Gi
requests:
memory: 1Gi
Because your Pod did not define any memory request and limit for that container, the cluster applied a default memory request and limit from the LimitRange.
This means that the definition of that Pod shows those values. You can check it using kubectl describe:
# Look for the "Requests:" section of the output
kubectl describe pod constraints-mem-demo-4 --namespace=constraints-mem-example
At this point, your Pod might be running or it might not be running. Recall that a prerequisite for this task is that your Nodes have at least 1 GiB of memory. If each of your Nodes has only 1 GiB of memory, then there is not enough allocatable memory on any Node to accommodate a memory request of 1 GiB. If you happen to be using Nodes with 2 GiB of memory, then you probably have enough space to accommodate the 1 GiB request.
Delete your Pod:
kubectl delete pod constraints-mem-demo-4 --namespace=constraints-mem-example
Enforcement of minimum and maximum memory constraints
The maximum and minimum memory constraints imposed on a namespace by a LimitRange are enforced only when a Pod is created or updated. If you change the LimitRange, it does not affect Pods that were created previously.
Motivation for minimum and maximum memory constraints
As a cluster administrator, you might want to impose restrictions on the amount of memory that Pods can use. For example:
-
Each Node in a cluster has 2 GiB of memory. You do not want to accept any Pod that requests more than 2 GiB of memory, because no Node in the cluster can support the request.
-
A cluster is shared by your production and development departments. You want to allow production workloads to consume up to 8 GiB of memory, but you want development workloads to be limited to 512 MiB. You create separate namespaces for production and development, and you apply memory constraints to each namespace.
Clean up
Delete your namespace:
kubectl delete namespace constraints-mem-example
What's next
For cluster administrators
-
Configure Default Memory Requests and Limits for a Namespace
-
Configure Minimum and Maximum CPU Constraints for a Namespace
For app developers
4.2.4.4 - Configure Minimum and Maximum CPU Constraints for a Namespace
This page shows how to set minimum and maximum values for the CPU resources used by containers and Pods in a namespace. You specify minimum and maximum CPU values in a LimitRange object. If a Pod does not meet the constraints imposed by the LimitRange, it cannot be created in the namespace.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
You must have access to create namespaces in your cluster.
Your cluster must have at least 1.0 CPU available for use to run the task examples. See meaning of CPU to learn what Kubernetes means by “1 CPU”.
Create a namespace
Create a namespace so that the resources you create in this exercise are isolated from the rest of your cluster.
kubectl create namespace constraints-cpu-example
Create a LimitRange and a Pod
Here's an example manifest for a LimitRange:
apiVersion: v1
kind: LimitRange
metadata:
name: cpu-min-max-demo-lr
spec:
limits:
- max:
cpu: "800m"
min:
cpu: "200m"
type: Container
Create the LimitRange:
kubectl apply -f https://k8s.io/examples/admin/resource/cpu-constraints.yaml --namespace=constraints-cpu-example
View detailed information about the LimitRange:
kubectl get limitrange cpu-min-max-demo-lr --output=yaml --namespace=constraints-cpu-example
The output shows the minimum and maximum CPU constraints as expected. But notice that even though you didn't specify default values in the configuration file for the LimitRange, they were created automatically.
limits:
- default:
cpu: 800m
defaultRequest:
cpu: 800m
max:
cpu: 800m
min:
cpu: 200m
type: Container
Now whenever you create a Pod in the constraints-cpu-example namespace (or some other client of the Kubernetes API creates an equivalent Pod), Kubernetes performs these steps:
-
If any container in that Pod does not specify its own CPU request and limit, the control plane assigns the default CPU request and limit to that container.
-
Verify that every container in that Pod specifies a CPU request that is greater than or equal to 200 millicpu.
-
Verify that every container in that Pod specifies a CPU limit that is less than or equal to 800 millicpu.
In a LimitRange object, you can specify limits on huge-pages or GPUs as well. However, when both default and defaultRequest are specified for these resources, the two values must be the same.
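For instance, a minimal sketch of such a LimitRange (the object name and page size are illustrative; note that default and defaultRequest carry the same value, as required):
apiVersion: v1
kind: LimitRange
metadata:
  name: hugepage-limit-range   # hypothetical name
spec:
  limits:
  - default:
      hugepages-2Mi: 2Mi
    defaultRequest:
      hugepages-2Mi: 2Mi   # must equal the default for this resource
    type: Container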
Here's a manifest for a Pod that has one container. The container manifest specifies a CPU request of 500 millicpu and a CPU limit of 800 millicpu. These satisfy the minimum and maximum CPU constraints imposed by the LimitRange.
apiVersion: v1
kind: Pod
metadata:
name: constraints-cpu-demo
spec:
containers:
- name: constraints-cpu-demo-ctr
image: nginx
resources:
limits:
cpu: "800m"
requests:
cpu: "500m"
Create the Pod:
kubectl apply -f https://k8s.io/examples/admin/resource/cpu-constraints-pod.yaml --namespace=constraints-cpu-example
Verify that the Pod is running and that its container is healthy:
kubectl get pod constraints-cpu-demo --namespace=constraints-cpu-example
View detailed information about the Pod:
kubectl get pod constraints-cpu-demo --output=yaml --namespace=constraints-cpu-example
The output shows that the Pod's only container has a CPU request of 500 millicpu and CPU limit of 800 millicpu. These satisfy the constraints imposed by the LimitRange.
resources:
limits:
cpu: 800m
requests:
cpu: 500m
Delete the Pod
kubectl delete pod constraints-cpu-demo --namespace=constraints-cpu-example
Attempt to create a Pod that exceeds the maximum CPU constraint
Here's a manifest for a Pod that has one container. The container specifies a CPU request of 500 millicpu and a CPU limit of 1.5 cpu.
apiVersion: v1
kind: Pod
metadata:
name: constraints-cpu-demo-2
spec:
containers:
- name: constraints-cpu-demo-2-ctr
image: nginx
resources:
limits:
cpu: "1.5"
requests:
cpu: "500m"
Attempt to create the Pod:
kubectl apply -f https://k8s.io/examples/admin/resource/cpu-constraints-pod-2.yaml --namespace=constraints-cpu-example
The output shows that the Pod does not get created, because it defines an unacceptable container. That container is not acceptable because it specifies a CPU limit that is too large:
Error from server (Forbidden): error when creating "examples/admin/resource/cpu-constraints-pod-2.yaml":
pods "constraints-cpu-demo-2" is forbidden: maximum cpu usage per Container is 800m, but limit is 1500m.
Attempt to create a Pod that does not meet the minimum CPU request
Here's a manifest for a Pod that has one container. The container specifies a CPU request of 100 millicpu and a CPU limit of 800 millicpu.
apiVersion: v1
kind: Pod
metadata:
name: constraints-cpu-demo-3
spec:
containers:
- name: constraints-cpu-demo-3-ctr
image: nginx
resources:
limits:
cpu: "800m"
requests:
cpu: "100m"
Attempt to create the Pod:
kubectl apply -f https://k8s.io/examples/admin/resource/cpu-constraints-pod-3.yaml --namespace=constraints-cpu-example
The output shows that the Pod does not get created, because it defines an unacceptable container. That container is not acceptable because it specifies a CPU request that is lower than the enforced minimum:
Error from server (Forbidden): error when creating "examples/admin/resource/cpu-constraints-pod-3.yaml":
pods "constraints-cpu-demo-3" is forbidden: minimum cpu usage per Container is 200m, but request is 100m.
Create a Pod that does not specify any CPU request or limit
Here's a manifest for a Pod that has one container. The container does not specify a CPU request, nor does it specify a CPU limit.
apiVersion: v1
kind: Pod
metadata:
name: constraints-cpu-demo-4
spec:
containers:
- name: constraints-cpu-demo-4-ctr
image: vish/stress
Create the Pod:
kubectl apply -f https://k8s.io/examples/admin/resource/cpu-constraints-pod-4.yaml --namespace=constraints-cpu-example
View detailed information about the Pod:
kubectl get pod constraints-cpu-demo-4 --namespace=constraints-cpu-example --output=yaml
The output shows that the Pod's single container has a CPU request of 800 millicpu and a CPU limit of 800 millicpu. How did that container get those values?
resources:
limits:
cpu: 800m
requests:
cpu: 800m
Because that container did not specify its own CPU request and limit, the control plane applied the default CPU request and limit from the LimitRange for this namespace.
At this point, your Pod might be running or it might not be running. Recall that a prerequisite for this task is that your cluster must have at least 1 CPU available for use. If each of your Nodes has only 1 CPU, then there might not be enough allocatable CPU on any Node to accommodate a request of 800 millicpu. If you happen to be using Nodes with 2 CPU, then you probably have enough CPU to accommodate the 800 millicpu request.
Delete your Pod:
kubectl delete pod constraints-cpu-demo-4 --namespace=constraints-cpu-example
Enforcement of minimum and maximum CPU constraints
The maximum and minimum CPU constraints imposed on a namespace by a LimitRange are enforced only when a Pod is created or updated. If you change the LimitRange, it does not affect Pods that were created previously.
Motivation for minimum and maximum CPU constraints
As a cluster administrator, you might want to impose restrictions on the CPU resources that Pods can use. For example:
-
Each Node in a cluster has 2 CPU. You do not want to accept any Pod that requests more than 2 CPU, because no Node in the cluster can support the request.
-
A cluster is shared by your production and development departments. You want to allow production workloads to consume up to 3 CPU, but you want development workloads to be limited to 1 CPU. You create separate namespaces for production and development, and you apply CPU constraints to each namespace.
Clean up
Delete your namespace:
kubectl delete namespace constraints-cpu-example
What's next
For cluster administrators
-
Configure Default Memory Requests and Limits for a Namespace
-
Configure Minimum and Maximum Memory Constraints for a Namespace
For app developers
4.2.4.5 - Configure Memory and CPU Quotas for a Namespace
This page shows how to set quotas for the total amount of memory and CPU that can be used by all Pods running in a namespace. You specify quotas in a ResourceQuota object.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
You must have access to create namespaces in your cluster.
Each node in your cluster must have at least 1 GiB of memory.
Create a namespace
Create a namespace so that the resources you create in this exercise are isolated from the rest of your cluster.
kubectl create namespace quota-mem-cpu-example
Create a ResourceQuota
Here is a manifest for an example ResourceQuota:
apiVersion: v1
kind: ResourceQuota
metadata:
name: mem-cpu-demo
spec:
hard:
requests.cpu: "1"
requests.memory: 1Gi
limits.cpu: "2"
limits.memory: 2Gi
Create the ResourceQuota:
kubectl apply -f https://k8s.io/examples/admin/resource/quota-mem-cpu.yaml --namespace=quota-mem-cpu-example
View detailed information about the ResourceQuota:
kubectl get resourcequota mem-cpu-demo --namespace=quota-mem-cpu-example --output=yaml
The ResourceQuota places these requirements on the quota-mem-cpu-example namespace:
- For every Pod in the namespace, each container must have a memory request, memory limit, cpu request, and cpu limit.
- The memory request total for all Pods in that namespace must not exceed 1 GiB.
- The memory limit total for all Pods in that namespace must not exceed 2 GiB.
- The CPU request total for all Pods in that namespace must not exceed 1 cpu.
- The CPU limit total for all Pods in that namespace must not exceed 2 cpu.
See meaning of CPU to learn what Kubernetes means by “1 CPU”.
Create a Pod
Here is a manifest for an example Pod:
apiVersion: v1
kind: Pod
metadata:
name: quota-mem-cpu-demo
spec:
containers:
- name: quota-mem-cpu-demo-ctr
image: nginx
resources:
limits:
memory: "800Mi"
cpu: "800m"
requests:
memory: "600Mi"
cpu: "400m"
Create the Pod:
kubectl apply -f https://k8s.io/examples/admin/resource/quota-mem-cpu-pod.yaml --namespace=quota-mem-cpu-example
Verify that the Pod is running and that its (only) container is healthy:
kubectl get pod quota-mem-cpu-demo --namespace=quota-mem-cpu-example
Once again, view detailed information about the ResourceQuota:
kubectl get resourcequota mem-cpu-demo --namespace=quota-mem-cpu-example --output=yaml
The output shows the quota along with how much of the quota has been used. You can see that the memory and CPU requests and limits for your Pod do not exceed the quota.
status:
hard:
limits.cpu: "2"
limits.memory: 2Gi
requests.cpu: "1"
requests.memory: 1Gi
used:
limits.cpu: 800m
limits.memory: 800Mi
requests.cpu: 400m
requests.memory: 600Mi
If you have the jq tool, you can also query (using JSONPath) for just the used values, and pretty-print that part of the output. For example:
kubectl get resourcequota mem-cpu-demo --namespace=quota-mem-cpu-example -o jsonpath='{ .status.used }' | jq .
Attempt to create a second Pod
Here is a manifest for a second Pod:
apiVersion: v1
kind: Pod
metadata:
name: quota-mem-cpu-demo-2
spec:
containers:
- name: quota-mem-cpu-demo-2-ctr
image: redis
resources:
limits:
memory: "1Gi"
cpu: "800m"
requests:
memory: "700Mi"
cpu: "400m"
In the manifest, you can see that the Pod has a memory request of 700 MiB. Notice that the sum of the used memory request and this new memory request exceeds the memory request quota: 600 MiB + 700 MiB > 1 GiB.
Attempt to create the Pod:
kubectl apply -f https://k8s.io/examples/admin/resource/quota-mem-cpu-pod-2.yaml --namespace=quota-mem-cpu-example
The second Pod does not get created. The output shows that creating the second Pod would cause the memory request total to exceed the memory request quota.
Error from server (Forbidden): error when creating "examples/admin/resource/quota-mem-cpu-pod-2.yaml":
pods "quota-mem-cpu-demo-2" is forbidden: exceeded quota: mem-cpu-demo,
requested: requests.memory=700Mi,used: requests.memory=600Mi, limited: requests.memory=1Gi
Discussion
As you have seen in this exercise, you can use a ResourceQuota to restrict the memory request total for all Pods running in a namespace. You can also restrict the totals for memory limit, cpu request, and cpu limit.
Instead of managing total resource use within a namespace, you might want to restrict individual Pods, or the containers in those Pods. To achieve that kind of limiting, use a LimitRange.
Clean up
Delete your namespace:
kubectl delete namespace quota-mem-cpu-example
What's next
For cluster administrators
-
Configure Default Memory Requests and Limits for a Namespace
-
Configure Minimum and Maximum Memory Constraints for a Namespace
-
Configure Minimum and Maximum CPU Constraints for a Namespace
For app developers
4.2.4.6 - Configure a Pod Quota for a Namespace
This page shows how to set a quota for the total number of Pods that can run in a Namespace. You specify quotas in a ResourceQuota object.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
You must have access to create namespaces in your cluster.
Create a namespace
Create a namespace so that the resources you create in this exercise are isolated from the rest of your cluster.
kubectl create namespace quota-pod-example
Create a ResourceQuota
Here is an example manifest for a ResourceQuota:
apiVersion: v1
kind: ResourceQuota
metadata:
name: pod-demo
spec:
hard:
pods: "2"
Create the ResourceQuota:
kubectl apply -f https://k8s.io/examples/admin/resource/quota-pod.yaml --namespace=quota-pod-example
View detailed information about the ResourceQuota:
kubectl get resourcequota pod-demo --namespace=quota-pod-example --output=yaml
The output shows that the namespace has a quota of two Pods, and that currently there are no Pods; that is, none of the quota is used.
spec:
hard:
pods: "2"
status:
hard:
pods: "2"
used:
pods: "0"
Here is an example manifest for a Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
name: pod-quota-demo
spec:
selector:
matchLabels:
purpose: quota-demo
replicas: 3
template:
metadata:
labels:
purpose: quota-demo
spec:
containers:
- name: pod-quota-demo
image: nginx
In that manifest, replicas: 3 tells Kubernetes to attempt to create three new Pods, all running the same application.
Create the Deployment:
kubectl apply -f https://k8s.io/examples/admin/resource/quota-pod-deployment.yaml --namespace=quota-pod-example
View detailed information about the Deployment:
kubectl get deployment pod-quota-demo --namespace=quota-pod-example --output=yaml
The output shows that even though the Deployment specifies three replicas, only two Pods were created because of the quota you defined earlier:
spec:
...
replicas: 3
...
status:
availableReplicas: 2
...
lastUpdateTime: 2021-04-02T20:57:05Z
message: 'unable to create pods: pods "pod-quota-demo-1650323038-" is forbidden:
exceeded quota: pod-demo, requested: pods=1, used: pods=2, limited: pods=2'
Choice of resource
In this task you have defined a ResourceQuota that limited the total number of Pods, but you could also limit the total number of other kinds of objects. For example, you might decide to limit how many CronJobs can live in a single namespace.
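As a sketch (the quota name and count are illustrative), an object-count quota for CronJobs could look like this:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: cronjob-count-demo   # hypothetical name
spec:
  hard:
    count/cronjobs.batch: "2"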
Clean up
Delete your namespace:
kubectl delete namespace quota-pod-example
What's next
For cluster administrators
-
Configure Default Memory Requests and Limits for a Namespace
-
Configure Minimum and Maximum Memory Constraints for a Namespace
-
Configure Minimum and Maximum CPU Constraints for a Namespace
For app developers
4.2.5 - Install a Network Policy Provider
4.2.5.1 - Use Antrea for NetworkPolicy
This page shows how to install and use the Antrea CNI plugin on Kubernetes. For background on Project Antrea, read the Introduction to Antrea.
Before you begin
You need to have a Kubernetes cluster. Follow the kubeadm getting started guide to bootstrap one.
Deploying Antrea with kubeadm
Follow the Getting Started guide to deploy Antrea for kubeadm.
What's next
Once your cluster is running, you can follow the Declare Network Policy to try out Kubernetes NetworkPolicy.
4.2.5.2 - Use Calico for NetworkPolicy
This page shows a couple of quick ways to create a Calico cluster on Kubernetes.
Before you begin
Decide whether you want to deploy a cloud or local cluster.
Creating a Calico cluster with Google Kubernetes Engine (GKE)
Prerequisite: gcloud.
-
To launch a GKE cluster with Calico, include the --enable-network-policy flag.
Syntax
gcloud container clusters create [CLUSTER_NAME] --enable-network-policy
Example
gcloud container clusters create my-calico-cluster --enable-network-policy
-
To verify the deployment, use the following command.
kubectl get pods --namespace=kube-system
The Calico pods begin with calico. Check to make sure each one has a status of Running.
Creating a local Calico cluster with kubeadm
To get a local single-host Calico cluster in fifteen minutes using kubeadm, refer to the Calico Quickstart.
What's next
Once your cluster is running, you can follow the Declare Network Policy to try out Kubernetes NetworkPolicy.
4.2.5.3 - Use Cilium for NetworkPolicy
This page shows how to use Cilium for NetworkPolicy.
For background on Cilium, read the Introduction to Cilium.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
To check the version, enter kubectl version.
Deploying Cilium on Minikube for Basic Testing
To get familiar with Cilium easily you can follow the Cilium Kubernetes Getting Started Guide to perform a basic DaemonSet installation of Cilium in minikube.
To start minikube, the minimal version required is v1.5.2 or later; run it with the following arguments:
minikube version
minikube version: v1.5.2
minikube start --network-plugin=cni
For minikube you can install Cilium using its CLI tool. Cilium will automatically detect the cluster configuration and will install the appropriate components for a successful installation:
curl -LO https://github.com/cilium/cilium-cli/releases/latest/download/cilium-linux-amd64.tar.gz
sudo tar xzvfC cilium-linux-amd64.tar.gz /usr/local/bin
rm cilium-linux-amd64.tar.gz
cilium install
🔮 Auto-detected Kubernetes kind: minikube
✨ Running "minikube" validation checks
✅ Detected minikube version "1.20.0"
ℹ️ Cilium version not set, using default version "v1.10.0"
🔮 Auto-detected cluster name: minikube
🔮 Auto-detected IPAM mode: cluster-pool
🔮 Auto-detected datapath mode: tunnel
🔑 Generating CA...
2021/05/27 02:54:44 [INFO] generate received request
2021/05/27 02:54:44 [INFO] received CSR
2021/05/27 02:54:44 [INFO] generating key: ecdsa-256
2021/05/27 02:54:44 [INFO] encoded CSR
2021/05/27 02:54:44 [INFO] signed certificate with serial number 48713764918856674401136471229482703021230538642
🔑 Generating certificates for Hubble...
2021/05/27 02:54:44 [INFO] generate received request
2021/05/27 02:54:44 [INFO] received CSR
2021/05/27 02:54:44 [INFO] generating key: ecdsa-256
2021/05/27 02:54:44 [INFO] encoded CSR
2021/05/27 02:54:44 [INFO] signed certificate with serial number 3514109734025784310086389188421560613333279574
🚀 Creating Service accounts...
🚀 Creating Cluster roles...
🚀 Creating ConfigMap...
🚀 Creating Agent DaemonSet...
🚀 Creating Operator Deployment...
⌛ Waiting for Cilium to be installed...
The remainder of the Getting Started Guide explains how to enforce both L3/L4 (i.e., IP address + port) security policies and L7 (e.g., HTTP) security policies using an example application.
Deploying Cilium for Production Use
For detailed instructions around deploying Cilium for production, see the Cilium Kubernetes Installation Guide. This documentation includes detailed requirements, instructions, and example production DaemonSet files.
Understanding Cilium components
Deploying a cluster with Cilium adds Pods to the kube-system namespace. To see this list of Pods run:
kubectl get pods --namespace=kube-system -l k8s-app=cilium
You'll see a list of Pods similar to this:
NAME READY STATUS RESTARTS AGE
cilium-kkdhz 1/1 Running 0 3m23s
...
A cilium Pod runs on each node in your cluster and enforces network policy on the traffic to/from Pods on that node using Linux BPF.
What's next
Once your cluster is running, you can follow the Declare Network Policy to try out Kubernetes NetworkPolicy with Cilium. Have fun, and if you have questions, contact us using the Cilium Slack Channel.
4.2.5.4 - Use Kube-router for NetworkPolicy
This page shows how to use Kube-router for NetworkPolicy.
Before you begin
You need to have a Kubernetes cluster running. If you do not already have a cluster, you can create one by using any of the cluster installers such as Kops, Bootkube, or kubeadm.
Installing Kube-router addon
The Kube-router addon comes with a Network Policy Controller that watches the Kubernetes API server for NetworkPolicy and Pod updates, and configures iptables rules and ipsets to allow or block traffic as directed by the policies. Please follow the trying Kube-router with cluster installers guide to install the Kube-router addon.
What's next
Once you have installed the Kube-router addon, you can follow the Declare Network Policy to try out Kubernetes NetworkPolicy.
4.2.5.5 - Romana for NetworkPolicy
This page shows how to use Romana for NetworkPolicy.
Before you begin
Complete steps 1, 2, and 3 of the kubeadm getting started guide.
Installing Romana with kubeadm
Follow the containerized installation guide for kubeadm.
Applying network policies
To apply network policies use one of the following:
- Romana network policies.
- The NetworkPolicy API.
What's next
Once you have installed Romana, you can follow the Declare Network Policy to try out Kubernetes NetworkPolicy.
4.2.5.6 - Weave Net for NetworkPolicy
This page shows how to use Weave Net for NetworkPolicy.
Before you begin
You need to have a Kubernetes cluster. Follow the kubeadm getting started guide to bootstrap one.
Install the Weave Net addon
Follow the Integrating Kubernetes via the Addon guide.
The Weave Net addon for Kubernetes comes with a Network Policy Controller that automatically monitors Kubernetes for any NetworkPolicy annotations on all namespaces and configures iptables rules to allow or block traffic as directed by the policies.
Test the installation
Verify that Weave Net works.
Enter the following command:
kubectl get pods -n kube-system -o wide
The output is similar to this:
NAME READY STATUS RESTARTS AGE IP NODE
weave-net-1t1qg 2/2 Running 0 9d 192.168.2.10 worknode3
weave-net-231d7 2/2 Running 1 7d 10.2.0.17 worknodegpu
weave-net-7nmwt 2/2 Running 3 9d 192.168.2.131 masternode
weave-net-pmw8w 2/2 Running 0 9d 192.168.2.216 worknode2
Each Node has a weave Pod, and all Pods are Running and 2/2 READY. (2/2 means that each Pod has weave and weave-npc.)
What's next
Once you have installed the Weave Net addon, you can follow the Declare Network Policy to try out Kubernetes NetworkPolicy. If you have any questions, contact us at #weave-community on Slack or Weave User Group.
4.2.6 - Access Clusters Using the Kubernetes API
This page shows how to access clusters using the Kubernetes API.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
To check the version, enter kubectl version.
Accessing the Kubernetes API
Accessing for the first time with kubectl
When accessing the Kubernetes API for the first time, use the Kubernetes command-line tool, kubectl.
To access a cluster, you need to know the location of the cluster and have credentials to access it. Typically, this is automatically set up when you work through a Getting started guide, or someone else set up the cluster and provided you with credentials and a location.
Check the location and credentials that kubectl knows about with this command:
kubectl config view
Many of the examples provide an introduction to using kubectl. Complete documentation is found in the kubectl manual.
Directly accessing the REST API
kubectl handles locating and authenticating to the API server. If you want to directly access the REST API with an http client like curl or wget, or a browser, there are multiple ways you can locate and authenticate against the API server:
- Run kubectl in proxy mode (recommended). This method is recommended, since it uses the stored apiserver location and verifies the identity of the API server using a self-signed cert. No man-in-the-middle (MITM) attack is possible using this method.
- Alternatively, you can provide the location and credentials directly to the http client. This works with client code that is confused by proxies. To protect against man in the middle attacks, you'll need to import a root cert into your browser.
Using the Go or Python client libraries provides access to kubectl in proxy mode.
Using kubectl proxy
The following command runs kubectl in a mode where it acts as a reverse proxy. It handles locating the API server and authenticating.
Run it like this:
kubectl proxy --port=8080 &
See kubectl proxy for more details.
Then you can explore the API with curl, wget, or a browser, like so:
curl http://localhost:8080/api/
The output is similar to this:
{
"versions": [
"v1"
],
"serverAddressByClientCIDRs": [
{
"clientCIDR": "0.0.0.0/0",
"serverAddress": "10.0.1.149:443"
}
]
}
Without kubectl proxy
It is possible to avoid using kubectl proxy by passing an authentication token directly to the API server, like this:
Using the grep/cut approach:
# Check all possible clusters, as your .KUBECONFIG may have multiple contexts:
kubectl config view -o jsonpath='{"Cluster name\tServer\n"}{range .clusters[*]}{.name}{"\t"}{.cluster.server}{"\n"}{end}'
# Select name of cluster you want to interact with from above output:
export CLUSTER_NAME="some_server_name"
# Point to the API server referring the cluster name
APISERVER=$(kubectl config view -o jsonpath="{.clusters[?(@.name==\"$CLUSTER_NAME\")].cluster.server}")
# Create a secret to hold a token for the default service account
kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
name: default-token
annotations:
kubernetes.io/service-account.name: default
type: kubernetes.io/service-account-token
EOF
# Wait for the token controller to populate the secret with a token:
while ! kubectl describe secret default-token | grep -E '^token' >/dev/null; do
echo "waiting for token..." >&2
sleep 1
done
# Get the token value
TOKEN=$(kubectl get secret default-token -o jsonpath='{.data.token}' | base64 --decode)
# Explore the API with TOKEN
curl -X GET $APISERVER/api --header "Authorization: Bearer $TOKEN" --insecure
The output is similar to this:
{
"kind": "APIVersions",
"versions": [
"v1"
],
"serverAddressByClientCIDRs": [
{
"clientCIDR": "0.0.0.0/0",
"serverAddress": "10.0.1.149:443"
}
]
}
The above example uses the --insecure flag. This leaves it subject to MITM attacks. When kubectl accesses the cluster it uses a stored root certificate and client certificates to access the server. (These are installed in the ~/.kube directory). Since cluster certificates are typically self-signed, it may take special configuration to get your http client to use the root certificate.
On some clusters, the API server does not require authentication; it may serve on localhost, or be protected by a firewall. There is not a standard for this. Controlling Access to the Kubernetes API describes how you can configure this as a cluster administrator.
Programmatic access to the API
Kubernetes officially supports client libraries for Go, Python, Java, dotnet, JavaScript, and Haskell. There are other client libraries that are provided and maintained by their authors, not the Kubernetes team. See client libraries for accessing the API from other languages and how they authenticate.
Go client
- To get the library, run the following command:
go get k8s.io/client-go@kubernetes-<kubernetes-version-number>
See https://github.com/kubernetes/client-go/releases to see which versions are supported.
- Write an application atop of the client-go clients. Note that import "k8s.io/client-go/kubernetes" is correct.
The Go client can use the same kubeconfig file as the kubectl CLI does to locate and authenticate to the API server. See this example:
package main
import (
"context"
"fmt"
"k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/client-go/kubernetes"
"k8s.io/client-go/tools/clientcmd"
)
func main() {
	// uses the current context in kubeconfig
	// path-to-kubeconfig -- for example, /root/.kube/config
	config, err := clientcmd.BuildConfigFromFlags("", "<path-to-kubeconfig>")
	if err != nil {
		panic(err.Error())
	}
	// creates the clientset
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err.Error())
	}
	// access the API to list pods
	pods, err := clientset.CoreV1().Pods("").List(context.TODO(), v1.ListOptions{})
	if err != nil {
		panic(err.Error())
	}
	fmt.Printf("There are %d pods in the cluster\n", len(pods.Items))
}
If the application is deployed as a Pod in the cluster, see Accessing the API from within a Pod.
Python client
To use the Python client, run the following command: pip install kubernetes
See Python Client Library page for more installation options.
The Python client can use the same kubeconfig file as the kubectl CLI does to locate and authenticate to the API server. See this example:
from kubernetes import client, config
config.load_kube_config()
v1=client.CoreV1Api()
print("Listing pods with their IPs:")
ret = v1.list_pod_for_all_namespaces(watch=False)
for i in ret.items:
print("%s\t%s\t%s" % (i.status.pod_ip, i.metadata.namespace, i.metadata.name))
Java client
To install the Java Client, run:
# Clone java library
git clone --recursive https://github.com/kubernetes-client/java
# Installing project artifacts, POM etc:
cd java
mvn install
See https://github.com/kubernetes-client/java/releases to see which versions are supported.
The Java client can use the same kubeconfig file as the kubectl CLI does to locate and authenticate to the API server. See this example:
package io.kubernetes.client.examples;
import io.kubernetes.client.ApiClient;
import io.kubernetes.client.ApiException;
import io.kubernetes.client.Configuration;
import io.kubernetes.client.apis.CoreV1Api;
import io.kubernetes.client.models.V1Pod;
import io.kubernetes.client.models.V1PodList;
import io.kubernetes.client.util.ClientBuilder;
import io.kubernetes.client.util.KubeConfig;
import java.io.FileReader;
import java.io.IOException;
/**
* A simple example of how to use the Java API from an application outside a kubernetes cluster
*
* <p>Easiest way to run this: mvn exec:java
* -Dexec.mainClass="io.kubernetes.client.examples.KubeConfigFileClientExample"
*
*/
public class KubeConfigFileClientExample {
public static void main(String[] args) throws IOException, ApiException {
// file path to your KubeConfig
String kubeConfigPath = "~/.kube/config";
// loading the out-of-cluster config, a kubeconfig from file-system
ApiClient client =
ClientBuilder.kubeconfig(KubeConfig.loadKubeConfig(new FileReader(kubeConfigPath))).build();
// set the global default api-client to the in-cluster one from above
Configuration.setDefaultApiClient(client);
// the CoreV1Api loads default api-client from global configuration.
CoreV1Api api = new CoreV1Api();
// invokes the CoreV1Api client
V1PodList list = api.listPodForAllNamespaces(null, null, null, null, null, null, null, null, null);
System.out.println("Listing all pods: ");
for (V1Pod item : list.getItems()) {
System.out.println(item.getMetadata().getName());
}
}
}
dotnet client
To use the dotnet client, run the following command: dotnet add package KubernetesClient --version 1.6.1
See dotnet Client Library page for more installation options. See https://github.com/kubernetes-client/csharp/releases to see which versions are supported.
The dotnet client can use the same kubeconfig file as the kubectl CLI does to locate and authenticate to the API server. See this example:
using System;
using k8s;
namespace simple
{
internal class PodList
{
private static void Main(string[] args)
{
var config = KubernetesClientConfiguration.BuildDefaultConfig();
IKubernetes client = new Kubernetes(config);
Console.WriteLine("Starting Request!");
var list = client.ListNamespacedPod("default");
foreach (var item in list.Items)
{
Console.WriteLine(item.Metadata.Name);
}
if (list.Items.Count == 0)
{
Console.WriteLine("Empty!");
}
}
}
}
JavaScript client
To install the JavaScript client, run the following command: npm install @kubernetes/client-node. See https://github.com/kubernetes-client/javascript/releases to see which versions are supported.
The JavaScript client can use the same kubeconfig file as the kubectl CLI does to locate and authenticate to the API server. See this example:
const k8s = require('@kubernetes/client-node');
const kc = new k8s.KubeConfig();
kc.loadFromDefault();
const k8sApi = kc.makeApiClient(k8s.CoreV1Api);
k8sApi.listNamespacedPod('default').then((res) => {
console.log(res.body);
});
Haskell client
See https://github.com/kubernetes-client/haskell/releases to see which versions are supported.
The Haskell client can use the same kubeconfig file as the kubectl CLI does to locate and authenticate to the API server. See this example:
exampleWithKubeConfig :: IO ()
exampleWithKubeConfig = do
oidcCache <- atomically $ newTVar $ Map.fromList []
(mgr, kcfg) <- mkKubeClientConfig oidcCache $ KubeConfigFile "/path/to/kubeconfig"
dispatchMime
mgr
kcfg
(CoreV1.listPodForAllNamespaces (Accept MimeJSON))
>>= print
What's next
4.2.7 - Advertise Extended Resources for a Node
This page shows how to specify extended resources for a Node. Extended resources allow cluster administrators to advertise node-level resources that would otherwise be unknown to Kubernetes.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
To check the version, enter kubectl version.
Get the names of your Nodes
kubectl get nodes
Choose one of your Nodes to use for this exercise.
Advertise a new extended resource on one of your Nodes
To advertise a new extended resource on a Node, send an HTTP PATCH request to the Kubernetes API server. For example, suppose one of your Nodes has four dongles attached. Here's an example of a PATCH request that advertises four dongle resources for your Node.
PATCH /api/v1/nodes/<your-node-name>/status HTTP/1.1
Accept: application/json
Content-Type: application/json-patch+json
Host: k8s-master:8080
[
{
"op": "add",
"path": "/status/capacity/example.com~1dongle",
"value": "4"
}
]
Note that Kubernetes does not need to know what a dongle is or what a dongle is for. The preceding PATCH request tells Kubernetes that your Node has four things that you call dongles.
Start a proxy, so that you can easily send requests to the Kubernetes API server:
kubectl proxy
In another command window, send the HTTP PATCH request.
Replace <your-node-name> with the name of your Node:
curl --header "Content-Type: application/json-patch+json" \
--request PATCH \
--data '[{"op": "add", "path": "/status/capacity/example.com~1dongle", "value": "4"}]' \
http://localhost:8001/api/v1/nodes/<your-node-name>/status
Note: ~1 is the encoding for the character / in the patch path. The operation path value in JSON-Patch is interpreted as a JSON-Pointer. For more details, see IETF RFC 6901, section 3.
The output shows that the Node has a capacity of 4 dongles:
"capacity": {
"cpu": "2",
"memory": "2049008Ki",
"example.com/dongle": "4",
Describe your Node:
kubectl describe node <your-node-name>
Once again, the output shows the dongle resource:
Capacity:
cpu: 2
memory: 2049008Ki
example.com/dongle: 4
Now, application developers can create Pods that request a certain number of dongles. See Assign Extended Resources to a Container.
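For illustration, here is a minimal Pod sketch that requests some of the dongles advertised above (the Pod and container names are hypothetical; for extended resources, requests must equal limits):
apiVersion: v1
kind: Pod
metadata:
  name: dongle-demo    # hypothetical name
spec:
  containers:
  - name: demo
    image: nginx
    resources:
      requests:
        example.com/dongle: "2"    # the extended resource advertised earlier
      limits:
        example.com/dongle: "2"    # must equal the request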
Discussion
Extended resources are similar to memory and CPU resources. For example, just as a Node has a certain amount of memory and CPU to be shared by all components running on the Node, it can have a certain number of dongles to be shared by all components running on the Node. And just as application developers can create Pods that request a certain amount of memory and CPU, they can create Pods that request a certain number of dongles.
Extended resources are opaque to Kubernetes; Kubernetes does not know anything about what they are. Kubernetes knows only that a Node has a certain number of them. Extended resources must be advertised in integer amounts. For example, a Node can advertise four dongles, but not 4.5 dongles.
Storage example
Suppose a Node has 800 GiB of a special kind of disk storage. You could create a name for the special storage, say example.com/special-storage. Then you could advertise it in chunks of a certain size, say 100 GiB. In that case, your Node would advertise that it has eight resources of type example.com/special-storage.
Capacity:
...
example.com/special-storage: 8
If you want to allow arbitrary requests for special storage, you could advertise special storage in chunks of size 1 byte. In that case, you would advertise 800Gi resources of type example.com/special-storage.
Capacity:
...
example.com/special-storage: 800Gi
Then a Container could request any number of bytes of special storage, up to 800Gi.
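As a sketch, under the 1-byte-chunk scheme a container could then request an arbitrary amount, for example 600Gi of the hypothetical example.com/special-storage resource (again, requests and limits must be equal for extended resources):
resources:
  requests:
    example.com/special-storage: 600Gi
  limits:
    example.com/special-storage: 600Gi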
Clean up
Here is a PATCH request that removes the dongle advertisement from a Node.
PATCH /api/v1/nodes/<your-node-name>/status HTTP/1.1
Accept: application/json
Content-Type: application/json-patch+json
Host: k8s-master:8080
[
{
"op": "remove",
"path": "/status/capacity/example.com~1dongle"
}
]
Start a proxy, so that you can easily send requests to the Kubernetes API server:
kubectl proxy
In another command window, send the HTTP PATCH request.
Replace <your-node-name> with the name of your Node:
curl --header "Content-Type: application/json-patch+json" \
--request PATCH \
--data '[{"op": "remove", "path": "/status/capacity/example.com~1dongle"}]' \
http://localhost:8001/api/v1/nodes/<your-node-name>/status
Verify that the dongle advertisement has been removed:
kubectl describe node <your-node-name> | grep dongle
(you should not see any output)
What's next
For application developers
For cluster administrators
4.2.8 - Autoscale the DNS Service in a Cluster
This page shows how to enable and configure autoscaling of the DNS service in your Kubernetes cluster.
Before you begin
- You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds. To check the version, enter kubectl version.
- This guide assumes your nodes use the AMD64 or Intel 64 CPU architecture.
- Make sure Kubernetes DNS is enabled.
Determine whether DNS horizontal autoscaling is already enabled
List the Deployments in your cluster in the kube-system namespace:
kubectl get deployment --namespace=kube-system
The output is similar to this:
NAME READY UP-TO-DATE AVAILABLE AGE
...
dns-autoscaler 1/1 1 1 ...
...
If you see "dns-autoscaler" in the output, DNS horizontal autoscaling is already enabled, and you can skip to Tuning autoscaling parameters.
Get the name of your DNS Deployment
List the DNS deployments in your cluster in the kube-system namespace:
kubectl get deployment -l k8s-app=kube-dns --namespace=kube-system
The output is similar to this:
NAME READY UP-TO-DATE AVAILABLE AGE
...
coredns 2/2 2 2 ...
...
If you don't see a Deployment for DNS services, you can also look for it by name:
kubectl get deployment --namespace=kube-system
and look for a Deployment named coredns or kube-dns.
Your scale target is Deployment/<your-deployment-name>, where <your-deployment-name> is the name of your DNS Deployment. For example, if the name of your Deployment for DNS is coredns, your scale target is Deployment/coredns.
Note: CoreDNS is deployed with the label k8s-app=kube-dns so that it can work in clusters that originally used kube-dns.
Enable DNS horizontal autoscaling
In this section, you create a new Deployment. The Pods in the Deployment run a container based on the cluster-proportional-autoscaler-amd64 image.
Create a file named dns-horizontal-autoscaler.yaml with this content:
kind: ServiceAccount
apiVersion: v1
metadata:
name: kube-dns-autoscaler
namespace: kube-system
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: system:kube-dns-autoscaler
rules:
- apiGroups: [""]
resources: ["nodes"]
verbs: ["list", "watch"]
- apiGroups: [""]
resources: ["replicationcontrollers/scale"]
verbs: ["get", "update"]
- apiGroups: ["apps"]
resources: ["deployments/scale", "replicasets/scale"]
verbs: ["get", "update"]
# Remove the configmaps rule once below issue is fixed:
# kubernetes-incubator/cluster-proportional-autoscaler#16
- apiGroups: [""]
resources: ["configmaps"]
verbs: ["get", "create"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: system:kube-dns-autoscaler
subjects:
- kind: ServiceAccount
name: kube-dns-autoscaler
namespace: kube-system
roleRef:
kind: ClusterRole
name: system:kube-dns-autoscaler
apiGroup: rbac.authorization.k8s.io
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: kube-dns-autoscaler
namespace: kube-system
labels:
k8s-app: kube-dns-autoscaler
kubernetes.io/cluster-service: "true"
spec:
selector:
matchLabels:
k8s-app: kube-dns-autoscaler
template:
metadata:
labels:
k8s-app: kube-dns-autoscaler
spec:
priorityClassName: system-cluster-critical
securityContext:
seccompProfile:
type: RuntimeDefault
supplementalGroups: [ 65534 ]
fsGroup: 65534
nodeSelector:
kubernetes.io/os: linux
containers:
- name: autoscaler
image: k8s.gcr.io/cpa/cluster-proportional-autoscaler:1.8.4
resources:
requests:
cpu: "20m"
memory: "10Mi"
command:
- /cluster-proportional-autoscaler
- --namespace=kube-system
- --configmap=kube-dns-autoscaler
# Should keep target in sync with cluster/addons/dns/kube-dns.yaml.base
- --target=<SCALE_TARGET>
# When the cluster is using large nodes (with more cores), "coresPerReplica" should dominate.
# If using small nodes, "nodesPerReplica" should dominate.
- --default-params={"linear":{"coresPerReplica":256,"nodesPerReplica":16,"preventSinglePointFailure":true,"includeUnschedulableNodes":true}}
- --logtostderr=true
- --v=2
tolerations:
- key: "CriticalAddonsOnly"
operator: "Exists"
serviceAccountName: kube-dns-autoscaler
In the file, replace <SCALE_TARGET> with your scale target.
Go to the directory that contains your configuration file, and enter this command to create the Deployment:
kubectl apply -f dns-horizontal-autoscaler.yaml
The output of a successful command is:
deployment.apps/kube-dns-autoscaler created
DNS horizontal autoscaling is now enabled.
Tune DNS autoscaling parameters
Verify that the dns-autoscaler ConfigMap exists:
kubectl get configmap --namespace=kube-system
The output is similar to this:
NAME DATA AGE
...
dns-autoscaler 1 ...
...
Modify the data in the ConfigMap:
kubectl edit configmap dns-autoscaler --namespace=kube-system
Look for this line:
linear: '{"coresPerReplica":256,"min":1,"nodesPerReplica":16}'
Modify the fields according to your needs. The "min" field indicates the minimal number of DNS backends. The actual number of backends is calculated using this equation:
replicas = max( ceil( cores × 1/coresPerReplica ) , ceil( nodes × 1/nodesPerReplica ) )
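For example, with the default parameters shown above, a cluster of 5 schedulable nodes with 8 cores each (40 cores in total) gets replicas = max( ceil( 40/256 ), ceil( 5/16 ) ) = max( 1, 1 ) = 1, while a cluster of 400 such nodes (3200 cores) gets max( ceil( 3200/256 ), ceil( 400/16 ) ) = max( 13, 25 ) = 25.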
Note that the values of both coresPerReplica and nodesPerReplica are floats. The idea is that when a cluster is using nodes that have many cores, coresPerReplica dominates. When a cluster is using nodes that have fewer cores, nodesPerReplica dominates.
There are other supported scaling patterns. For details, see cluster-proportional-autoscaler.
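For instance, the ladder pattern replaces the linear formula with explicit step functions. Here is a sketch of a ladder entry in the same ConfigMap, assuming the format documented by the cluster-proportional-autoscaler project (the thresholds shown are illustrative only):
ladder: '{"coresToReplicas":[[1,1],[64,3],[512,5]],"nodesToReplicas":[[1,1],[2,2]]}'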
Disable DNS horizontal autoscaling
There are a few options for turning DNS horizontal autoscaling off. Which option to use depends on different conditions.
Option 1: Scale down the dns-autoscaler deployment to 0 replicas
This option works for all situations. Enter this command:
kubectl scale deployment --replicas=0 dns-autoscaler --namespace=kube-system
The output is:
deployment.apps/dns-autoscaler scaled
Verify that the replica count is zero:
kubectl get rs --namespace=kube-system
The output displays 0 in the DESIRED and CURRENT columns:
NAME DESIRED CURRENT READY AGE
...
dns-autoscaler-6b59789fc8 0 0 0 ...
...
Option 2: Delete the dns-autoscaler deployment
This option works if dns-autoscaler is under your own control, which means no one will re-create it:
kubectl delete deployment dns-autoscaler --namespace=kube-system
The output is:
deployment.apps "dns-autoscaler" deleted
Option 3: Delete the dns-autoscaler manifest file from the master node
This option works if dns-autoscaler is under control of the (deprecated) Addon Manager, and you have write access to the master node.
Sign in to the master node and delete the corresponding manifest file. The common path for this dns-autoscaler is:
/etc/kubernetes/addons/dns-horizontal-autoscaler/dns-horizontal-autoscaler.yaml
After the manifest file is deleted, the Addon Manager will delete the dns-autoscaler Deployment.
Understanding how DNS horizontal autoscaling works
- The cluster-proportional-autoscaler application is deployed separately from the DNS service.
- An autoscaler Pod runs a client that polls the Kubernetes API server for the number of nodes and cores in the cluster.
- A desired replica count is calculated and applied to the DNS backends, based on the current schedulable nodes and cores and the given scaling parameters.
- The scaling parameters and data points are provided via a ConfigMap to the autoscaler, and it refreshes its parameters table every poll interval to be up to date with the latest desired scaling parameters.
- Changes to the scaling parameters are allowed without rebuilding or restarting the autoscaler Pod.
- The autoscaler provides a controller interface to support two control patterns: linear and ladder.
What's next
- Read about Guaranteed Scheduling For Critical Add-On Pods.
- Learn more about the implementation of cluster-proportional-autoscaler.
4.2.9 - Change the default StorageClass
This page shows how to change the default Storage Class that is used to provision volumes for PersistentVolumeClaims that have no special requirements.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
To check the version, enter kubectl version.
Why change the default storage class?
Depending on the installation method, your Kubernetes cluster may be deployed with an existing StorageClass that is marked as default. This default StorageClass is then used to dynamically provision storage for PersistentVolumeClaims that do not require any specific storage class. See PersistentVolumeClaim documentation for details.
The pre-installed default StorageClass may not fit well with your expected workload; for example, it might provision storage that is too expensive. If this is the case, you can either change the default StorageClass or disable it completely to avoid dynamic provisioning of storage.
Deleting the default StorageClass may not work, as it may be re-created automatically by the addon manager running in your cluster. Please consult the docs for your installation for details about addon manager and how to disable individual addons.
Changing the default StorageClass
- List the StorageClasses in your cluster:
kubectl get storageclass
The output is similar to this:
NAME                 PROVISIONER            AGE
standard (default)   kubernetes.io/gce-pd   1d
gold                 kubernetes.io/gce-pd   1d
The default StorageClass is marked by (default).
- Mark the default StorageClass as non-default:
The default StorageClass has an annotation storageclass.kubernetes.io/is-default-class set to true. Any other value, or the absence of the annotation, is interpreted as false.
To mark a StorageClass as non-default, you need to change the annotation's value to false:
kubectl patch storageclass standard -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}'
where standard is the name of your chosen StorageClass.
- Mark a StorageClass as default:
Similar to the previous step, you need to add or set the annotation storageclass.kubernetes.io/is-default-class=true.
kubectl patch storageclass gold -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
Note that at most one StorageClass can be marked as default. If two or more are marked as default, a PersistentVolumeClaim without an explicit storageClassName cannot be created.
- Verify that your chosen StorageClass is default:
kubectl get storageclass
The output is similar to this:
NAME             PROVISIONER            AGE
standard         kubernetes.io/gce-pd   1d
gold (default)   kubernetes.io/gce-pd   1d
What's next
- Learn more about PersistentVolumes.
4.2.10 - Change the Reclaim Policy of a PersistentVolume
This page shows how to change the reclaim policy of a Kubernetes PersistentVolume.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
To check the version, enter kubectl version.
Why change reclaim policy of a PersistentVolume
PersistentVolumes can have various reclaim policies, including "Retain", "Recycle", and "Delete". For dynamically provisioned PersistentVolumes, the default reclaim policy is "Delete". This means that a dynamically provisioned volume is automatically deleted when a user deletes the corresponding PersistentVolumeClaim. This automatic behavior might be inappropriate if the volume contains precious data. In that case, it is more appropriate to use the "Retain" policy. With the "Retain" policy, if a user deletes a PersistentVolumeClaim, the corresponding PersistentVolume will not be deleted. Instead, it is moved to the Released phase, where all of its data can be manually recovered.
Changing the reclaim policy of a PersistentVolume
- List the PersistentVolumes in your cluster:
kubectl get pv
The output is similar to this:
NAME                                       CAPACITY   ACCESSMODES   RECLAIMPOLICY   STATUS   CLAIM            STORAGECLASS   REASON   AGE
pvc-b6efd8da-b7b5-11e6-9d58-0ed433a7dd94   4Gi        RWO           Delete          Bound    default/claim1   manual                  10s
pvc-b95650f8-b7b5-11e6-9d58-0ed433a7dd94   4Gi        RWO           Delete          Bound    default/claim2   manual                  6s
pvc-bb3ca71d-b7b5-11e6-9d58-0ed433a7dd94   4Gi        RWO           Delete          Bound    default/claim3   manual                  3s
This list also includes the name of the claims that are bound to each volume for easier identification of dynamically provisioned volumes.
- Choose one of your PersistentVolumes and change its reclaim policy:
kubectl patch pv <your-pv-name> -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
where <your-pv-name> is the name of your chosen PersistentVolume.
Note: On Windows, you must double quote any JSONPath template that contains spaces (not single quote as shown above for bash). This in turn means that you must use a single quote or escaped double quote around any literals in the template. For example:
kubectl patch pv <your-pv-name> -p "{\"spec\":{\"persistentVolumeReclaimPolicy\":\"Retain\"}}"
- Verify that your chosen PersistentVolume has the right policy:
kubectl get pv
The output is similar to this:
NAME                                       CAPACITY   ACCESSMODES   RECLAIMPOLICY   STATUS   CLAIM            STORAGECLASS   REASON   AGE
pvc-b6efd8da-b7b5-11e6-9d58-0ed433a7dd94   4Gi        RWO           Delete          Bound    default/claim1   manual                  40s
pvc-b95650f8-b7b5-11e6-9d58-0ed433a7dd94   4Gi        RWO           Delete          Bound    default/claim2   manual                  36s
pvc-bb3ca71d-b7b5-11e6-9d58-0ed433a7dd94   4Gi        RWO           Retain          Bound    default/claim3   manual                  33s
In the preceding output, you can see that the volume bound to claim default/claim3 has reclaim policy Retain. It will not be automatically deleted when a user deletes claim default/claim3.
What's next
- Learn more about PersistentVolumes.
- Learn more about PersistentVolumeClaims.
References
- PersistentVolume
  - Pay attention to the .spec.persistentVolumeReclaimPolicy field of PersistentVolume.
- PersistentVolumeClaim
4.2.11 - Cloud Controller Manager Administration
Kubernetes v1.11 [beta]
Since cloud providers develop and release at a different pace compared to the Kubernetes project, abstracting the provider-specific code to the cloud-controller-manager binary allows cloud vendors to evolve independently from the core Kubernetes code.
The cloud-controller-manager can be linked to any cloud provider that satisfies cloudprovider.Interface. For backwards compatibility, the cloud-controller-manager provided in the core Kubernetes project uses the same cloud libraries as kube-controller-manager. Cloud providers already supported in Kubernetes core are expected to use the in-tree cloud-controller-manager to transition out of Kubernetes core.
Administration
Requirements
Every cloud has its own set of requirements for running its own cloud provider integration; these should not be too different from the requirements for running kube-controller-manager. As a general rule of thumb, you'll need:
- cloud authentication/authorization: your cloud may require a token or IAM rules to allow access to its APIs
- Kubernetes authentication/authorization: cloud-controller-manager may need RBAC rules set to speak to the Kubernetes API server
- high availability: like kube-controller-manager, you may want a highly available setup for the cloud controller manager using leader election (on by default)
Running cloud-controller-manager
Successfully running cloud-controller-manager requires some changes to your cluster configuration.
- kube-apiserver and kube-controller-manager MUST NOT specify the --cloud-provider flag. This ensures that they do not run any cloud-specific loops that would otherwise be run by the cloud controller manager. In the future, this flag will be deprecated and removed.
- kubelet must run with --cloud-provider=external. This is to ensure that the kubelet is aware that it must be initialized by the cloud controller manager before it is scheduled any work.
Keep in mind that setting up your cluster to use cloud controller manager will change your cluster behaviour in a few ways:
- kubelets specifying --cloud-provider=external will add a taint node.cloudprovider.kubernetes.io/uninitialized with an effect NoSchedule during initialization. This marks the node as needing a second initialization from an external controller before it can be scheduled work. Note that in the event that cloud controller manager is not available, new nodes in the cluster will be left unschedulable. The taint is important since the scheduler may require cloud specific information about nodes such as their region or type (high cpu, gpu, high memory, spot instance, etc).
- cloud information about nodes in the cluster will no longer be retrieved using local metadata; instead, all API calls to retrieve node information will go through the cloud controller manager. This may mean you can restrict access to your cloud API on the kubelets for better security. For larger clusters, consider whether the cloud controller manager will hit rate limits, since it is now responsible for almost all API calls to your cloud from within the cluster.
The cloud controller manager can implement:
- Node controller - responsible for updating kubernetes nodes using cloud APIs and deleting kubernetes nodes that were deleted on your cloud.
- Service controller - responsible for loadbalancers on your cloud against services of type LoadBalancer.
- Route controller - responsible for setting up network routes on your cloud
- any other features you would like to implement if you are running an out-of-tree provider.
Examples
If you are using a cloud that is currently supported in Kubernetes core and would like to adopt cloud controller manager, see the cloud controller manager in kubernetes core.
For cloud controller managers not in Kubernetes core, you can find the respective projects in repositories maintained by cloud vendors or by SIGs.
For providers already in Kubernetes core, you can run the in-tree cloud controller manager as a DaemonSet in your cluster, use the following as a guideline:
# This is an example of how to setup cloud-controller-manager as a Daemonset in your cluster.
# It assumes that your masters can run pods and have the role node-role.kubernetes.io/master
# Note that this DaemonSet will not work straight out of the box for your cloud; it is
# meant to be a guideline.
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: cloud-controller-manager
namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: system:cloud-controller-manager
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: cluster-admin
subjects:
- kind: ServiceAccount
name: cloud-controller-manager
namespace: kube-system
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
labels:
k8s-app: cloud-controller-manager
name: cloud-controller-manager
namespace: kube-system
spec:
selector:
matchLabels:
k8s-app: cloud-controller-manager
template:
metadata:
labels:
k8s-app: cloud-controller-manager
spec:
serviceAccountName: cloud-controller-manager
containers:
- name: cloud-controller-manager
# for in-tree providers we use k8s.gcr.io/cloud-controller-manager
# this can be replaced with any other image for out-of-tree providers
image: k8s.gcr.io/cloud-controller-manager:v1.8.0
command:
- /usr/local/bin/cloud-controller-manager
- --cloud-provider=[YOUR_CLOUD_PROVIDER] # Add your own cloud provider here!
- --leader-elect=true
- --use-service-account-credentials
# these flags will vary for every cloud provider
- --allocate-node-cidrs=true
- --configure-cloud-routes=true
- --cluster-cidr=172.17.0.0/16
tolerations:
# this is required so CCM can bootstrap itself
- key: node.cloudprovider.kubernetes.io/uninitialized
value: "true"
effect: NoSchedule
# this is to have the daemonset runnable on master nodes
# the taint may vary depending on your cluster setup
- key: node-role.kubernetes.io/master
effect: NoSchedule
# this is to restrict CCM to only run on master nodes
# the node selector may vary depending on your cluster setup
nodeSelector:
node-role.kubernetes.io/master: ""
Limitations
Running the cloud controller manager comes with a few possible limitations. Although these limitations are being addressed in upcoming releases, it's important that you are aware of them for production workloads.
Support for Volumes
The cloud controller manager does not implement any of the volume controllers found in kube-controller-manager, as the volume integrations also require coordination with kubelets. As we evolve CSI (container storage interface) and add stronger support for flex volume plugins, necessary support will be added to the cloud controller manager so that clouds can fully integrate with volumes. Learn more about out-of-tree CSI volume plugins here.
Scalability
The cloud-controller-manager queries your cloud provider's APIs to retrieve information for all nodes. For very large clusters, consider possible bottlenecks such as resource requirements and API rate limiting.
Chicken and Egg
The goal of the cloud controller manager project is to decouple development of cloud features from the core Kubernetes project. Unfortunately, many aspects of the Kubernetes project have assumptions that cloud provider features are tightly integrated into the project. As a result, adopting this new architecture can create several situations where a request is made for information from a cloud provider, but the cloud controller manager may not be able to return that information until the original request is complete.
A good example of this is the TLS bootstrapping feature in the Kubelet. TLS bootstrapping assumes that the Kubelet has the ability to ask the cloud provider (or a local metadata service) for all its address types (private, public, etc) but cloud controller manager cannot set a node's address types without being initialized in the first place which requires that the kubelet has TLS certificates to communicate with the apiserver.
As this initiative evolves, changes will be made to address these issues in upcoming releases.
What's next
To build and develop your own cloud controller manager, read Developing Cloud Controller Manager.
4.2.12 - Configure Quotas for API Objects
This page shows how to configure quotas for API objects, including PersistentVolumeClaims and Services. A quota restricts the number of objects, of a particular type, that can be created in a namespace. You specify quotas in a ResourceQuota object.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
To check the version, enter kubectl version.
Create a namespace
Create a namespace so that the resources you create in this exercise are isolated from the rest of your cluster.
kubectl create namespace quota-object-example
Create a ResourceQuota
Here is the configuration file for a ResourceQuota object:
apiVersion: v1
kind: ResourceQuota
metadata:
name: object-quota-demo
spec:
hard:
persistentvolumeclaims: "1"
services.loadbalancers: "2"
services.nodeports: "0"
Create the ResourceQuota:
kubectl apply -f https://k8s.io/examples/admin/resource/quota-objects.yaml --namespace=quota-object-example
View detailed information about the ResourceQuota:
kubectl get resourcequota object-quota-demo --namespace=quota-object-example --output=yaml
The output shows that in the quota-object-example namespace, there can be at most one PersistentVolumeClaim, at most two Services of type LoadBalancer, and no Services of type NodePort.
status:
hard:
persistentvolumeclaims: "1"
services.loadbalancers: "2"
services.nodeports: "0"
used:
persistentvolumeclaims: "0"
services.loadbalancers: "0"
services.nodeports: "0"
Create a PersistentVolumeClaim
Here is the configuration file for a PersistentVolumeClaim object:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: pvc-quota-demo
spec:
storageClassName: manual
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 3Gi
Create the PersistentVolumeClaim:
kubectl apply -f https://k8s.io/examples/admin/resource/quota-objects-pvc.yaml --namespace=quota-object-example
Verify that the PersistentVolumeClaim was created:
kubectl get persistentvolumeclaims --namespace=quota-object-example
The output shows that the PersistentVolumeClaim exists and has status Pending:
NAME STATUS
pvc-quota-demo Pending
Attempt to create a second PersistentVolumeClaim
Here is the configuration file for a second PersistentVolumeClaim:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: pvc-quota-demo-2
spec:
storageClassName: manual
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 4Gi
Attempt to create the second PersistentVolumeClaim:
kubectl apply -f https://k8s.io/examples/admin/resource/quota-objects-pvc-2.yaml --namespace=quota-object-example
The output shows that the second PersistentVolumeClaim was not created, because it would have exceeded the quota for the namespace.
persistentvolumeclaims "pvc-quota-demo-2" is forbidden:
exceeded quota: object-quota-demo, requested: persistentvolumeclaims=1,
used: persistentvolumeclaims=1, limited: persistentvolumeclaims=1
Notes
These are the strings used to identify API resources that can be constrained by quotas:
String                       API Object
"pods"                       Pod
"services"                   Service
"replicationcontrollers"     ReplicationController
"resourcequotas"             ResourceQuota
"secrets"                    Secret
"configmaps"                 ConfigMap
"persistentvolumeclaims"     PersistentVolumeClaim
"services.nodeports"         Service of type NodePort
"services.loadbalancers"     Service of type LoadBalancer
Clean up
Delete your namespace:
kubectl delete namespace quota-object-example
What's next
For cluster administrators
-
Configure Default Memory Requests and Limits for a Namespace
-
Configure Minimum and Maximum Memory Constraints for a Namespace
-
Configure Minimum and Maximum CPU Constraints for a Namespace
For app developers
4.2.13 - Control CPU Management Policies on the Node
Kubernetes v1.12 [beta]
Kubernetes keeps many aspects of how pods execute on nodes abstracted from the user. This is by design. However, some workloads require stronger guarantees in terms of latency and/or performance in order to operate acceptably. The kubelet provides methods to enable more complex workload placement policies while keeping the abstraction free from explicit placement directives.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
To check the version, enter kubectl version.
CPU Management Policies
By default, the kubelet uses CFS quota to enforce pod CPU limits. When the node runs many CPU-bound pods, the workload can move to different CPU cores depending on whether the pod is throttled and which CPU cores are available at scheduling time. Many workloads are not sensitive to this migration and thus work fine without any intervention.
However, in workloads where CPU cache affinity and scheduling latency significantly affect workload performance, the kubelet allows alternative CPU management policies to determine some placement preferences on the node.
Configuration
The CPU Manager policy is set with the --cpu-manager-policy kubelet flag or the cpuManagerPolicy field in KubeletConfiguration.
There are two supported policies:
- none: the default policy.
- static: allows pods with certain resource characteristics to be granted increased CPU affinity and exclusivity on the node.
The CPU manager periodically writes resource updates through the CRI in order to reconcile in-memory CPU assignments with cgroupfs. The reconcile frequency is set through the --cpu-manager-reconcile-period Kubelet configuration value. If not specified, it defaults to the same duration as --node-status-update-frequency.
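If you configure the kubelet through a configuration file rather than flags, the same settings can be expressed there. A minimal sketch, assuming the KubeletConfiguration equivalents of these flags (the reconcile period shown is illustrative):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
# equivalent of --cpu-manager-reconcile-period; illustrative value
cpuManagerReconcilePeriod: 10s
Keep in mind that, as described below, changing the policy on an existing node also requires draining the node and removing the old state file.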
The behavior of the static policy can be fine-tuned using the --cpu-manager-policy-options flag. The flag takes a comma-separated list of key=value policy options. This feature can be disabled completely using the CPUManagerPolicyOptions feature gate.
The policy options are split into two groups: alpha quality (hidden by default) and beta quality (visible by default). The groups are guarded respectively by the CPUManagerPolicyAlphaOptions and CPUManagerPolicyBetaOptions feature gates. Diverging from the Kubernetes standard, these feature gates guard groups of options, because it would have been too cumbersome to add a feature gate for each individual option.
Changing the CPU Manager Policy
Since the CPU manager policy can only be applied when the kubelet spawns new pods, simply changing from "none" to "static" won't apply to existing pods. So, in order to properly change the CPU manager policy on a node, perform the following steps:
- Drain the node.
- Stop kubelet.
- Remove the old CPU manager state file. The path to this file is /var/lib/kubelet/cpu_manager_state by default. This clears the state maintained by the CPU Manager so that the cpusets set up by the new policy won't conflict with it.
- Edit the kubelet configuration to change the CPU manager policy to the desired value.
- Start kubelet.
Repeat this process for every node that needs its CPU manager policy changed. Skipping this process will result in kubelet crashlooping with the following error:
could not restore state from checkpoint: configured policy "static" differs from state checkpoint policy "none", please drain this node and delete the CPU manager checkpoint file "/var/lib/kubelet/cpu_manager_state" before restarting Kubelet
None policy
The none policy explicitly enables the existing default CPU affinity scheme, providing no affinity beyond what the OS scheduler does automatically. Limits on CPU usage for Guaranteed pods and Burstable pods are enforced using CFS quota.
Static policy
The static policy allows containers in Guaranteed pods with integer CPU requests access to exclusive CPUs on the node. This exclusivity is enforced using the cpuset cgroup controller.
Note: If the set of online CPUs changes on the node, drain the node and manually reset the CPU manager by deleting the state file cpu_manager_state in the kubelet root directory.
This policy manages a shared pool of CPUs that initially contains all CPUs in the node. The amount of exclusively allocatable CPUs is equal to the total number of CPUs in the node minus any CPU reservations made by the kubelet --kube-reserved or --system-reserved options. From 1.17, the CPU reservation list can be specified explicitly with the kubelet --reserved-cpus option. The explicit CPU list specified by --reserved-cpus takes precedence over the CPU reservation specified by --kube-reserved and --system-reserved. CPUs reserved by these options are taken, in integer quantity, from the initial shared pool in ascending order by physical core ID. This shared pool is the set of CPUs on which any containers in BestEffort and Burstable pods run. Containers in Guaranteed pods with fractional CPU requests also run on CPUs in the shared pool. Only containers that are both part of a Guaranteed pod and have integer CPU requests are assigned exclusive CPUs.
Note: Reserve some CPUs using --kube-reserved and/or --system-reserved or --reserved-cpus when the static policy is enabled. This is because zero CPU reservation would allow the shared pool to become empty.
As Guaranteed pods whose containers fit the requirements for being statically assigned are scheduled to the node, CPUs are removed from the shared pool and placed in the cpuset for the container. CFS quota is not used to bound the CPU usage of these containers as their usage is bound by the scheduling domain itself. In other words, the number of CPUs in the container cpuset is equal to the integer CPU limit specified in the pod spec. This static assignment increases CPU affinity and decreases context switches due to throttling for the CPU-bound workload.
Consider the containers in the following pod specs:
spec:
containers:
- name: nginx
image: nginx
This pod runs in the BestEffort QoS class because no resource requests or limits are specified. It runs in the shared pool.
spec:
containers:
- name: nginx
image: nginx
resources:
limits:
memory: "200Mi"
requests:
memory: "100Mi"
This pod runs in the Burstable QoS class because resource requests do not equal limits and the cpu quantity is not specified. It runs in the shared pool.
spec:
containers:
- name: nginx
image: nginx
resources:
limits:
memory: "200Mi"
cpu: "2"
requests:
memory: "100Mi"
cpu: "1"
This pod runs in the Burstable QoS class because resource requests do not equal limits. It runs in the shared pool.
spec:
containers:
- name: nginx
image: nginx
resources:
limits:
memory: "200Mi"
cpu: "2"
requests:
memory: "200Mi"
cpu: "2"
This pod runs in the Guaranteed QoS class because requests are equal to limits, and the container's resource limit for the CPU resource is an integer greater than or equal to one. The nginx container is granted 2 exclusive CPUs.
spec:
containers:
- name: nginx
image: nginx
resources:
limits:
memory: "200Mi"
cpu: "1.5"
requests:
memory: "200Mi"
cpu: "1.5"
This pod runs in the Guaranteed QoS class because requests are equal to limits. But the container's resource limit for the CPU resource is a fraction. It runs in the shared pool.
spec:
containers:
- name: nginx
image: nginx
resources:
limits:
memory: "200Mi"
cpu: "2"
This pod runs in the Guaranteed QoS class because only limits are specified, and requests are set equal to limits when not explicitly specified. And the container's resource limit for the CPU resource is an integer greater than or equal to one. The nginx container is granted 2 exclusive CPUs.
Static policy options
You can toggle groups of options on and off based upon their maturity level using the following feature gates:
- CPUManagerPolicyBetaOptions: default enabled. Disable to hide beta-level options.
- CPUManagerPolicyAlphaOptions: default disabled. Enable to show alpha-level options. You will still have to enable each option using the CPUManagerPolicyOptions kubelet option.
The following policy options exist for the static CPUManager policy:
- full-pcpus-only (beta, visible by default)
- distribute-cpus-across-numa (alpha, hidden by default)
If the full-pcpus-only policy option is specified, the static policy will always allocate full physical cores. By default, without this option, the static policy allocates CPUs using a topology-aware best-fit allocation. On SMT enabled systems, the policy can allocate individual virtual cores, which correspond to hardware threads. This can lead to different containers sharing the same physical cores; this behaviour in turn contributes to the noisy neighbours problem. With the option enabled, the pod will be admitted by the kubelet only if the CPU request of all its containers can be fulfilled by allocating full physical cores. If the pod does not pass admission, it will be put in the Failed state with the message SMTAlignmentError.
If the distribute-cpus-across-numa policy option is specified, the static policy will evenly distribute CPUs across NUMA nodes in cases where more than one NUMA node is required to satisfy the allocation. By default, the CPUManager will pack CPUs onto one NUMA node until it is filled, with any remaining CPUs simply spilling over to the next NUMA node. This can cause undesired bottlenecks in parallel code relying on barriers (and similar synchronization primitives), as this type of code tends to run only as fast as its slowest worker (which is slowed down by the fact that fewer CPUs are available on at least one NUMA node). By distributing CPUs evenly across NUMA nodes, application developers can more easily ensure that no single worker suffers from NUMA effects more than any other, improving the overall performance of these types of applications.
The full-pcpus-only option can be enabled by adding full-pcpus-only=true to the CPUManager policy options. Likewise, the distribute-cpus-across-numa option can be enabled by adding distribute-cpus-across-numa=true to the CPUManager policy options. When both are set, they are "additive" in the sense that CPUs will be distributed across NUMA nodes in chunks of full physical cores rather than individual cores.
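As a sketch, assuming the kubelet is configured via a KubeletConfiguration file, both options could be enabled like this (the option values are strings; enabling the alpha-level option also requires the CPUManagerPolicyAlphaOptions feature gate, as described above):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
cpuManagerPolicyOptions:
  full-pcpus-only: "true"
  distribute-cpus-across-numa: "true"    # alpha option, hidden by default
featureGates:
  CPUManagerPolicyAlphaOptions: true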
4.2.14 - Control Topology Management Policies on a node
Kubernetes v1.18 [beta]
An increasing number of systems leverage a combination of CPUs and hardware accelerators to support latency-critical execution and high-throughput parallel computation. These include workloads in fields such as telecommunications, scientific computing, machine learning, financial services and data analytics. Such hybrid systems comprise a high performance environment.
In order to extract the best performance, optimizations related to CPU isolation, memory and device locality are required. However, in Kubernetes, these optimizations are handled by a disjoint set of components.
Topology Manager is a Kubelet component that aims to co-ordinate the set of components that are responsible for these optimizations.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
Your Kubernetes server must be at or later than version v1.18. To check the version, enter kubectl version.
How Topology Manager Works
Prior to the introduction of the Topology Manager, the CPU and Device Manager in Kubernetes made resource allocation decisions independently of each other. This can result in undesirable allocations on multi-socket systems; performance- and latency-sensitive applications suffer due to these undesirable allocations. Undesirable in this case means, for example, CPUs and devices being allocated from different NUMA Nodes, thus incurring additional latency.
The Topology Manager is a Kubelet component, which acts as a source of truth so that other Kubelet components can make topology aligned resource allocation choices.
The Topology Manager provides an interface for components, called Hint Providers, to send and receive topology information. Topology Manager has a set of node level policies which are explained below.
The Topology Manager receives topology information from the Hint Providers as a bitmask denoting the available NUMA Nodes and a preferred allocation indication. The Topology Manager policies perform a set of operations on the hints provided and converge on the hint determined by the policy to give the optimal result; if an undesirable hint is stored, the preferred field for the hint will be set to false. In the current policies, preferred is the narrowest preferred mask. The selected hint is stored as part of the Topology Manager. Depending on the policy configured, the pod can be accepted or rejected from the node based on the selected hint. The hint is then stored in the Topology Manager for use by the Hint Providers when making resource allocation decisions.
Enable the Topology Manager feature
Support for the Topology Manager requires the TopologyManager feature gate to be enabled. It is enabled by default starting with Kubernetes 1.18.
Topology Manager Scopes and Policies
The Topology Manager currently:
- Aligns Pods of all QoS classes.
- Aligns the requested resources that Hint Provider provides topology hints for.
If these conditions are met, the Topology Manager will align the requested resources.
In order to customise how this alignment is carried out, the Topology Manager provides two distinct knobs: scope and policy.
The scope defines the granularity at which you would like resource alignment to be performed (e.g. at the pod or container level). And the policy defines the actual strategy used to carry out the alignment (e.g. best-effort, restricted, single-numa-node, etc.). Details on the various scopes and policies available today can be found below.
Topology Manager Scopes
The Topology Manager can deal with the alignment of resources in a couple of distinct scopes:
- container (default)
- pod
Either option can be selected at kubelet startup, with the --topology-manager-scope flag.
container scope
The container scope is used by default. Within this scope, the Topology Manager performs a number of sequential resource alignments; that is, for each container (in a pod) a separate alignment is computed. In other words, there is no notion of grouping the containers to a specific set of NUMA nodes for this particular scope. In effect, the Topology Manager performs an arbitrary alignment of individual containers to NUMA nodes.
The notion of grouping containers is endorsed and implemented on purpose in the following scope, the pod scope.
pod scope
To select the pod scope, start the kubelet with the command line option --topology-manager-scope=pod.
This scope allows for grouping all containers in a pod to a common set of NUMA nodes. That is, the Topology Manager treats a pod as a whole and attempts to allocate the entire pod (all containers) to either a single NUMA node or a common set of NUMA nodes. The following examples illustrate the alignments produced by the Topology Manager on different occasions:
- all containers can be and are allocated to a single NUMA node;
- all containers can be and are allocated to a shared set of NUMA nodes.
The total amount of a particular resource demanded for the entire pod is calculated according to the effective requests/limits formula; this total is equal to the maximum of (a worked example follows this list):
- the sum of all app container requests,
- the maximum of init container requests, for a resource.
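For example, in a pod with one init container requesting 2 CPUs followed by two app containers requesting 1 CPU each, the effective CPU demand for the pod is max( 1 + 1, 2 ) = 2 CPUs.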
Using the pod scope in tandem with the single-numa-node Topology Manager policy is specifically valuable for workloads that are latency sensitive or for high-throughput applications that perform IPC. By combining both options, you are able to place all containers in a pod onto a single NUMA node; hence, the inter-NUMA communication overhead can be eliminated for that pod.
In the case of the single-numa-node policy, a pod is accepted only if a suitable set of NUMA nodes is present among possible allocations. Reconsider the example above:
- a set containing only a single NUMA node leads to the pod being admitted,
- whereas a set containing more NUMA nodes results in pod rejection (because instead of one NUMA node, two or more NUMA nodes are required to satisfy the allocation).
To recap, Topology Manager first computes a set of NUMA nodes and then tests it against Topology Manager policy, which either leads to the rejection or admission of the pod.
Topology Manager Policies
Topology Manager supports four allocation policies. You can set a policy via the kubelet flag --topology-manager-policy. There are four supported policies:
- none (default)
- best-effort
- restricted
- single-numa-node
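A minimal sketch of selecting a scope and policy through the kubelet configuration file, assuming the KubeletConfiguration equivalents of the flags above (the combination shown is illustrative):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
topologyManagerScope: pod
topologyManagerPolicy: single-numa-node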
none policy
This is the default policy and does not perform any topology alignment.
best-effort policy
For each container in a Pod, the kubelet, with the best-effort topology management policy, calls each Hint Provider to discover their resource availability. Using this information, the Topology Manager stores the preferred NUMA Node affinity for that container. If the affinity is not preferred, the Topology Manager will store this and admit the pod to the node anyway.
The Hint Providers can then use this information when making the resource allocation decision.
restricted policy
For each container in a Pod, the kubelet, with the restricted topology management policy, calls each Hint Provider to discover their resource availability. Using this information, the Topology Manager stores the preferred NUMA Node affinity for that container. If the affinity is not preferred, the Topology Manager will reject this pod from the node. This will result in a pod in a Terminated state with a pod admission failure.
Once the pod is in a Terminated state, the Kubernetes scheduler will not attempt to reschedule the pod. It is recommended to use a ReplicaSet or Deployment to trigger a redeploy of the pod. An external control loop could also be implemented to trigger a redeployment of pods that have the Topology Affinity error.
If the pod is admitted, the Hint Providers can then use this information when making the resource allocation decision.
single-numa-node policy
For each container in a Pod, the kubelet, with the single-numa-node topology management policy, calls each Hint Provider to discover their resource availability. Using this information, the Topology Manager determines if a single NUMA Node affinity is possible. If it is, the Topology Manager will store this and the Hint Providers can then use this information when making the resource allocation decision. If, however, this is not possible, then the Topology Manager will reject the pod from the node. This will result in a pod in a Terminated state with a pod admission failure.
Once the pod is in a Terminated state, the Kubernetes scheduler will not attempt to reschedule the pod. It is recommended to use a Deployment with replicas to trigger a redeploy of the Pod. An external control loop could also be implemented to trigger a redeployment of pods that have the Topology Affinity error.
Pod Interactions with Topology Manager Policies
Consider the containers in the following pod specs:
spec:
containers:
- name: nginx
image: nginx
This pod runs in the BestEffort QoS class because no resource requests or limits are specified.
spec:
containers:
- name: nginx
image: nginx
resources:
limits:
memory: "200Mi"
requests:
memory: "100Mi"
This pod runs in the Burstable QoS class because requests are less than limits.
If the selected policy is anything other than none, the Topology Manager would consider these Pod specifications. The Topology Manager would consult the Hint Providers to get topology hints. In the case of the static CPU Manager policy, a default topology hint would be returned, because these Pods do not explicitly request CPU resources.
spec:
containers:
- name: nginx
image: nginx
resources:
limits:
memory: "200Mi"
cpu: "2"
example.com/device: "1"
requests:
memory: "200Mi"
cpu: "2"
example.com/device: "1"
This pod with integer CPU request runs in the Guaranteed QoS class because requests are equal to limits.
spec:
containers:
- name: nginx
image: nginx
resources:
limits:
memory: "200Mi"
cpu: "300m"
example.com/device: "1"
requests:
memory: "200Mi"
cpu: "300m"
example.com/device: "1"
This pod with sharing CPU request runs in the Guaranteed QoS class because requests are equal to limits.
spec:
containers:
- name: nginx
image: nginx
resources:
limits:
example.com/deviceA: "1"
example.com/deviceB: "1"
requests:
example.com/deviceA: "1"
example.com/deviceB: "1"
This pod runs in the BestEffort QoS class because there are no CPU and memory requests.
The Topology Manager would consider the above pods. The Topology Manager would consult the Hint Providers, which are the CPU and Device Manager, to get topology hints for the pods.
In the case of the Guaranteed pod with integer CPU request, the static CPU Manager policy would return topology hints relating to the exclusive CPU, and the Device Manager would send back hints for the requested device.
In the case of the Guaranteed pod with sharing CPU request, the static CPU Manager policy would return the default topology hint, as there is no exclusive CPU request, and the Device Manager would send back hints for the requested device.
In the above two cases of the Guaranteed pod, the none CPU Manager policy would return the default topology hint.
In the case of the BestEffort pod, the static CPU Manager policy would send back the default topology hint, as there is no CPU request, and the Device Manager would send back the hints for each of the requested devices.
Using this information the Topology Manager calculates the optimal hint for the pod and stores this information, which will be used by the Hint Providers when they are making their resource assignments.
Known Limitations
- The maximum number of NUMA nodes that Topology Manager allows is 8. With more than 8 NUMA nodes, there will be a state explosion when trying to enumerate the possible NUMA affinities and generating their hints.
- The scheduler is not topology-aware, so it is possible to be scheduled on a node and then fail on the node due to the Topology Manager.
4.2.15 - Customizing DNS Service
This page explains how to configure your DNS Pod(s) and customize the DNS resolution process in your cluster.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
Your cluster must be running the CoreDNS add-on. Migrating to CoreDNS explains how to use kubeadm to migrate from kube-dns.
Your Kubernetes server must be at or later than version v1.12. To check the version, enter kubectl version.
Introduction
DNS is a built-in Kubernetes service launched automatically using the addon manager cluster add-on.
As of Kubernetes v1.12, CoreDNS is the recommended DNS server, replacing kube-dns. If your cluster originally used kube-dns, you may still have kube-dns deployed rather than CoreDNS.
Note: The CoreDNS Service is named kube-dns in the metadata.name field. This is so that there is greater interoperability with workloads that relied on the legacy kube-dns Service name to resolve addresses internal to the cluster. Using a Service named kube-dns abstracts away the implementation detail of which DNS provider is running behind that common name.
If you are running CoreDNS as a Deployment, it will typically be exposed as a Kubernetes Service with a static IP address.
The kubelet passes DNS resolver information to each container with the --cluster-dns=<dns-service-ip> flag. DNS names also need domains. You configure the local domain in the kubelet with the flag --cluster-domain=<default-local-domain>.
The DNS server supports forward lookups (A and AAAA records), port lookups (SRV records), reverse IP address lookups (PTR records), and more. For more information, see DNS for Services and Pods.
If a Pod's dnsPolicy is set to default, it inherits the name resolution configuration from the node that the Pod runs on. The Pod's DNS resolution should behave the same as the node's. But see Known issues.
If you don't want this, or if you want a different DNS config for pods, you can use the kubelet's --resolv-conf flag. Set this flag to "" to prevent Pods from inheriting DNS. Set it to a valid file path to specify a file other than /etc/resolv.conf for DNS inheritance.
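A minimal sketch of these kubelet settings expressed in a KubeletConfiguration file, assuming the clusterDNS, clusterDomain, and resolvConf fields (the IP and path shown are illustrative and depend on your cluster):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
clusterDNS:
- "10.96.0.10"    # illustrative DNS Service IP
clusterDomain: cluster.local
resolvConf: /etc/resolv.conf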
CoreDNS
CoreDNS is a general-purpose authoritative DNS server that can serve as cluster DNS, complying with the dns specifications.
CoreDNS ConfigMap options
CoreDNS is a DNS server that is modular and pluggable, and each plugin adds new functionality to CoreDNS. This can be configured by maintaining a Corefile, which is the CoreDNS configuration file. As a cluster administrator, you can modify the ConfigMap for the CoreDNS Corefile to change how DNS service discovery behaves for that cluster.
In Kubernetes, CoreDNS is installed with the following default Corefile configuration:
apiVersion: v1
kind: ConfigMap
metadata:
name: coredns
namespace: kube-system
data:
Corefile: |
.:53 {
errors
health {
lameduck 5s
}
ready
kubernetes cluster.local in-addr.arpa ip6.arpa {
pods insecure
fallthrough in-addr.arpa ip6.arpa
ttl 30
}
prometheus :9153
forward . /etc/resolv.conf
cache 30
loop
reload
loadbalance
}
The Corefile configuration includes the following plugins of CoreDNS:
- errors: Errors are logged to stdout.
- health: Health of CoreDNS is reported to http://localhost:8080/health. In this extended syntax, lameduck will make the process unhealthy, then wait for 5 seconds before the process is shut down.
- ready: An HTTP endpoint on port 8181 will return 200 OK when all plugins that are able to signal readiness have done so.
- kubernetes: CoreDNS will reply to DNS queries based on the IPs of the Services and Pods of Kubernetes. You can find more details about that plugin on the CoreDNS website. ttl allows you to set a custom TTL for responses. The default is 5 seconds. The minimum TTL allowed is 0 seconds, and the maximum is capped at 3600 seconds. Setting TTL to 0 will prevent records from being cached. The pods insecure option is provided for backward compatibility with kube-dns. You can use the pods verified option, which returns an A record only if there exists a pod in the same namespace with a matching IP. The pods disabled option can be used if you don't use pod records.
- prometheus: Metrics of CoreDNS are available at http://localhost:9153/metrics in Prometheus format (also known as OpenMetrics).
- forward: Any queries that are not within the cluster domain of Kubernetes will be forwarded to predefined resolvers (/etc/resolv.conf).
- cache: This enables a frontend cache.
- loop: Detects simple forwarding loops and halts the CoreDNS process if a loop is found.
- reload: Allows automatic reload of a changed Corefile. After you edit the ConfigMap configuration, allow two minutes for your changes to take effect.
- loadbalance: This is a round-robin DNS loadbalancer that randomizes the order of A, AAAA, and MX records in the answer.
You can modify the default CoreDNS behavior by modifying the ConfigMap.
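Since the Corefile lives in the coredns ConfigMap in the kube-system namespace, one way to open it for editing (the same command is used in the debugging page later in this document) is:

kubectl -n kube-system edit configmap coredns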
Configuration of Stub-domain and upstream nameserver using CoreDNS
CoreDNS has the ability to configure stubdomains and upstream nameservers using the forward plugin.
Example
Suppose a cluster operator has a Consul domain server located at 10.150.0.1, and all Consul names have the suffix .consul.local. To configure it in CoreDNS, the cluster administrator creates the following stanza in the CoreDNS ConfigMap.
consul.local:53 {
    errors
    cache 30
    forward . 10.150.0.1
}
To explicitly force all non-cluster DNS lookups to go through a specific nameserver at 172.16.0.1, point the forward to the nameserver instead of /etc/resolv.conf:

forward . 172.16.0.1
The final ConfigMap along with the default Corefile configuration looks like:
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . 172.16.0.1
        cache 30
        loop
        reload
        loadbalance
    }
    consul.local:53 {
        errors
        cache 30
        forward . 10.150.0.1
    }
The kubeadm tool supports automatic translation from the kube-dns ConfigMap to the equivalent CoreDNS ConfigMap.
CoreDNS configuration equivalent to kube-dns
CoreDNS supports the features of kube-dns and more.
A ConfigMap created for kube-dns to support StubDomains and upstreamNameservers translates to the forward plugin in CoreDNS.
Example
This example ConfigMap for kube-dns specifies stubdomains and upstreamnameservers:
apiVersion: v1
data:
  stubDomains: |
    {"abc.com" : ["1.2.3.4"], "my.cluster.local" : ["2.3.4.5"]}
  upstreamNameservers: |
    ["8.8.8.8", "8.8.4.4"]
kind: ConfigMap
The equivalent configuration in CoreDNS creates a Corefile:

- For stubDomains:

  abc.com:53 {
      errors
      cache 30
      forward . 1.2.3.4
  }
  my.cluster.local:53 {
      errors
      cache 30
      forward . 2.3.4.5
  }

- For upstreamNameservers:

  forward . 8.8.8.8 8.8.4.4
The complete Corefile with the default plugins:
.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    federation cluster.local {
        foo foo.feddomain.com
    }
    prometheus :9153
    forward . 8.8.8.8 8.8.4.4
    cache 30
}
abc.com:53 {
    errors
    cache 30
    forward . 1.2.3.4
}
my.cluster.local:53 {
    errors
    cache 30
    forward . 2.3.4.5
}
Migration to CoreDNS
To migrate from kube-dns to CoreDNS, a detailed blog article is available to help users adopt CoreDNS in place of kube-dns.
You can also migrate using the official CoreDNS deploy script.
What's next
4.2.16 - Debugging DNS Resolution
This page provides hints on diagnosing DNS problems.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
Your cluster must be configured to use the CoreDNS addon or its precursor, kube-dns.
Your Kubernetes server must be at or later than version v1.6. To check the version, enter kubectl version.
Create a simple Pod to use as a test environment

apiVersion: v1
kind: Pod
metadata:
  name: dnsutils
  namespace: default
spec:
  containers:
  - name: dnsutils
    image: k8s.gcr.io/e2e-test-images/jessie-dnsutils:1.3
    command:
    - sleep
    - "3600"
    imagePullPolicy: IfNotPresent
  restartPolicy: Always
That manifest creates a Pod in the default namespace. DNS name resolution for Services depends on the namespace of the Pod. For more information, review DNS for Services and Pods.
Use that manifest to create a Pod:
kubectl apply -f https://k8s.io/examples/admin/dns/dnsutils.yaml
pod/dnsutils created
…and verify its status:
kubectl get pods dnsutils
NAME READY STATUS RESTARTS AGE
dnsutils 1/1 Running 0 <some-time>
Once that Pod is running, you can exec nslookup in that environment.
If you see something like the following, DNS is working correctly.
kubectl exec -i -t dnsutils -- nslookup kubernetes.default
Server: 10.0.0.10
Address 1: 10.0.0.10
Name: kubernetes.default
Address 1: 10.0.0.1
If the nslookup command fails, check the following:
Check the local DNS configuration first
Take a look inside the resolv.conf file. (See Customizing DNS Service and Known issues below for more information)
kubectl exec -ti dnsutils -- cat /etc/resolv.conf
Verify that the search path and name server are set up like the following (note that search path may vary for different cloud providers):
search default.svc.cluster.local svc.cluster.local cluster.local google.internal c.gce_project_id.internal
nameserver 10.0.0.10
options ndots:5
Errors such as the following indicate a problem with the CoreDNS (or kube-dns) add-on or with associated Services:
kubectl exec -i -t dnsutils -- nslookup kubernetes.default
Server: 10.0.0.10
Address 1: 10.0.0.10
nslookup: can't resolve 'kubernetes.default'
or
kubectl exec -i -t dnsutils -- nslookup kubernetes.default
Server: 10.0.0.10
Address 1: 10.0.0.10 kube-dns.kube-system.svc.cluster.local
nslookup: can't resolve 'kubernetes.default'
Check if the DNS pod is running
Use the kubectl get pods command to verify that the DNS pod is running.
kubectl get pods --namespace=kube-system -l k8s-app=kube-dns
NAME READY STATUS RESTARTS AGE
...
coredns-7b96bf9f76-5hsxb 1/1 Running 0 1h
coredns-7b96bf9f76-mvmmt 1/1 Running 0 1h
...
The value for the label k8s-app is kube-dns for both CoreDNS and kube-dns deployments.
If you see that no CoreDNS Pod is running or that the Pod has failed/completed, the DNS add-on may not be deployed by default in your current environment and you will have to deploy it manually.
Check for errors in the DNS pod
Use the kubectl logs command to see logs for the DNS containers.
For CoreDNS:
kubectl logs --namespace=kube-system -l k8s-app=kube-dns
Here is an example of a healthy CoreDNS log:
.:53
2018/08/15 14:37:17 [INFO] CoreDNS-1.2.2
2018/08/15 14:37:17 [INFO] linux/amd64, go1.10.3, 2e322f6
CoreDNS-1.2.2
linux/amd64, go1.10.3, 2e322f6
2018/08/15 14:37:17 [INFO] plugin/reload: Running configuration MD5 = 24e6c59e83ce706f07bcc82c31b1ea1c
See if there are any suspicious or unexpected messages in the logs.
Is DNS service up?
Verify that the DNS service is up by using the kubectl get service
command.
kubectl get svc --namespace=kube-system
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
...
kube-dns ClusterIP 10.0.0.10 <none> 53/UDP,53/TCP 1h
...
The Service name is kube-dns for both CoreDNS and kube-dns deployments.
If you have created the Service, or if it should have been created by default but does not appear, see debugging Services for more information.
Are DNS endpoints exposed?
You can verify that DNS endpoints are exposed by using the kubectl get endpoints command.
kubectl get endpoints kube-dns --namespace=kube-system
NAME ENDPOINTS AGE
kube-dns 10.180.3.17:53,10.180.3.17:53 1h
If you do not see the endpoints, see the endpoints section in the debugging Services documentation.
For additional Kubernetes DNS examples, see the cluster-dns examples in the Kubernetes GitHub repository.
Are DNS queries being received/processed?
You can verify if queries are being received by CoreDNS by adding the log plugin to the CoreDNS configuration (aka Corefile).
The CoreDNS Corefile is held in a ConfigMap named coredns. To edit it, use the command:
kubectl -n kube-system edit configmap coredns
Then add log in the Corefile section per the example below:
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        log
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            upstream
            fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }
After saving the changes, it may take up to a minute or two for Kubernetes to propagate these changes to the CoreDNS pods.
Next, make some queries and view the logs per the sections above in this document. If CoreDNS pods are receiving the queries, you should see them in the logs.
Here is an example of a query in the log:
.:53
2018/08/15 14:37:15 [INFO] CoreDNS-1.2.0
2018/08/15 14:37:15 [INFO] linux/amd64, go1.10.3, 2e322f6
CoreDNS-1.2.0
linux/amd64, go1.10.3, 2e322f6
2018/09/07 15:29:04 [INFO] plugin/reload: Running configuration MD5 = 162475cdf272d8aa601e6fe67a6ad42f
2018/09/07 15:29:04 [INFO] Reloading complete
172.17.0.18:41675 - [07/Sep/2018:15:29:11 +0000] 59925 "A IN kubernetes.default.svc.cluster.local. udp 54 false 512" NOERROR qr,aa,rd,ra 106 0.000066649s
Are you in the right namespace for the service?
DNS queries that don't specify a namespace are limited to the pod's namespace.
If the namespace of the pod and service differ, the DNS query must include the namespace of the service.
This query is limited to the pod's namespace:
kubectl exec -i -t dnsutils -- nslookup <service-name>
This query specifies the namespace:
kubectl exec -i -t dnsutils -- nslookup <service-name>.<namespace>
To learn more about name resolution, see DNS for Services and Pods.
Known issues
Some Linux distributions (e.g. Ubuntu) use a local DNS resolver by default (systemd-resolved). systemd-resolved moves and replaces /etc/resolv.conf with a stub file that can cause a fatal forwarding loop when resolving names in upstream servers. This can be fixed manually by using the kubelet's --resolv-conf flag to point to the correct resolv.conf (with systemd-resolved, this is /run/systemd/resolve/resolv.conf).
kubeadm automatically detects systemd-resolved, and adjusts the kubelet flags accordingly.
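If you manage the kubelet through its configuration file rather than command-line flags, the equivalent setting is the resolvConf field; a minimal sketch, assuming a systemd-resolved host:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Point the kubelet at the real resolver file, not the systemd-resolved stub.
resolvConf: /run/systemd/resolve/resolv.conf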
Kubernetes installs do not configure the nodes' resolv.conf files to use the cluster DNS by default, because that process is inherently distribution-specific. This should probably be implemented eventually.
Linux's libc (a.k.a. glibc) has a limit of 3 DNS nameserver records by default. What's more, for glibc versions older than glibc-2.17-222 (for newer versions, see this issue), the allowed number of DNS search records was limited to 6 (see this bug from 2005). Kubernetes needs to consume 1 nameserver record and 3 search records. This means that if a local installation already uses 3 nameservers or uses more than 3 searches while your glibc version is in the affected list, some of those settings will be lost. To work around the DNS nameserver records limit, the node can run dnsmasq, which will provide more nameserver entries. You can also use the kubelet's --resolv-conf flag. To fix the DNS search records limit, consider upgrading your Linux distribution or upgrading to an unaffected version of glibc.
If you are using Alpine version 3.3 or earlier as your base image, DNS may not work properly due to a known issue with Alpine. See Kubernetes issue 30215 for more information.
What's next
4.2.17 - Declare Network Policy
This document helps you get started using the Kubernetes NetworkPolicy API to declare network policies that govern how pods communicate with each other.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
Your Kubernetes server must be at or later than version v1.8. To check the version, enter kubectl version.
Make sure you've configured a network provider with network policy support. There are a number of network providers that support NetworkPolicy, including:
Create an nginx deployment and expose it via a service
To see how Kubernetes network policy works, start off by creating an nginx Deployment.
kubectl create deployment nginx --image=nginx
deployment.apps/nginx created
Expose the Deployment through a Service called nginx.
kubectl expose deployment nginx --port=80
service/nginx exposed
The above commands create a Deployment with an nginx Pod and expose the Deployment through a Service named nginx. The nginx Pod and Deployment are found in the default namespace.
kubectl get svc,pod
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/kubernetes 10.100.0.1 <none> 443/TCP 46m
service/nginx 10.100.0.16 <none> 80/TCP 33s
NAME READY STATUS RESTARTS AGE
pod/nginx-701339712-e0qfq 1/1 Running 0 35s
Test the service by accessing it from another Pod
You should be able to access the new nginx service from other Pods. To access the nginx Service from another Pod in the default namespace, start a busybox container:
kubectl run busybox --rm -ti --image=busybox:1.28 -- /bin/sh
In your shell, run the following command:
wget --spider --timeout=1 nginx
Connecting to nginx (10.100.0.16:80)
remote file exists
Limit access to the nginx service
To limit the access to the nginx service so that only Pods with the label access: true can query it, create a NetworkPolicy object as follows:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: access-nginx
spec:
  podSelector:
    matchLabels:
      app: nginx
  ingress:
  - from:
    - podSelector:
        matchLabels:
          access: "true"
The name of a NetworkPolicy object must be a valid DNS subdomain name.
The policy includes a podSelector which selects the grouping of Pods to which the policy applies. You can see this policy selects Pods with the label app=nginx. The label was automatically added to the Pod in the nginx Deployment. An empty podSelector selects all pods in the namespace.
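Building on that last point, an empty podSelector can be used to apply a policy to every Pod in the namespace. A sketch of a default deny-all-ingress policy that relies on this behavior (the policy name is an example only):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
spec:
  podSelector: {}   # empty selector: the policy applies to all Pods in the namespace
  policyTypes:
  - Ingress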
Assign the policy to the service
Use kubectl to create a NetworkPolicy from the above nginx-policy.yaml file:
kubectl apply -f https://k8s.io/examples/service/networking/nginx-policy.yaml
networkpolicy.networking.k8s.io/access-nginx created
Test access to the service when access label is not defined
When you attempt to access the nginx Service from a Pod without the correct labels, the request times out:
kubectl run busybox --rm -ti --image=busybox:1.28 -- /bin/sh
In your shell, run the command:
wget --spider --timeout=1 nginx
Connecting to nginx (10.100.0.16:80)
wget: download timed out
Define access label and test again
You can create a Pod with the correct labels to see that the request is allowed:
kubectl run busybox --rm -ti --labels="access=true" --image=busybox:1.28 -- /bin/sh
In your shell, run the command:
wget --spider --timeout=1 nginx
Connecting to nginx (10.100.0.16:80)
remote file exists
4.2.18 - Developing Cloud Controller Manager
Kubernetes v1.11 [beta]
The cloud-controller-manager is a Kubernetes control plane component that embeds cloud-specific control logic. The cloud controller manager lets you link your cluster into your cloud provider's API, and separates out the components that interact with that cloud platform from components that only interact with your cluster.
By decoupling the interoperability logic between Kubernetes and the underlying cloud infrastructure, the cloud-controller-manager component enables cloud providers to release features at a different pace compared to the main Kubernetes project.
Background
Since cloud providers develop and release at a different pace compared to the Kubernetes project, abstracting the provider-specific code to the cloud-controller-manager binary allows cloud vendors to evolve independently from the core Kubernetes code.
The Kubernetes project provides skeleton cloud-controller-manager code with Go interfaces to allow you (or your cloud provider) to plug in your own implementations. This means that a cloud provider can implement a cloud-controller-manager by importing packages from Kubernetes core; each cloud provider will register their own code by calling cloudprovider.RegisterCloudProvider to update a global variable of available cloud providers.
Developing
Out of tree
To build an out-of-tree cloud-controller-manager for your cloud:
1. Create a go package with an implementation that satisfies cloudprovider.Interface.
2. Use main.go in cloud-controller-manager from Kubernetes core as a template for your main.go. As mentioned above, the only difference should be the cloud package that will be imported.
3. Import your cloud package in main.go, and ensure your package has an init block to run cloudprovider.RegisterCloudProvider.
Many cloud providers publish their controller manager code as open source. If you are creating a new cloud-controller-manager from scratch, you could take an existing out-of-tree cloud controller manager as your starting point.
In tree
For in-tree cloud providers, you can run the in-tree cloud controller manager as a DaemonSet in your cluster. See Cloud Controller Manager Administration for more details.
4.2.19 - Enable Or Disable A Kubernetes API
This page shows how to enable or disable an API version from your cluster's control plane.
Specific API versions can be turned on or off by passing --runtime-config=api/<version> as a command line argument to the API server. The values for this argument are a comma-separated list of API versions. Later values override earlier values.
The runtime-config command line argument also supports 2 special keys:
- api/all, representing all known APIs
- api/legacy, representing only legacy APIs. Legacy APIs are any APIs that have been explicitly deprecated.
For example, to turn off all API versions except v1, pass --runtime-config=api/all=false,api/v1=true to the kube-apiserver.
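On clusters where the API server runs as a static Pod (as kubeadm configures, for example), the flag is added to the manifest; a sketch, assuming the conventional manifest path and omitting all other flags:

# /etc/kubernetes/manifests/kube-apiserver.yaml (excerpt)
spec:
  containers:
  - name: kube-apiserver
    command:
    - kube-apiserver
    - --runtime-config=api/all=false,api/v1=true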
What's next
Read the full documentation
for the kube-apiserver
component.
4.2.20 - Enabling Service Topology
Kubernetes v1.21 [deprecated]
This feature, specifically the alpha topologyKeys field, is deprecated since Kubernetes v1.21. Topology Aware Hints, introduced in Kubernetes v1.21, provide similar functionality.
Service Topology enables a Service to route traffic based upon the Node topology of the cluster. For example, a service can specify that traffic be preferentially routed to endpoints that are on the same Node as the client, or in the same availability zone.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
Your Kubernetes server must be at or later than version 1.17. To check the version, enter kubectl version.
The following prerequisites are needed in order to enable topology aware service routing:
- Kubernetes v1.17 or later
- Configure kube-proxy to run in iptables mode or IPVS mode
Enable Service Topology
Kubernetes v1.21 [deprecated]
To enable service topology, enable the ServiceTopology feature gate for all Kubernetes components:

--feature-gates="ServiceTopology=true"
What's next
- Read about Topology Aware Hints, the replacement for the topologyKeys field.
- Read about EndpointSlices.
- Read about the Service Topology concept.
- Read Connecting Applications with Services.
4.2.21 - Encrypting Secret Data at Rest
This page shows how to enable and configure encryption of secret data at rest.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
Your Kubernetes server must be at or later than version 1.13. To check the version, enter kubectl version.
etcd v3.0 or later is required.
Configuration and determining whether encryption at rest is already enabled
The kube-apiserver process accepts an argument --encryption-provider-config that controls how API data is encrypted in etcd. The configuration is provided as an API named EncryptionConfiguration. An example configuration is provided below.
For high-availability configurations (with two or more control plane nodes), the encryption configuration must be consistent across them; otherwise, the kube-apiserver component cannot decrypt data stored in etcd.
Understanding the encryption at rest configuration
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - identity: {}
      - aesgcm:
          keys:
            - name: key1
              secret: c2VjcmV0IGlzIHNlY3VyZQ==
            - name: key2
              secret: dGhpcyBpcyBwYXNzd29yZA==
      - aescbc:
          keys:
            - name: key1
              secret: c2VjcmV0IGlzIHNlY3VyZQ==
            - name: key2
              secret: dGhpcyBpcyBwYXNzd29yZA==
      - secretbox:
          keys:
            - name: key1
              secret: YWJjZGVmZ2hpamtsbW5vcHFyc3R1dnd4eXoxMjM0NTY=
Each resources array item is a separate config and contains a complete configuration. The resources.resources field is an array of Kubernetes resource names (resource or resource.group) that should be encrypted. The providers array is an ordered list of the possible encryption providers. Only one provider type may be specified per entry (identity or aescbc may be provided, but not both in the same item).
The first provider in the list is used to encrypt resources written into the storage. When reading
resources from storage, each provider that matches the stored data attempts in order to decrypt the
data. If no provider can read the stored data due to a mismatch in format or secret key, an error
is returned which prevents clients from accessing that resource.
For more detailed information about the EncryptionConfiguration struct, please refer to the encryption configuration API.
Providers:

Name | Encryption | Strength | Speed | Key Length | Other Considerations
---|---|---|---|---|---
identity | None | N/A | N/A | N/A | Resources written as-is without encryption. When set as the first provider, the resource will be decrypted as new values are written.
secretbox | XSalsa20 and Poly1305 | Strong | Faster | 32-byte | A newer standard and may not be considered acceptable in environments that require high levels of review.
aesgcm | AES-GCM with random nonce | Must be rotated every 200k writes | Fastest | 16, 24, or 32-byte | Is not recommended for use except when an automated key rotation scheme is implemented.
aescbc | AES-CBC with PKCS#7 padding | Weak | Fast | 32-byte | Not recommended due to CBC's vulnerability to padding oracle attacks.
kms | Uses envelope encryption scheme: Data is encrypted by data encryption keys (DEKs) using AES-CBC with PKCS#7 padding, DEKs are encrypted by key encryption keys (KEKs) according to configuration in Key Management Service (KMS) | Strongest | Fast | 32-bytes | The recommended choice for using a third party tool for key management. Simplifies key rotation, with a new DEK generated for each encryption, and KEK rotation controlled by the user. Configure the KMS provider.
Each provider supports multiple keys - the keys are tried in order for decryption, and if the provider is the first provider, the first key is used for encryption.
Storing a raw encryption key in the EncryptionConfiguration only moderately improves your security posture compared to no encryption; for additional security, consider using the kms provider.
By default, the identity provider is used to protect Secrets in etcd, which provides no encryption. EncryptionConfiguration was introduced to encrypt Secrets locally, with a locally managed key.
Encrypting Secrets with a locally managed key protects against an etcd compromise, but it fails to protect against a host compromise. Since the encryption keys are stored on the host in the EncryptionConfiguration YAML file, a skilled attacker can access that file and extract the encryption keys.
Envelope encryption creates dependence on a separate key, not stored in Kubernetes. In this case, an attacker would need to compromise etcd, the kube-apiserver, and the third-party KMS provider to retrieve the plaintext values, providing a higher level of security than locally stored encryption keys.
Encrypting your data
Create a new encryption config file:
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - aescbc:
          keys:
            - name: key1
              secret: <BASE 64 ENCODED SECRET>
      - identity: {}
To configure encryption using this config file, perform the following steps:

1. Generate a 32-byte random key and base64 encode it. If you're on Linux or macOS, run the following command:

   head -c 32 /dev/urandom | base64

2. Place that value in the secret field of the EncryptionConfiguration struct.

3. Set the --encryption-provider-config flag on the kube-apiserver to point to the location of the config file.

4. Restart your API server.

Your config file contains keys that can decrypt the content in etcd, so you must properly restrict its permissions so that only the user who runs the kube-apiserver can read it.
Verifying that data is encrypted
Data is encrypted when written to etcd. After restarting your kube-apiserver, any newly created or updated Secret should be encrypted when stored. To check this, you can use the etcdctl command line program to retrieve the contents of your Secret.
1. Create a new Secret called secret1 in the default namespace:

   kubectl create secret generic secret1 -n default --from-literal=mykey=mydata

2. Using the etcdctl command line, read that Secret out of etcd:

   ETCDCTL_API=3 etcdctl get /registry/secrets/default/secret1 [...] | hexdump -C

   where [...] must be the additional arguments for connecting to the etcd server.

3. Verify the stored Secret is prefixed with k8s:enc:aescbc:v1:, which indicates the aescbc provider has encrypted the resulting data.

4. Verify the Secret is correctly decrypted when retrieved via the API:

   kubectl describe secret secret1 -n default

   The output should contain mykey: bXlkYXRh, with the contents of mydata encoded; see decoding a Secret to completely decode the Secret.
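As an additional quick check (a sketch using standard kubectl output filtering and base64 tooling), you can decode the stored value directly and confirm it round-trips:

kubectl get secret secret1 -n default -o jsonpath='{.data.mykey}' | base64 --decode
# prints: mydata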
Ensure all Secrets are encrypted
Since Secrets are encrypted on write, performing an update on a Secret will encrypt that content.
kubectl get secrets --all-namespaces -o json | kubectl replace -f -
The command above reads all Secrets and then updates them to apply server side encryption.
Rotating a decryption key
Rotating a key without incurring downtime requires a multi-step operation, especially in the presence of a highly-available deployment where multiple kube-apiserver processes are running.
1. Generate a new key and add it as the second key entry for the current provider on all servers.
2. Restart all kube-apiserver processes to ensure each server can decrypt using the new key.
3. Make the new key the first entry in the keys array so that it is used for encryption in the config.
4. Restart all kube-apiserver processes to ensure each server now encrypts using the new key.
5. Run kubectl get secrets --all-namespaces -o json | kubectl replace -f - to encrypt all existing Secrets with the new key.
6. Remove the old decryption key from the config after you have backed up etcd with the new key in use and updated all Secrets.
When running a single kube-apiserver instance, step 2 may be skipped.
Decrypting all data
To disable encryption at rest, place the identity provider as the first entry in the config and restart all kube-apiserver processes.
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - identity: {}
      - aescbc:
          keys:
            - name: key1
              secret: <BASE 64 ENCODED SECRET>
Then run the following command to force decrypt all Secrets:
kubectl get secrets --all-namespaces -o json | kubectl replace -f -
What's next
- Learn more about the EncryptionConfiguration configuration API (v1).
4.2.22 - Guaranteed Scheduling For Critical Add-On Pods
Kubernetes core components such as the API server, scheduler, and controller-manager run on a control plane node. However, add-ons must run on a regular cluster node. Some of these add-ons are critical to a fully functional cluster, such as metrics-server, DNS, and UI. A cluster may stop working properly if a critical add-on is evicted (either manually or as a side effect of another operation like upgrade) and becomes pending (for example when the cluster is highly utilized and either there are other pending pods that schedule into the space vacated by the evicted critical add-on pod or the amount of resources available on the node changed for some other reason).
Note that marking a pod as critical is not meant to prevent evictions entirely; it only prevents the pod from becoming permanently unavailable. A static Pod marked as critical can't be evicted. However, non-static Pods marked as critical are always rescheduled.
Marking pod as critical
To mark a Pod as critical, set priorityClassName for that Pod to system-cluster-critical or system-node-critical. system-node-critical is the highest available priority, even higher than system-cluster-critical.
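A minimal sketch of a critical add-on Pod (the name and image are examples only; by default these priority classes are reserved for Pods in the kube-system namespace):

apiVersion: v1
kind: Pod
metadata:
  name: critical-addon
  namespace: kube-system
spec:
  priorityClassName: system-cluster-critical  # marks this Pod as cluster-critical
  containers:
  - name: addon
    image: k8s.gcr.io/pause:3.5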
4.2.23 - IP Masquerade Agent User Guide
This page shows how to configure and enable the ip-masq-agent.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
To check the version, enter kubectl version.
IP Masquerade Agent User Guide
The ip-masq-agent configures iptables rules to hide a pod's IP address behind the cluster node's IP address. This is typically done when sending traffic to destinations outside the cluster's pod CIDR range.
Key Terms
- NAT (Network Address Translation): a method of remapping one IP address to another by modifying either the source and/or destination address information in the IP header. Typically performed by a device doing IP routing.
- Masquerading: a form of NAT that is typically used to perform a many-to-one address translation, where multiple source IP addresses are masked behind a single address, which is typically the device doing the IP routing. In Kubernetes this is the Node's IP address.
- CIDR (Classless Inter-Domain Routing): based on variable-length subnet masking, CIDR allows specifying arbitrary-length prefixes. CIDR introduced a new method of representation for IP addresses, now commonly known as CIDR notation, in which an address or routing prefix is written with a suffix indicating the number of bits of the prefix, such as 192.168.2.0/24.
- Link Local: a link-local address is a network address that is valid only for communications within the network segment or the broadcast domain that the host is connected to. Link-local addresses for IPv4 are defined in the address block 169.254.0.0/16 in CIDR notation.
The ip-masq-agent configures iptables rules to handle masquerading node/pod IP addresses when sending traffic to destinations outside the cluster node's IP and the Cluster IP range. This essentially hides pod IP addresses behind the cluster node's IP address. In some environments, traffic to "external" addresses must come from a known machine address. For example, in Google Cloud, any traffic to the internet must come from a VM's IP. When containers are used, as in Google Kubernetes Engine, the Pod IP will be rejected for egress. To avoid this, we must hide the Pod IP behind the VM's own IP address - generally known as "masquerade". By default, the agent is configured to treat the three private IP ranges specified by RFC 1918 as non-masquerade CIDR. These ranges are 10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16. The agent will also treat link-local (169.254.0.0/16) as a non-masquerade CIDR by default. The agent is configured to reload its configuration from the location /etc/config/ip-masq-agent every 60 seconds, which is also configurable.
The agent configuration file must be written in YAML or JSON syntax, and may contain three optional keys:
- nonMasqueradeCIDRs: A list of strings in CIDR notation that specify the non-masquerade ranges.
- masqLinkLocal: A Boolean (true/false) which indicates whether to masquerade traffic to the link-local prefix 169.254.0.0/16. False by default.
- resyncInterval: A time interval at which the agent attempts to reload config from disk. For example: '30s', where 's' means seconds, 'ms' means milliseconds.
Traffic to the 10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16 ranges will NOT be masqueraded. Any other traffic (assumed to be internet) will be masqueraded. An example of a local destination from a pod could be its Node's IP address, as well as another node's address or one of the IP addresses in the Cluster's IP range. Any other traffic will be masqueraded by default. The below entries show the default set of rules that are applied by the ip-masq-agent:
iptables -t nat -L IP-MASQ-AGENT
RETURN all -- anywhere 169.254.0.0/16 /* ip-masq-agent: cluster-local traffic should not be subject to MASQUERADE */ ADDRTYPE match dst-type !LOCAL
RETURN all -- anywhere 10.0.0.0/8 /* ip-masq-agent: cluster-local traffic should not be subject to MASQUERADE */ ADDRTYPE match dst-type !LOCAL
RETURN all -- anywhere 172.16.0.0/12 /* ip-masq-agent: cluster-local traffic should not be subject to MASQUERADE */ ADDRTYPE match dst-type !LOCAL
RETURN all -- anywhere 192.168.0.0/16 /* ip-masq-agent: cluster-local traffic should not be subject to MASQUERADE */ ADDRTYPE match dst-type !LOCAL
MASQUERADE all -- anywhere anywhere /* ip-masq-agent: outbound traffic should be subject to MASQUERADE (this match must come after cluster-local CIDR matches) */ ADDRTYPE match dst-type !LOCAL
By default, in GCE/Google Kubernetes Engine, if network policy is enabled or you are using a cluster CIDR not in the 10.0.0.0/8 range, the ip-masq-agent will run in your cluster. If you are running in another environment, you can add the ip-masq-agent DaemonSet to your cluster.
Create an ip-masq-agent
To create an ip-masq-agent, run the following kubectl command:
kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/ip-masq-agent/master/ip-masq-agent.yaml
You must also apply the appropriate node label to any nodes in your cluster that you want the agent to run on.
kubectl label nodes my-node node.kubernetes.io/masq-agent-ds-ready=true
More information can be found in the ip-masq-agent documentation.
In most cases, the default set of rules should be sufficient; however, if this is not the case for your cluster, you can create and apply a ConfigMap to customize the IP ranges that are affected. For example, to allow only 10.0.0.0/8 to be considered by the ip-masq-agent, you can create the following ConfigMap in a file called "config".
It is important that the file is called config since, by default, that will be used as the key for lookup by the ip-masq-agent:
nonMasqueradeCIDRs:
- 10.0.0.0/8
resyncInterval: 60s
Run the following command to add the config map to your cluster:
kubectl create configmap ip-masq-agent --from-file=config --namespace=kube-system
This will update a file located at /etc/config/ip-masq-agent, which is periodically checked every resyncInterval and applied to the cluster node.
After the resync interval has expired, you should see the iptables rules reflect your changes:
iptables -t nat -L IP-MASQ-AGENT
Chain IP-MASQ-AGENT (1 references)
target prot opt source destination
RETURN all -- anywhere 169.254.0.0/16 /* ip-masq-agent: cluster-local traffic should not be subject to MASQUERADE */ ADDRTYPE match dst-type !LOCAL
RETURN all -- anywhere 10.0.0.0/8 /* ip-masq-agent: cluster-local traffic should not be subject to MASQUERADE */ ADDRTYPE match dst-type !LOCAL
MASQUERADE all -- anywhere anywhere /* ip-masq-agent: outbound traffic should be subject to MASQUERADE (this match must come after cluster-local CIDR matches) */ ADDRTYPE match dst-type !LOCAL
By default, the link-local range (169.254.0.0/16) is also handled by the ip-masq agent, which sets up the appropriate iptables rules. To have the ip-masq-agent masquerade traffic to the link-local range as well (rather than treating it as non-masquerade), set masqLinkLocal to true in the ConfigMap.
nonMasqueradeCIDRs:
- 10.0.0.0/8
resyncInterval: 60s
masqLinkLocal: true
4.2.24 - Limit Storage Consumption
This example demonstrates how to limit the amount of storage consumed in a namespace.
The following resources are used in the demonstration: ResourceQuota, LimitRange, and PersistentVolumeClaim.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
To check the version, enter kubectl version.
Scenario: Limiting Storage Consumption
The cluster-admin is operating a cluster on behalf of a user population and the admin wants to control how much storage a single namespace can consume in order to control cost.
The admin would like to limit:
- The number of persistent volume claims in a namespace
- The amount of storage each claim can request
- The amount of cumulative storage the namespace can have
LimitRange to limit requests for storage
Adding a LimitRange to a namespace enforces storage request sizes to a minimum and maximum. Storage is requested via PersistentVolumeClaim. The admission controller that enforces limit ranges will reject any PVC that is above or below the values set by the admin.
In this example, a PVC requesting 10Gi of storage would be rejected because it exceeds the 2Gi max.
apiVersion: v1
kind: LimitRange
metadata:
  name: storagelimits
spec:
  limits:
  - type: PersistentVolumeClaim
    max:
      storage: 2Gi
    min:
      storage: 1Gi
Minimum storage requests are used when the underlying storage provider requires certain minimums. For example, AWS EBS volumes have a 1Gi minimum requirement.
StorageQuota to limit PVC count and cumulative storage capacity
Admins can limit the number of PVCs in a namespace as well as the cumulative capacity of those PVCs. New PVCs that exceed either maximum value will be rejected.
In this example, a 6th PVC in the namespace would be rejected because it exceeds the maximum count of 5. Likewise, a 5Gi cumulative quota, combined with the 2Gi per-claim maximum above, cannot accommodate 3 PVCs of 2Gi each: that would be 6Gi requested in a namespace capped at 5Gi.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: storagequota
spec:
  hard:
    persistentvolumeclaims: "5"
    requests.storage: "5Gi"
Summary
A limit range can put a ceiling on how much storage is requested, while a resource quota can effectively cap the storage consumed by a namespace through claim counts and cumulative storage capacity. This allows a cluster-admin to plan their cluster's storage budget without risk of any one project going over its allotment.
4.2.25 - Migrate Replicated Control Plane To Use Cloud Controller Manager
Kubernetes v1.22 [beta]
The cloud-controller-manager is a Kubernetes control plane component that embeds cloud-specific control logic. The cloud controller manager lets you link your cluster into your cloud provider's API, and separates out the components that interact with that cloud platform from components that only interact with your cluster.
By decoupling the interoperability logic between Kubernetes and the underlying cloud infrastructure, the cloud-controller-manager component enables cloud providers to release features at a different pace compared to the main Kubernetes project.
Background
As part of the cloud provider extraction effort, all cloud specific controllers must be moved out of the kube-controller-manager. All existing clusters that run cloud controllers in the kube-controller-manager must migrate to instead run the controllers in a cloud provider specific cloud-controller-manager.
Leader Migration provides a mechanism in which HA clusters can safely migrate "cloud specific" controllers between the kube-controller-manager and the cloud-controller-manager via a shared resource lock between the two components while upgrading the replicated control plane. For a single-node control plane, or if unavailability of controller managers can be tolerated during the upgrade, Leader Migration is not needed and this guide can be ignored.
Leader Migration can be enabled by setting --enable-leader-migration on kube-controller-manager or cloud-controller-manager. Leader Migration only applies during the upgrade and can be safely disabled or left enabled after the upgrade is complete.
This guide walks you through the manual process of upgrading the control plane from kube-controller-manager with built-in cloud provider to running both kube-controller-manager and cloud-controller-manager. If you use a tool to administer the cluster, please refer to the documentation of the tool and the cloud provider for more details.
Before you begin
It is assumed that the control plane is running Kubernetes version N and is to be upgraded to version N + 1. Although it is possible to migrate within the same version, ideally the migration should be performed as part of an upgrade so that changes of configuration can be aligned to each release. The exact versions of N and N + 1 depend on each cloud provider. For example, if a cloud provider builds a cloud-controller-manager to work with Kubernetes 1.22, then N can be 1.21 and N + 1 can be 1.22.
The control plane nodes should run kube-controller-manager with Leader Election enabled through --leader-elect=true. As of version N, an in-tree cloud provider must be set with the --cloud-provider flag, and cloud-controller-manager should not yet be deployed.
The out-of-tree cloud provider must have built a cloud-controller-manager with the Leader Migration implementation. If the cloud provider imports k8s.io/cloud-provider and k8s.io/controller-manager of version v0.21.0 or later, Leader Migration will be available. However, for versions before v0.22.0, Leader Migration is alpha and requires the feature gate ControllerManagerLeaderMigration to be enabled.
This guide assumes that the kubelet of each control plane node starts kube-controller-manager and cloud-controller-manager as static pods defined by their manifests. If the components run in a different setting, please adjust the steps accordingly.
For authorization, this guide assumes that the cluster uses RBAC. If another authorization mode grants permissions to the kube-controller-manager and cloud-controller-manager components, please grant the needed access in a way that matches the mode.
Grant access to Migration Lease
The default permissions of the controller manager allow only access to its main Lease. In order for the migration to work, access to another Lease is required.
You can grant kube-controller-manager full access to the leases API by modifying the system::leader-locking-kube-controller-manager role. This task guide assumes that the name of the migration lease is cloud-provider-extraction-migration.
kubectl patch -n kube-system role 'system::leader-locking-kube-controller-manager' -p '{"rules": [ {"apiGroups":[ "coordination.k8s.io"], "resources": ["leases"], "resourceNames": ["cloud-provider-extraction-migration"], "verbs": ["create", "list", "get", "update"] } ]}' --type=merge
Do the same to the system::leader-locking-cloud-controller-manager role.
kubectl patch -n kube-system role 'system::leader-locking-cloud-controller-manager' -p '{"rules": [ {"apiGroups":[ "coordination.k8s.io"], "resources": ["leases"], "resourceNames": ["cloud-provider-extraction-migration"], "verbs": ["create", "list", "get", "update"] } ]}' --type=merge
Initial Leader Migration configuration
Leader Migration optionally takes a configuration file representing the state of controller-to-manager assignment. At this moment, with in-tree cloud provider, kube-controller-manager runs route, service, and cloud-node-lifecycle. The following example configuration shows the assignment.
Leader Migration can be enabled without a configuration. Please see Default Configuration for details.
kind: LeaderMigrationConfiguration
apiVersion: controllermanager.config.k8s.io/v1beta1
leaderName: cloud-provider-extraction-migration
resourceLock: leases
controllerLeaders:
  - name: route
    component: kube-controller-manager
  - name: service
    component: kube-controller-manager
  - name: cloud-node-lifecycle
    component: kube-controller-manager
On each control plane node, save the content to /etc/leadermigration.conf, and update the manifest of kube-controller-manager so that the file is mounted inside the container at the same location. Also, update the same manifest to add the following arguments (a manifest sketch follows this list):
- --enable-leader-migration to enable Leader Migration on the controller manager
- --leader-migration-config=/etc/leadermigration.conf to set the configuration file
Restart kube-controller-manager on each node. At this moment, kube-controller-manager has Leader Migration enabled and is ready for the migration.
Deploy Cloud Controller Manager
In version N + 1, the desired state of controller-to-manager assignment can be represented by a new configuration file, shown as follows. Please note the component field of each controllerLeaders changing from kube-controller-manager to cloud-controller-manager.
kind: LeaderMigrationConfiguration
apiVersion: controllermanager.config.k8s.io/v1beta1
leaderName: cloud-provider-extraction-migration
resourceLock: leases
controllerLeaders:
  - name: route
    component: cloud-controller-manager
  - name: service
    component: cloud-controller-manager
  - name: cloud-node-lifecycle
    component: cloud-controller-manager
When creating control plane nodes of version N + 1, the content should be deployed to /etc/leadermigration.conf. The manifest of cloud-controller-manager should be updated to mount the configuration file in the same manner as kube-controller-manager of version N. Similarly, add --feature-gates=ControllerManagerLeaderMigration=true, --enable-leader-migration, and --leader-migration-config=/etc/leadermigration.conf to the arguments of cloud-controller-manager.
Create a new control plane node of version N + 1 with the updated cloud-controller-manager manifest, and with the --cloud-provider flag unset for kube-controller-manager. kube-controller-manager of version N + 1 MUST NOT have Leader Migration enabled because, with an external cloud provider, it does not run the migrated controllers anymore and thus it is not involved in the migration.
Please refer to Cloud Controller Manager Administration for more detail on how to deploy cloud-controller-manager.
Upgrade Control Plane
The control plane now contains nodes of both version N and N + 1. The nodes of version N run kube-controller-manager only, and those of version N + 1 run both kube-controller-manager and cloud-controller-manager. The migrated controllers, as specified in the configuration, are running under either kube-controller-manager of version N or cloud-controller-manager of version N + 1 depending on which controller manager holds the migration lease. No controller will ever be running under both controller managers at any time.
In a rolling manner, create a new control plane node of version N + 1 and bring down one of version N until the control plane contains only nodes of version N + 1.
If a rollback from version N + 1 to N is required, add nodes of version N with Leader Migration enabled for kube-controller-manager back to the control plane, replacing one of version N + 1 each time until there are only nodes of version N.
(Optional) Disable Leader Migration
Now that the control plane has been upgraded to run both kube-controller-manager and cloud-controller-manager of version N + 1, Leader Migration has finished its job and can be safely disabled to save one Lease resource. It is safe to re-enable Leader Migration for the rollback in the future.
In a rolling manner, update the manifest of cloud-controller-manager to unset both the --enable-leader-migration and --leader-migration-config= flags, remove the mount of /etc/leadermigration.conf, and finally remove /etc/leadermigration.conf. To re-enable Leader Migration, recreate the configuration file and add its mount and the flags that enable Leader Migration back to cloud-controller-manager.
Default Configuration
Starting with Kubernetes 1.22, Leader Migration provides a default configuration suitable for the default controller-to-manager assignment. The default configuration can be enabled by setting --enable-leader-migration without --leader-migration-config=.
For kube-controller-manager and cloud-controller-manager, if there are no flags that enable any in-tree cloud provider or change ownership of controllers, the default configuration can be used to avoid manual creation of the configuration file.
What's next
- Read the Controller Manager Leader Migration enhancement proposal
4.2.26 - Namespaces Walkthrough
Kubernetes namespaces help different projects, teams, or customers to share a Kubernetes cluster.
They do this by providing the following:
- A scope for Names.
- A mechanism to attach authorization and policy to a subsection of the cluster.
Use of multiple namespaces is optional.
This example demonstrates how to use Kubernetes namespaces to subdivide your cluster.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
To check the version, enterkubectl version
.
Prerequisites
This example assumes the following:
- You have an existing Kubernetes cluster.
- You have a basic understanding of Kubernetes Pods, Services, and Deployments.
Understand the default namespace
By default, a Kubernetes cluster will instantiate a default namespace when provisioning the cluster to hold the default set of Pods, Services, and Deployments used by the cluster.
Assuming you have a fresh cluster, you can inspect the available namespaces by doing the following:
kubectl get namespaces
NAME STATUS AGE
default Active 13m
Create new namespaces
For this exercise, we will create two additional Kubernetes namespaces to hold our content.
Let's imagine a scenario where an organization is using a shared Kubernetes cluster for development and production use cases.
The development team would like to maintain a space in the cluster where they can get a view on the list of Pods, Services, and Deployments they use to build and run their application. In this space, Kubernetes resources come and go, and the restrictions on who can or cannot modify resources are relaxed to enable agile development.
The operations team would like to maintain a space in the cluster where they can enforce strict procedures on who can or cannot manipulate the set of Pods, Services, and Deployments that run the production site.
One pattern this organization could follow is to partition the Kubernetes cluster into two namespaces: development
and production
.
Let's create two new namespaces to hold our work.
Use the file namespace-dev.json, which describes a development namespace:
{
  "apiVersion": "v1",
  "kind": "Namespace",
  "metadata": {
    "name": "development",
    "labels": {
      "name": "development"
    }
  }
}
Create the development namespace using kubectl.
kubectl create -f https://k8s.io/examples/admin/namespace-dev.json
Save the following contents into the file namespace-prod.json, which describes a production namespace:
{
  "apiVersion": "v1",
  "kind": "Namespace",
  "metadata": {
    "name": "production",
    "labels": {
      "name": "production"
    }
  }
}
And then let's create the production namespace using kubectl.
kubectl create -f https://k8s.io/examples/admin/namespace-prod.json
To be sure things are right, let's list all of the namespaces in our cluster.
kubectl get namespaces --show-labels
NAME STATUS AGE LABELS
default Active 32m <none>
development Active 29s name=development
production Active 23s name=production
Create pods in each namespace
A Kubernetes namespace provides the scope for Pods, Services, and Deployments in the cluster.
Users interacting with one namespace do not see the content in another namespace.
To demonstrate this, let's spin up a simple Deployment and Pods in the development namespace.
First, check the current context:
kubectl config view
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: REDACTED
    server: https://130.211.122.180
  name: lithe-cocoa-92103_kubernetes
contexts:
- context:
    cluster: lithe-cocoa-92103_kubernetes
    user: lithe-cocoa-92103_kubernetes
  name: lithe-cocoa-92103_kubernetes
current-context: lithe-cocoa-92103_kubernetes
kind: Config
preferences: {}
users:
- name: lithe-cocoa-92103_kubernetes
  user:
    client-certificate-data: REDACTED
    client-key-data: REDACTED
    token: 65rZW78y8HbwXXtSXuUw9DbP4FLjHi4b
- name: lithe-cocoa-92103_kubernetes-basic-auth
  user:
    password: h5M0FtUUIflBSdI7
    username: admin
kubectl config current-context
lithe-cocoa-92103_kubernetes
The next step is to define a context for the kubectl client to work in each namespace. The value of "cluster" and "user" fields are copied from the current context.
kubectl config set-context dev --namespace=development \
--cluster=lithe-cocoa-92103_kubernetes \
--user=lithe-cocoa-92103_kubernetes
kubectl config set-context prod --namespace=production \
--cluster=lithe-cocoa-92103_kubernetes \
--user=lithe-cocoa-92103_kubernetes
By default, the above commands add two contexts that are saved into the file .kube/config. You can now view the contexts and alternate between the two new request contexts depending on which namespace you wish to work against.
To view the new contexts:
kubectl config view
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: REDACTED
    server: https://130.211.122.180
  name: lithe-cocoa-92103_kubernetes
contexts:
- context:
    cluster: lithe-cocoa-92103_kubernetes
    user: lithe-cocoa-92103_kubernetes
  name: lithe-cocoa-92103_kubernetes
- context:
    cluster: lithe-cocoa-92103_kubernetes
    namespace: development
    user: lithe-cocoa-92103_kubernetes
  name: dev
- context:
    cluster: lithe-cocoa-92103_kubernetes
    namespace: production
    user: lithe-cocoa-92103_kubernetes
  name: prod
current-context: lithe-cocoa-92103_kubernetes
kind: Config
preferences: {}
users:
- name: lithe-cocoa-92103_kubernetes
  user:
    client-certificate-data: REDACTED
    client-key-data: REDACTED
    token: 65rZW78y8HbwXXtSXuUw9DbP4FLjHi4b
- name: lithe-cocoa-92103_kubernetes-basic-auth
  user:
    password: h5M0FtUUIflBSdI7
    username: admin
Let's switch to operate in the development namespace.
kubectl config use-context dev
You can verify your current context by doing the following:
kubectl config current-context
dev
At this point, all requests we make to the Kubernetes cluster from the command line are scoped to the development namespace.
Let's create some contents.
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: snowflake
  name: snowflake
spec:
  replicas: 2
  selector:
    matchLabels:
      app: snowflake
  template:
    metadata:
      labels:
        app: snowflake
    spec:
      containers:
      - image: k8s.gcr.io/serve_hostname
        imagePullPolicy: Always
        name: snowflake
Apply the manifest to create a Deployment
kubectl apply -f https://k8s.io/examples/admin/snowflake-deployment.yaml
We have created a Deployment with two replicas running a Pod called snowflake with a basic container that serves the hostname.
kubectl get deployment
NAME READY UP-TO-DATE AVAILABLE AGE
snowflake 2/2 2 2 2m
kubectl get pods -l app=snowflake
NAME READY STATUS RESTARTS AGE
snowflake-3968820950-9dgr8 1/1 Running 0 2m
snowflake-3968820950-vgc4n 1/1 Running 0 2m
And this is great: developers are able to do what they want, and they do not have to worry about affecting content in the production namespace.
Let's switch to the production namespace and show how resources in one namespace are hidden from the other.
kubectl config use-context prod
The production namespace should be empty, and the following commands should return nothing.
kubectl get deployment
kubectl get pods
Production likes to run cattle, so let's create some cattle pods.
kubectl create deployment cattle --image=k8s.gcr.io/serve_hostname --replicas=5
kubectl get deployment
NAME READY UP-TO-DATE AVAILABLE AGE
cattle 5/5 5 5 10s
kubectl get pods -l app=cattle
NAME READY STATUS RESTARTS AGE
cattle-2263376956-41xy6 1/1 Running 0 34s
cattle-2263376956-kw466 1/1 Running 0 34s
cattle-2263376956-n4v97 1/1 Running 0 34s
cattle-2263376956-p5p3i 1/1 Running 0 34s
cattle-2263376956-sxpth 1/1 Running 0 34s
At this point, it should be clear that the resources users create in one namespace are hidden from the other namespace.
As the policy support in Kubernetes evolves, we will extend this scenario to show how you can provide different authorization rules for each namespace.
4.2.27 - Operating etcd clusters for Kubernetes
etcd is a consistent and highly-available key value store used as Kubernetes' backing store for all cluster data.
If your Kubernetes cluster uses etcd as its backing store, make sure you have a back-up plan for the data.
You can find in-depth information about etcd in the official documentation.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
To check the version, enter kubectl version.
Prerequisites
- Run etcd as a cluster of odd members.
- etcd is a leader-based distributed system. Ensure that the leader periodically sends heartbeats on time to all followers to keep the cluster stable.
- Ensure that no resource starvation occurs. Performance and stability of the cluster is sensitive to network and disk I/O. Any resource starvation can lead to heartbeat timeout, causing instability of the cluster. An unstable etcd indicates that no leader is elected. Under such circumstances, a cluster cannot make any changes to its current state, which implies no new pods can be scheduled.
- Keeping etcd clusters stable is critical to the stability of Kubernetes clusters. Therefore, run etcd clusters on dedicated machines or isolated environments for guaranteed resource requirements.
- The minimum recommended version of etcd to run in production is 3.2.10+.
Resource requirements
Operating etcd with limited resources is suitable only for testing purposes. For deploying in production, advanced hardware configuration is required. Before deploying etcd in production, see resource requirement reference.
Starting etcd clusters
This section covers starting a single-node and multi-node etcd cluster.
Single-node etcd cluster
Use a single-node etcd cluster only for testing purposes.
- Run the following:
etcd --listen-client-urls=http://$PRIVATE_IP:2379 \
  --advertise-client-urls=http://$PRIVATE_IP:2379
- Start the Kubernetes API server with the flag --etcd-servers=$PRIVATE_IP:2379, as sketched below. Make sure PRIVATE_IP is set to your etcd client IP.
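As a minimal sketch of those two steps together (the IP value is illustrative, other required kube-apiserver flags are elided, and the insecure HTTP scheme mirrors the test setup above):
PRIVATE_IP=10.0.0.10   # illustrative etcd client IP
kube-apiserver --etcd-servers=http://${PRIVATE_IP}:2379 [other flags]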
Multi-node etcd cluster
For durability and high availability, run etcd as a multi-node cluster in production and back it up periodically. A five-member cluster is recommended in production. For more information, see FAQ documentation.
Configure an etcd cluster either by static member information or by dynamic discovery. For more information on clustering, see etcd clustering documentation.
For an example, consider a five-member etcd cluster running with the following client URLs: http://$IP1:2379, http://$IP2:2379, http://$IP3:2379, http://$IP4:2379, and http://$IP5:2379. To start a Kubernetes API server:
- Run the following:
etcd --listen-client-urls=http://$IP1:2379,http://$IP2:2379,http://$IP3:2379,http://$IP4:2379,http://$IP5:2379 \
  --advertise-client-urls=http://$IP1:2379,http://$IP2:2379,http://$IP3:2379,http://$IP4:2379,http://$IP5:2379
- Start the Kubernetes API servers with the flag --etcd-servers=$IP1:2379,$IP2:2379,$IP3:2379,$IP4:2379,$IP5:2379. Make sure the IP<n> variables are set to your client IP addresses.
Multi-node etcd cluster with load balancer
To run a load balancing etcd cluster:
- Set up an etcd cluster.
- Configure a load balancer in front of the etcd cluster. For example, let the address of the load balancer be $LB.
- Start Kubernetes API Servers with the flag --etcd-servers=$LB:2379.
Securing etcd clusters
Access to etcd is equivalent to root permission in the cluster so ideally only the API server should have access to it. Considering the sensitivity of the data, it is recommended to grant permission to only those nodes that require access to etcd clusters.
To secure etcd, either set up firewall rules or use the security features provided by etcd. etcd security features depend on x509 Public Key Infrastructure (PKI). To begin, establish secure communication channels by generating a key and certificate pair. For example, use key pairs peer.key and peer.cert for securing communication between etcd members, and client.key and client.cert for securing communication between etcd and its clients. See the example scripts provided by the etcd project to generate key pairs and CA files for client authentication.
Securing communication
To configure etcd with secure peer communication, specify flags --peer-key-file=peer.key and --peer-cert-file=peer.cert, and use HTTPS as the URL schema.
Similarly, to configure etcd with secure client communication, specify flags --key-file=k8sclient.key and --cert-file=k8sclient.cert, and use HTTPS as the URL schema. Here is an example of a client command that uses secure communication:
ETCDCTL_API=3 etcdctl --endpoints 10.2.0.9:2379 \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
member list
Limiting access of etcd clusters
After configuring secure communication, restrict access to the etcd cluster to only the Kubernetes API servers. Use TLS authentication to do so.
For example, consider key pairs k8sclient.key and k8sclient.cert that are trusted by the CA etcd.ca. When etcd is configured with --client-cert-auth along with TLS, it verifies the certificates from clients by using system CAs or the CA passed in by the --trusted-ca-file flag. Specifying flags --client-cert-auth=true and --trusted-ca-file=etcd.ca will restrict the access to clients with the certificate k8sclient.cert.
Once etcd is configured correctly, only clients with valid certificates can access it. To give the Kubernetes API servers access, configure them with the flags --etcd-certfile=k8sclient.cert, --etcd-keyfile=k8sclient.key and --etcd-cafile=ca.cert.
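Putting the pieces together, here is a hedged sketch of an etcd server that enforces client certificate authentication (the flags are real etcd flags; the server certificate file names and the $PRIVATE_IP value are illustrative):
# Serve clients over TLS and require client certificates signed by etcd.ca
etcd --client-cert-auth=true --trusted-ca-file=etcd.ca \
  --cert-file=server.cert --key-file=server.key \
  --listen-client-urls=https://$PRIVATE_IP:2379 \
  --advertise-client-urls=https://$PRIVATE_IP:2379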
Replacing a failed etcd member
An etcd cluster achieves high availability by tolerating minor member failures. However, to improve the overall health of the cluster, replace failed members immediately. When multiple members fail, replace them one by one. Replacing a failed member involves two steps: removing the failed member and adding a new member.
Though etcd keeps unique member IDs internally, it is recommended to use a unique name for each member to avoid human errors. For example, consider a three-member etcd cluster. Let the URLs be member1=http://10.0.0.1, member2=http://10.0.0.2, and member3=http://10.0.0.3. When member1 fails, replace it with member4=http://10.0.0.4.
- Get the member ID of the failed member1:
etcdctl --endpoints=http://10.0.0.2,http://10.0.0.3 member list
The following message is displayed:
8211f1d0f64f3269, started, member1, http://10.0.0.1:2380, http://10.0.0.1:2379
91bc3c398fb3c146, started, member2, http://10.0.0.2:2380, http://10.0.0.2:2379
fd422379fda50e48, started, member3, http://10.0.0.3:2380, http://10.0.0.3:2379
- Remove the failed member:
etcdctl member remove 8211f1d0f64f3269
The following message is displayed:
Removed member 8211f1d0f64f3269 from cluster
- Add the new member:
etcdctl member add member4 --peer-urls=http://10.0.0.4:2380
The following message is displayed:
Member 2be1eb8f84b7f63e added to cluster ef37ad9dc622a7c4
- Start the newly added member on a machine with the IP 10.0.0.4:
export ETCD_NAME="member4"
export ETCD_INITIAL_CLUSTER="member2=http://10.0.0.2:2380,member3=http://10.0.0.3:2380,member4=http://10.0.0.4:2380"
export ETCD_INITIAL_CLUSTER_STATE=existing
etcd [flags]
- Do either of the following:
  - Update the --etcd-servers flag for the Kubernetes API servers to make Kubernetes aware of the configuration changes, then restart the Kubernetes API servers.
  - Update the load balancer configuration if a load balancer is used in the deployment.
For more information on cluster reconfiguration, see etcd reconfiguration documentation.
Backing up an etcd cluster
All Kubernetes objects are stored on etcd. Periodically backing up the etcd cluster data is important to recover Kubernetes clusters under disaster scenarios, such as losing all control plane nodes. The snapshot file contains all the Kubernetes states and critical information. In order to keep the sensitive Kubernetes data safe, encrypt the snapshot files.
Backing up an etcd cluster can be accomplished in two ways: etcd built-in snapshot and volume snapshot.
Built-in snapshot
etcd supports built-in snapshot. A snapshot may either be taken from a live member with the etcdctl snapshot save command or by copying the member/snap/db file from an etcd data directory that is not currently used by an etcd process. Taking the snapshot will not affect the performance of the member.
Below is an example for taking a snapshot of the keyspace served by $ENDPOINT to the file snapshotdb:
ETCDCTL_API=3 etcdctl --endpoints $ENDPOINT snapshot save snapshotdb
Verify the snapshot:
ETCDCTL_API=3 etcdctl --write-out=table snapshot status snapshotdb
+----------+----------+------------+------------+
| HASH | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| fe01cf57 | 10 | 7 | 2.1 MB |
+----------+----------+------------+------------+
Volume snapshot
If etcd is running on a storage volume that supports backup, such as Amazon Elastic Block Store, back up etcd data by taking a snapshot of the storage volume.
Snapshot using etcdctl options
You can also take the snapshot using the various options given by etcdctl. For example, ETCDCTL_API=3 etcdctl -h will list the options available from etcdctl. You can take a snapshot by specifying the endpoint, certificates, and so on, as shown below:
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
--cacert=<trusted-ca-file> --cert=<cert-file> --key=<key-file> \
snapshot save <backup-file-location>
where trusted-ca-file, cert-file and key-file can be obtained from the description of the etcd Pod.
Scaling up etcd clusters
Scaling up etcd clusters increases availability by trading off performance. Scaling does not increase cluster performance nor capability. A general rule is not to scale up or down etcd clusters. Do not configure any auto scaling groups for etcd clusters. It is highly recommended to always run a static five-member etcd cluster for production Kubernetes clusters at any officially supported scale.
A reasonable scaling is to upgrade a three-member cluster to a five-member one, when more reliability is desired. See etcd reconfiguration documentation for information on how to add members into an existing cluster.
Restoring an etcd cluster
etcd supports restoring from snapshots that are taken from an etcd process of the same major.minor version. Restoring from a different patch version of etcd is also supported. A restore operation is employed to recover the data of a failed cluster.
Before starting the restore operation, a snapshot file must be present. It can either be a snapshot file from a previous backup operation, or from a remaining data directory. Here is an example:
ETCDCTL_API=3 etcdctl --endpoints 10.2.0.9:2379 snapshot restore snapshotdb
Another example for restoring using etcdctl options:
ETCDCTL_API=3 etcdctl --data-dir <data-dir-location> snapshot restore snapshotdb
For more information and examples on restoring a cluster from a snapshot file, see etcd disaster recovery documentation.
If the access URLs of the restored cluster are changed from the previous cluster, the Kubernetes API server must be reconfigured accordingly. In this case, restart the Kubernetes API servers with the flag --etcd-servers=$NEW_ETCD_CLUSTER instead of the flag --etcd-servers=$OLD_ETCD_CLUSTER. Replace $NEW_ETCD_CLUSTER and $OLD_ETCD_CLUSTER with the respective IP addresses. If a load balancer is used in front of an etcd cluster, you might need to update the load balancer instead.
If the majority of etcd members have permanently failed, the etcd cluster is considered failed. In this scenario, Kubernetes cannot make any changes to its current state. Although the scheduled pods might continue to run, no new pods can be scheduled. In such cases, recover the etcd cluster and potentially reconfigure Kubernetes API servers to fix the issue.
If any API servers are running in your cluster, you should not attempt to restore instances of etcd. Instead, follow these steps to restore etcd:
- stop all API server instances
- restore state in all etcd instances
- restart all API server instances
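A hedged sketch of those three steps on a single control plane node follows. It assumes the API server runs as a systemd unit named kube-apiserver (kubeadm-based clusters run it as a static pod instead) and restores into a fresh, illustrative data directory; adjust both to your environment:
# Assumption: API server managed by systemd
sudo systemctl stop kube-apiserver
# Restore the snapshot into a fresh data directory (illustrative path)
ETCDCTL_API=3 etcdctl snapshot restore snapshotdb --data-dir /var/lib/etcd-restored
# Reconfigure etcd to use the restored data directory, then:
sudo systemctl start kube-apiserver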
We also recommend restarting any components (e.g. kube-scheduler, kube-controller-manager, kubelet) to ensure that they don't rely on some stale data. Note that in practice, the restore takes a bit of time. During the restoration, critical components will lose their leader lock and restart themselves.
Upgrading etcd clusters
For more details on etcd upgrade, please refer to the etcd upgrades documentation.
4.2.28 - Reconfigure a Node's Kubelet in a Live Cluster
Kubernetes v1.22 [deprecated]
Dynamic Kubelet Configuration allows you to change the configuration of each kubelet in a running Kubernetes cluster, by deploying a ConfigMap and configuring each Node to use it. The configuration you deploy is a serialized KubeletConfiguration.
Before you begin
You need to have a Kubernetes cluster. You also need kubectl, installed and configured to communicate with your cluster. Make sure that you are using a version of kubectl that is compatible with your cluster.
Your Kubernetes server must be at or later than version v1.11. To check the version, enter kubectl version.
Some of the examples use the command line tool jq. You do not need jq to complete the task, because there are manual alternatives.
For each node that you're reconfiguring, you must set the kubelet --dynamic-config-dir flag to a writable directory.
Reconfiguring the kubelet on a running node in your cluster
Basic workflow overview
The basic workflow for configuring a kubelet in a live cluster is as follows:
- Write a YAML or JSON configuration file containing the kubelet's configuration.
- Wrap this file in a ConfigMap and save it to the Kubernetes control plane.
- Update the kubelet's corresponding Node object to use this ConfigMap.
Each kubelet watches a configuration reference on its respective Node object. When this reference changes, the kubelet downloads the new configuration, updates a local reference to refer to the file, and exits. For the feature to work correctly, you must be running an OS-level service manager (such as systemd), which will restart the kubelet if it exits. When the kubelet is restarted, it will begin using the new configuration.
The new configuration completely overrides configuration provided by --config, and is overridden by command-line flags. Unspecified values in the new configuration will receive default values appropriate to the configuration version (e.g. kubelet.config.k8s.io/v1beta1), unless overridden by flags.
The status of the Node's kubelet configuration is reported via Node.Status.Config. Once you have updated a Node to use the new ConfigMap, you can observe this status to confirm that the Node is using the intended configuration.
This document describes editing Nodes using kubectl edit. There are other ways to modify a Node's spec, including kubectl patch, for example, which facilitates scripted workflows.
This document only describes a single Node consuming each ConfigMap. Keep in mind that it is also valid for multiple Nodes to consume the same ConfigMap. To roll out a configuration change gradually, you can create ConfigMaps with kubectl's --append-hash option, and incrementally roll out updates to Node.Spec.ConfigSource.
Automatic RBAC rules for Node Authorizer
Previously, you were required to manually create RBAC rules to allow Nodes to access their assigned ConfigMaps. The Node Authorizer now automatically configures these rules.
Generating a file that contains the current configuration
The Dynamic Kubelet Configuration feature allows you to provide an override for the entire configuration object, rather than a per-field overlay. This is a simpler model that makes it easier to trace the source of configuration values and debug issues. The compromise, however, is that you must start with knowledge of the existing configuration to ensure that you only change the fields you intend to change.
The kubelet loads settings from its configuration file, but you can set command line flags to override the configuration in the file. This means that if you only know the contents of the configuration file, and you don't know the command line overrides, then you do not know the running configuration either.
Because you need to know the running configuration in order to override it, you can fetch the running configuration from the kubelet. You can generate a config file containing a Node's current configuration by accessing the kubelet's configz endpoint, through kubectl proxy. The next section explains how to do this.
The configz endpoint is there to help with debugging, and is not a stable part of kubelet behavior. Do not rely on the behavior of this endpoint for production scenarios or for use with automated tools.
For more information on configuring the kubelet via a configuration file, see Set kubelet parameters via a config file.
Generate the configuration file
The following steps use the jq command to streamline working with JSON. To follow the tasks as written, you need to have jq installed. You can adapt the steps if you prefer to extract the kubeletconfig subobject manually.
- Choose a Node to reconfigure. In this example, the name of this Node is referred to as NODE_NAME.
- Start the kubectl proxy in the background using the following command:
kubectl proxy --port=8001 &
- Run the following command to download and unpack the configuration from the configz endpoint. The command is long, so be careful when copying and pasting. If you use zsh, note that common zsh configurations add backslashes to escape the opening and closing curly braces around the variable name in the URL. For example: ${NODE_NAME} will be rewritten as $\{NODE_NAME\} during the paste. You must remove the backslashes before running the command, or the command will fail.
NODE_NAME="the-name-of-the-node-you-are-reconfiguring"
curl -sSL "http://localhost:8001/api/v1/nodes/${NODE_NAME}/proxy/configz" \
  | jq '.kubeletconfig|.kind="KubeletConfiguration"|.apiVersion="kubelet.config.k8s.io/v1beta1"' \
  > kubelet_configz_${NODE_NAME}
The jq filter in the command above adds the kind and apiVersion fields to the downloaded object, because those fields are not reported by the configz endpoint.
Edit the configuration file
Using a text editor, change one of the parameters in the file generated by the previous procedure. For example, you might edit the parameter eventRecordQPS, which controls rate limiting for event recording.
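For instance, after the edit the relevant fragment of the generated JSON file might look like this (eventRecordQPS is a real KubeletConfiguration field; the value 10 is illustrative, and the other fields are elided):
{
  "kind": "KubeletConfiguration",
  "apiVersion": "kubelet.config.k8s.io/v1beta1",
  ...
  "eventRecordQPS": 10
}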
Push the configuration file to the control plane
Push the edited configuration file to the control plane with the following command:
kubectl -n kube-system create configmap my-node-config --from-file=kubelet=kubelet_configz_${NODE_NAME} --append-hash -o yaml
This is an example of a valid response:
apiVersion: v1
kind: ConfigMap
metadata:
creationTimestamp: 2017-09-14T20:23:33Z
name: my-node-config-gkt4c2m4b2
namespace: kube-system
resourceVersion: "119980"
uid: 946d785e-998a-11e7-a8dd-42010a800006
data:
kubelet: |
{...}
You created that ConfigMap inside the kube-system namespace because the kubelet is a Kubernetes system component.
The --append-hash option appends a short checksum of the ConfigMap contents to the name. This is convenient for an edit-then-push workflow, because it automatically, yet deterministically, generates new names for new resources. The name that includes this generated hash is referred to as CONFIG_MAP_NAME in the following examples.
Set the Node to use the new configuration
Edit the Node's reference to point to the new ConfigMap with the following command:
kubectl edit node ${NODE_NAME}
In your text editor, add the following YAML under spec:
configSource:
configMap:
name: CONFIG_MAP_NAME # replace CONFIG_MAP_NAME with the name of the ConfigMap
namespace: kube-system
kubeletConfigKey: kubelet
You must specify all three of name, namespace, and kubeletConfigKey. The kubeletConfigKey parameter shows the kubelet which key of the ConfigMap contains its config.
Observe that the Node begins using the new configuration
Retrieve the Node using the kubectl get node ${NODE_NAME} -o yaml command and inspect Node.Status.Config. The config sources corresponding to the active, assigned, and lastKnownGood configurations are reported in the status.
- The active configuration is the version the kubelet is currently running with.
- The assigned configuration is the latest version the kubelet has resolved based on Node.Spec.ConfigSource.
- The lastKnownGood configuration is the version the kubelet will fall back to if an invalid config is assigned in Node.Spec.ConfigSource.
The lastKnownGood configuration might not be present if it is set to its default value, the local config deployed with the node. The status will update lastKnownGood to match a valid assigned config after the kubelet becomes comfortable with the config. The details of how the kubelet determines which config should become the lastKnownGood are not guaranteed by the API, but this is currently implemented as a 10-minute grace period.
You can use the following command (using jq) to filter down to the config status:
kubectl get no ${NODE_NAME} -o json | jq '.status.config'
The following is an example response:
{
"active": {
"configMap": {
"kubeletConfigKey": "kubelet",
"name": "my-node-config-9mbkccg2cc",
"namespace": "kube-system",
"resourceVersion": "1326",
"uid": "705ab4f5-6393-11e8-b7cc-42010a800002"
}
},
"assigned": {
"configMap": {
"kubeletConfigKey": "kubelet",
"name": "my-node-config-9mbkccg2cc",
"namespace": "kube-system",
"resourceVersion": "1326",
"uid": "705ab4f5-6393-11e8-b7cc-42010a800002"
}
},
"lastKnownGood": {
"configMap": {
"kubeletConfigKey": "kubelet",
"name": "my-node-config-9mbkccg2cc",
"namespace": "kube-system",
"resourceVersion": "1326",
"uid": "705ab4f5-6393-11e8-b7cc-42010a800002"
}
}
}
(If you do not have jq, you can look at the whole response and find Node.Status.Config by eye.)
If an error occurs, the kubelet reports it in the Node.Status.Config.Error structure. Possible errors are listed in Understanding Node.Status.Config.Error messages. You can search for the identical text in the kubelet log for additional details and context about the error.
Make more changes
Follow the workflow above to make more changes and push them again. Each time you push a ConfigMap with new contents, the --append-hash kubectl option creates the ConfigMap with a new name. The safest rollout strategy is to first create a new ConfigMap, and then update the Node to use the new ConfigMap.
Reset the Node to use its local default configuration
To reset the Node to use the configuration it was provisioned with, edit the Node using kubectl edit node ${NODE_NAME} and remove the Node.Spec.ConfigSource field.
Observe that the Node is using its local default configuration
After removing this subfield, Node.Status.Config eventually becomes empty, since all config sources have been reset to nil, which indicates that the local default config is assigned, active, and lastKnownGood, and no error is reported.
kubectl patch example
You can change a Node's configSource using several different mechanisms. This example uses kubectl patch:
kubectl patch node ${NODE_NAME} -p "{\"spec\":{\"configSource\":{\"configMap\":{\"name\":\"${CONFIG_MAP_NAME}\",\"namespace\":\"kube-system\",\"kubeletConfigKey\":\"kubelet\"}}}}"
Understanding how the kubelet checkpoints config
When a new config is assigned to the Node, the kubelet downloads and unpacks the config payload as a set of files on the local disk. The kubelet also records metadata that locally tracks the assigned and last-known-good config sources, so that the kubelet knows which config to use across restarts, even if the API server becomes unavailable. After checkpointing a config and the relevant metadata, the kubelet exits if it detects that the assigned config has changed. When the kubelet is restarted by the OS-level service manager (such as systemd), it reads the new metadata and uses the new config.
The recorded metadata is fully resolved, meaning that it contains all necessary information to choose a specific config version - typically a UID and ResourceVersion. This is in contrast to Node.Spec.ConfigSource, where the intended config is declared via the idempotent namespace/name that identifies the target ConfigMap; the kubelet tries to use the latest version of this ConfigMap.
When you are debugging problems on a node, you can inspect the kubelet's config metadata and checkpoints. The structure of the kubelet's checkpointing directory is:
- --dynamic-config-dir (root for managing dynamic config)
| - meta
| - assigned (encoded kubeletconfig/v1beta1.SerializedNodeConfigSource object, indicating the assigned config)
| - last-known-good (encoded kubeletconfig/v1beta1.SerializedNodeConfigSource object, indicating the last-known-good config)
| - checkpoints
| - uid1 (dir for versions of object identified by uid1)
| - resourceVersion1 (dir for unpacked files from resourceVersion1 of object with uid1)
| - ...
| - ...
Understanding Node.Status.Config.Error messages
The following table describes error messages that can occur when using Dynamic Kubelet Config. You can search for the identical text in the Kubelet log for additional details and context about the error.
Error Message | Possible Causes |
---|---|
failed to load config, see Kubelet log for details | The kubelet likely could not parse the downloaded config payload, or encountered a filesystem error attempting to load the payload from disk. |
failed to validate config, see Kubelet log for details | The configuration in the payload, combined with any command-line flag overrides, and the sum of feature gates from flags, the config file, and the remote payload, was determined to be invalid by the kubelet. |
invalid NodeConfigSource, exactly one subfield must be non-nil, but all were nil | Since Node.Spec.ConfigSource is validated by the API server to contain at least one non-nil subfield, this likely means that the kubelet is older than the API server and does not recognize a newer source type. |
failed to sync: failed to download config, see Kubelet log for details | The kubelet could not download the config. It is possible that Node.Spec.ConfigSource could not be resolved to a concrete API object, or that network errors disrupted the download attempt. The kubelet will retry the download when in this error state. |
failed to sync: internal failure, see Kubelet log for details | The kubelet encountered some internal problem and failed to update its config as a result. Examples include filesystem errors and reading objects from the internal informer cache. |
internal failure, see Kubelet log for details | The kubelet encountered some internal problem while manipulating config, outside of the configuration sync loop. |
What's next
- Set kubelet parameters via a config file explains the supported way to configure a kubelet.
- See the reference documentation for Node, including the
configSource
field within the Node's .spec - Learn more about kubelet configuration by checking the
KubeletConfiguration
reference.
4.2.29 - Reserve Compute Resources for System Daemons
Kubernetes nodes can be scheduled to Capacity. Pods can consume all the available capacity on a node by default. This is an issue because nodes typically run quite a few system daemons that power the OS and Kubernetes itself. Unless resources are set aside for these system daemons, pods and system daemons compete for resources and lead to resource starvation issues on the node.
The kubelet exposes a feature named 'Node Allocatable' that helps to reserve compute resources for system daemons. Kubernetes recommends that cluster administrators configure 'Node Allocatable' based on their workload density on each node.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
Your Kubernetes server must be at or later than version 1.8. To check the version, enter kubectl version.
Your Kubernetes server must be at or later than version 1.17 to use the kubelet command line option --reserved-cpus to set an explicitly reserved CPU list.
Node Allocatable
'Allocatable' on a Kubernetes node is defined as the amount of compute resources that are available for pods. The scheduler does not over-subscribe 'Allocatable'. 'CPU', 'memory' and 'ephemeral-storage' are currently supported.
Node Allocatable is exposed as part of the v1.Node object in the API and as part of kubectl describe node in the CLI.
Resources can be reserved for two categories of system daemons in the kubelet.
Enabling QoS and Pod level cgroups
To properly enforce node allocatable constraints on the node, you must enable the new cgroup hierarchy via the --cgroups-per-qos flag. This flag is enabled by default. When enabled, the kubelet will parent all end-user pods under a cgroup hierarchy managed by the kubelet.
Configuring a cgroup driver
The kubelet supports manipulation of the cgroup hierarchy on the host using a cgroup driver. The driver is configured via the --cgroup-driver flag.
The supported values are the following:
- cgroupfs is the default driver that performs direct manipulation of the cgroup filesystem on the host in order to manage cgroup sandboxes.
- systemd is an alternative driver that manages cgroup sandboxes using transient slices for resources that are supported by that init system.
Depending on the configuration of the associated container runtime, operators may have to choose a particular cgroup driver to ensure proper system behavior. For example, if operators use the systemd cgroup driver provided by the containerd runtime, the kubelet must be configured to use the systemd cgroup driver.
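As a hedged sketch, the kubelet side of that pairing can be expressed in the kubelet configuration file (cgroupDriver is the real KubeletConfiguration field; verifying that it matches your runtime's driver is up to you):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Must match the cgroup driver used by the container runtime
cgroupDriver: systemd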
Kube Reserved
- Kubelet Flag: --kube-reserved=[cpu=100m][,][memory=100Mi][,][ephemeral-storage=1Gi][,][pid=1000]
- Kubelet Flag: --kube-reserved-cgroup=
kube-reserved is meant to capture resource reservation for kubernetes system daemons like the kubelet, container runtime, node problem detector, etc. It is not meant to reserve resources for system daemons that are run as pods. kube-reserved is typically a function of pod density on the nodes.
In addition to cpu, memory, and ephemeral-storage, pid may be specified to reserve the specified number of process IDs for kubernetes system daemons.
To optionally enforce kube-reserved on kubernetes system daemons, specify the parent control group for kube daemons as the value for the --kube-reserved-cgroup kubelet flag.
It is recommended that the kubernetes system daemons are placed under a top level control group (runtime.slice on systemd machines for example). Each system daemon should ideally run within its own child control group. Refer to the design proposal for more details on recommended control group hierarchy.
Note that the kubelet does not create --kube-reserved-cgroup if it doesn't exist. The kubelet will fail if an invalid cgroup is specified.
System Reserved
- Kubelet Flag: --system-reserved=[cpu=100m][,][memory=100Mi][,][ephemeral-storage=1Gi][,][pid=1000]
- Kubelet Flag: --system-reserved-cgroup=
system-reserved is meant to capture resource reservation for OS system daemons like sshd, udev, etc. system-reserved should reserve memory for the kernel too since kernel memory is not accounted to pods in Kubernetes at this time. Reserving resources for user login sessions is also recommended (user.slice in the systemd world).
In addition to cpu, memory, and ephemeral-storage, pid may be specified to reserve the specified number of process IDs for OS system daemons.
To optionally enforce system-reserved on system daemons, specify the parent control group for OS system daemons as the value for the --system-reserved-cgroup kubelet flag.
It is recommended that the OS system daemons are placed under a top level control group (system.slice on systemd machines for example).
Note that the kubelet does not create --system-reserved-cgroup if it doesn't exist. The kubelet will fail if an invalid cgroup is specified.
Explicitly Reserved CPU List
Kubernetes v1.17 [stable]
Kubelet Flag: --reserved-cpus=0-3
reserved-cpus is meant to define an explicit CPU set for OS system daemons and kubernetes system daemons. reserved-cpus is for systems that do not intend to define separate top level cgroups for OS system daemons and kubernetes system daemons with regard to the cpuset resource.
If the kubelet does not have --system-reserved-cgroup and --kube-reserved-cgroup, the explicit cpuset provided by reserved-cpus takes precedence over the CPUs defined by the --kube-reserved and --system-reserved options.
This option is specifically designed for Telco/NFV use cases where uncontrolled interrupts/timers may impact workload performance. You can use this option to define the explicit cpuset for the system/kubernetes daemons as well as the interrupts/timers, so that the rest of the CPUs on the system can be used exclusively for workloads, with less impact from uncontrolled interrupts/timers. To move the system daemons, kubernetes daemons and interrupts/timers to the explicit cpuset defined by this option, mechanisms outside Kubernetes should be used. For example, in CentOS, you can do this using the tuned toolset.
Eviction Thresholds
Kubelet Flag: --eviction-hard=[memory.available<500Mi]
Memory pressure at the node level leads to system OOMs which affect the entire node and all pods running on it. Nodes can go offline temporarily until memory has been reclaimed. To avoid (or reduce the probability of) system OOMs, the kubelet provides out-of-resource management. Evictions are supported for memory and ephemeral-storage only. By reserving some memory via the --eviction-hard flag, the kubelet attempts to evict pods whenever memory availability on the node drops below the reserved value. Hypothetically, if system daemons did not exist on a node, pods could not use more than capacity - eviction-hard. For this reason, resources reserved for evictions are not available for pods.
Enforcing Node Allocatable
Kubelet Flag: --enforce-node-allocatable=pods[,][system-reserved][,][kube-reserved]
The scheduler treats 'Allocatable' as the available capacity for pods.
The kubelet enforces 'Allocatable' across pods by default. Enforcement is performed by evicting pods whenever the overall usage across all pods exceeds 'Allocatable'. More details on eviction policy can be found on the node pressure eviction page. This enforcement is controlled by specifying the pods value to the kubelet flag --enforce-node-allocatable.
Optionally, the kubelet can be made to enforce kube-reserved and system-reserved by specifying kube-reserved & system-reserved values in the same flag. Note that to enforce kube-reserved or system-reserved, --kube-reserved-cgroup or --system-reserved-cgroup needs to be specified respectively.
General Guidelines
System daemons are expected to be treated similarly to 'Guaranteed' pods. System daemons can burst within their bounding control groups and this behavior needs to be managed as part of kubernetes deployments. For example, the kubelet should have its own control group and share kube-reserved resources with the container runtime. However, the kubelet cannot burst and use up all available Node resources if kube-reserved is enforced.
Be extra careful while enforcing system-reserved reservation since it can lead to critical system services being CPU starved, OOM killed, or unable to fork on the node. The recommendation is to enforce system-reserved only if a user has profiled their nodes exhaustively to come up with precise estimates and is confident in their ability to recover if any process in that group is oom-killed.
- To begin with, enforce 'Allocatable' on pods.
- Once adequate monitoring and alerting is in place to track kube system daemons, attempt to enforce kube-reserved based on usage heuristics.
- If absolutely necessary, enforce system-reserved over time.
The resource requirements of kube system daemons may grow over time as more and more features are added. Over time, the kubernetes project will attempt to bring down utilization of node system daemons, but that is not a priority as of now. So expect a drop in Allocatable capacity in future releases.
Example Scenario
Here is an example to illustrate Node Allocatable computation:
- Node has 32Gi of memory, 16 CPUs and 100Gi of Storage
- --kube-reserved is set to cpu=1,memory=2Gi,ephemeral-storage=1Gi
- --system-reserved is set to cpu=500m,memory=1Gi,ephemeral-storage=1Gi
- --eviction-hard is set to memory.available<500Mi,nodefs.available<10%
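For reference, a minimal sketch of the same reservations expressed in a kubelet configuration file (kubeReserved, systemReserved, and evictionHard are the real KubeletConfiguration equivalents of the flags above; the values mirror this scenario):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
kubeReserved:
  cpu: "1"
  memory: 2Gi
  ephemeral-storage: 1Gi
systemReserved:
  cpu: 500m
  memory: 1Gi
  ephemeral-storage: 1Gi
evictionHard:
  memory.available: 500Mi
  nodefs.available: "10%"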
Under this scenario, 'Allocatable' will be 14.5 CPUs, 28.5Gi of memory and 88Gi of local storage (for memory, for example: 32Gi - 2Gi - 1Gi - 0.5Gi = 28.5Gi).
The scheduler ensures that the total memory requests across all pods on this node do not exceed 28.5Gi and storage doesn't exceed 88Gi. The kubelet evicts pods whenever the overall memory usage across pods exceeds 28.5Gi, or if overall disk usage exceeds 88Gi. If all processes on the node consume as much CPU as they can, pods together cannot consume more than 14.5 CPUs.
If kube-reserved and/or system-reserved is not enforced and system daemons exceed their reservation, the kubelet evicts pods whenever the overall node memory usage is higher than 31.5Gi or storage is greater than 90Gi.
4.2.30 - Running Kubernetes Node Components as a Non-root User
Kubernetes v1.22 [alpha]
This document describes how to run Kubernetes Node components such as kubelet, CRI, OCI, and CNI without root privileges, by using a user namespace.
This technique is also known as rootless mode.
This document describes how to run Kubernetes Node components (and hence pods) as a non-root user.
If you are just looking for how to run a pod as a non-root user, see SecurityContext.
Before you begin
Your Kubernetes server must be at or later than version 1.22. To check the version, enter kubectl version.
- Enable Cgroup v2
- Enable systemd with user session
- Configure several sysctl values, depending on host Linux distribution
- Ensure that your unprivileged user is listed in /etc/subuid and /etc/subgid
- Enable the KubeletInUserNamespace feature gate
Running Kubernetes inside Rootless Docker/Podman
kind
kind supports running Kubernetes inside Rootless Docker or Rootless Podman.
See Running kind with Rootless Docker.
minikube
minikube also supports running Kubernetes inside Rootless Docker.
See the page about the docker driver in the Minikube documentation.
Rootless Podman is not supported.
Running Kubernetes inside Unprivileged Containers
sysbox
Sysbox is an open-source container runtime (similar to "runc") that supports running system-level workloads such as Docker and Kubernetes inside unprivileged containers isolated with the Linux user namespace.
See Sysbox Quick Start Guide: Kubernetes-in-Docker for more info.
Sysbox supports running Kubernetes inside unprivileged containers without requiring Cgroup v2 and without the KubeletInUserNamespace feature gate. It does this by exposing specially crafted /proc and /sys filesystems inside the container plus several other advanced OS virtualization techniques.
Running Rootless Kubernetes directly on a host
K3s
K3s experimentally supports rootless mode.
See Running K3s with Rootless mode for the usage.
Usernetes
Usernetes is a reference distribution of Kubernetes that can be installed under the $HOME directory without root privileges.
Usernetes supports both containerd and CRI-O as CRI runtimes. Usernetes supports multi-node clusters using Flannel (VXLAN).
See the Usernetes repo for the usage.
Manually deploy a node that runs the kubelet in a user namespace
This section provides hints for running Kubernetes in a user namespace manually.
Creating a user namespace
The first step is to create a user namespace.
If you are trying to run Kubernetes in a user-namespaced container such as Rootless Docker/Podman or LXC/LXD, you are all set, and you can go to the next subsection.
Otherwise you have to create a user namespace by yourself, by calling unshare(2) with CLONE_NEWUSER. A user namespace can also be unshared by using command line tools.
After unsharing the user namespace, you will also have to unshare other namespaces such as mount namespace.
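A minimal sketch using the util-linux unshare(1) tool (the flags are real unshare options; whether this combination is sufficient depends on your distribution and on which additional namespaces your setup needs):
# Create new user and mount namespaces, mapping the current user to root inside
unshare --user --map-root-user --mount bash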
You do not need to call chroot() or pivot_root() after unsharing the mount namespace; however, you have to mount writable filesystems on several directories in the namespace.
At least, the following directories need to be writable in the namespace (not outside the namespace):
At least, the following directories need to be writable in the namespace (not outside the namespace):
/etc
/run
/var/logs
/var/lib/kubelet
/var/lib/cni
/var/lib/containerd
(for containerd)/var/lib/containers
(for CRI-O)
Creating a delegated cgroup tree
In addition to the user namespace, you also need to have a writable cgroup tree with cgroup v2.
If you are trying to run Kubernetes in Rootless Docker/Podman or LXC/LXD on a systemd-based host, you are all set.
Otherwise you have to create a systemd unit with the Delegate=yes property to delegate a cgroup tree with writable permission.
On your node, systemd must already be configured to allow delegation; for more details, see cgroup v2 in the Rootless Containers documentation.
Configuring network
The network namespace of the Node components has to have a non-loopback interface, which can be for example configured with slirp4netns, VPNKit, or lxc-user-nic(1).
The network namespaces of the Pods can be configured with regular CNI plugins. For multi-node networking, Flannel (VXLAN, 8472/UDP) is known to work.
Ports such as the kubelet port (10250/TCP) and NodePort service ports have to be exposed from the Node network namespace to the host with an external port forwarder, such as RootlessKit, slirp4netns, or socat(1).
You can use the port forwarder from K3s. See Running K3s in Rootless Mode for more details. The implementation can be found in the pkg/rootlessports package of k3s.
Configuring CRI
The kubelet relies on a container runtime. You should deploy a container runtime such as containerd or CRI-O and ensure that it is running within the user namespace before the kubelet starts.
Running CRI plugin of containerd in a user namespace is supported since containerd 1.4.
Running containerd within a user namespace requires the following configurations.
version = 2
[plugins."io.containerd.grpc.v1.cri"]
# Disable AppArmor
disable_apparmor = true
# Ignore an error during setting oom_score_adj
restrict_oom_score_adj = true
# Disable hugetlb cgroup v2 controller (because systemd does not support delegating hugetlb controller)
disable_hugetlb_controller = true
[plugins."io.containerd.grpc.v1.cri".containerd]
# Using non-fuse overlayfs is also possible for kernel >= 5.11, but requires SELinux to be disabled
snapshotter = "fuse-overlayfs"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
# We use cgroupfs that is delegated by systemd, so we do not use SystemdCgroup driver
# (unless you run another systemd in the namespace)
SystemdCgroup = false
The default path of the configuration file is /etc/containerd/config.toml. The path can be specified with containerd -c /path/to/containerd/config.toml.
Running CRI-O in a user namespace is supported since CRI-O 1.22.
CRI-O requires the environment variable _CRIO_ROOTLESS=1 to be set.
The following configurations are also recommended:
[crio]
storage_driver = "overlay"
# Using non-fuse overlayfs is also possible for kernel >= 5.11, but requires SELinux to be disabled
storage_option = ["overlay.mount_program=/usr/local/bin/fuse-overlayfs"]
[crio.runtime]
# We use cgroupfs that is delegated by systemd, so we do not use "systemd" driver
# (unless you run another systemd in the namespace)
cgroup_manager = "cgroupfs"
The default path of the configuration file is /etc/crio/crio.conf. The path can be specified with crio --config /path/to/crio/crio.conf.
Configuring kubelet
Running kubelet in a user namespace requires the following configuration:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
KubeletInUserNamespace: true
# We use cgroupfs that is delegated by systemd, so we do not use "systemd" driver
# (unless you run another systemd in the namespace)
cgroupDriver: "cgroupfs"
When the KubeletInUserNamespace feature gate is enabled, the kubelet ignores errors that may happen during setting the following sysctl values on the node:
- vm.overcommit_memory
- vm.panic_on_oom
- kernel.panic
- kernel.panic_on_oops
- kernel.keys.root_maxkeys
- kernel.keys.root_maxbytes
Within a user namespace, the kubelet also ignores any error raised from trying to open /dev/kmsg. This feature gate also allows kube-proxy to ignore an error during setting RLIMIT_NOFILE.
The KubeletInUserNamespace feature gate was introduced in Kubernetes v1.22 with "alpha" status. Running the kubelet in a user namespace without using this feature gate is also possible by mounting a specially crafted proc filesystem (as done by Sysbox), but doing so is not officially supported.
Configuring kube-proxy
Running kube-proxy in a user namespace requires the following configuration:
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "iptables" # or "userspace"
conntrack:
# Skip setting sysctl value "net.netfilter.nf_conntrack_max"
maxPerCore: 0
# Skip setting "net.netfilter.nf_conntrack_tcp_timeout_established"
tcpEstablishedTimeout: 0s
# Skip setting "net.netfilter.nf_conntrack_tcp_timeout_close"
tcpCloseWaitTimeout: 0s
Caveats
- Most "non-local" volume drivers such as nfs and iscsi do not work. Local volumes like local, hostPath, emptyDir, configMap, secret, and downwardAPI are known to work.
- Some CNI plugins may not work. Flannel (VXLAN) is known to work.
For more on this, see the Caveats and Future work page on the rootlesscontaine.rs website.
See Also
4.2.31 - Safely Drain a Node
This page shows how to safely drain a node, optionally respecting the PodDisruptionBudget you have defined.
Before you begin
Your Kubernetes server must be at or later than version 1.5. To check the version, enter kubectl version.
This task also assumes that you have met the following prerequisites:
- You do not require your applications to be highly available during the node drain, or
- You have read about the PodDisruptionBudget concept, and have configured PodDisruptionBudgets for applications that need them.
(Optional) Configure a disruption budget
To ensure that your workloads remain available during maintenance, you can configure a PodDisruptionBudget.
If availability is important for any applications that run or could run on the node(s) that you are draining, configure a PodDisruptionBudget first and then continue following this guide.
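For instance, a minimal PodDisruptionBudget might look like the following sketch (the name and the app label are hypothetical; policy/v1 is the real API group):
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb            # hypothetical name
spec:
  minAvailable: 2             # keep at least two matching pods available during voluntary disruptions
  selector:
    matchLabels:
      app: my-app             # hypothetical label on the protected pods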
Use kubectl drain to remove a node from service
You can use kubectl drain to safely evict all of your pods from a node before you perform maintenance on the node (e.g. kernel upgrade, hardware maintenance, etc.). Safe evictions allow the pod's containers to gracefully terminate and will respect the PodDisruptionBudgets you have specified.
kubectl drain ignores certain system pods on the node that cannot be killed; see the kubectl drain documentation for more details.
When kubectl drain returns successfully, that indicates that all of the pods (except the ones excluded as described in the previous paragraph) have been safely evicted (respecting the desired graceful termination period, and respecting the PodDisruptionBudget you have defined). It is then safe to bring down the node by powering down its physical machine or, if running on a cloud platform, deleting its virtual machine.
First, identify the name of the node you wish to drain. You can list all of the nodes in your cluster with
kubectl get nodes
Next, tell Kubernetes to drain the node:
kubectl drain <node name>
Once it returns (without giving an error), you can power down the node (or equivalently, if on a cloud platform, delete the virtual machine backing the node). If you leave the node in the cluster during the maintenance operation, you need to run
kubectl uncordon <node name>
afterwards to tell Kubernetes that it can resume scheduling new pods onto the node.
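As a usage sketch (node-1 is a hypothetical node name; --ignore-daemonsets is a real kubectl drain flag, commonly needed because DaemonSet-managed pods would otherwise block the drain):
kubectl drain node-1 --ignore-daemonsets
# ...perform node maintenance...
kubectl uncordon node-1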
Draining multiple nodes in parallel
The kubectl drain command should only be issued to a single node at a time. However, you can run multiple kubectl drain commands for different nodes in parallel, in different terminals or in the background. Multiple drain commands running concurrently will still respect the PodDisruptionBudget you specify.
For example, if you have a StatefulSet with three replicas and have set a PodDisruptionBudget for that set specifying minAvailable: 2, kubectl drain only evicts a pod from the StatefulSet if all three replica pods are ready; if you then issue multiple drain commands in parallel, Kubernetes respects the PodDisruptionBudget and ensures that only 1 (calculated as replicas - minAvailable) pod is unavailable at any given time. Any drains that would cause the number of ready replicas to fall below the specified budget are blocked.
The Eviction API
If you prefer not to use kubectl drain (such as to avoid calling to an external command, or to get finer control over the pod eviction process), you can also programmatically cause evictions using the eviction API.
For more information, see API-initiated eviction.
What's next
- Follow steps to protect your application by configuring a Pod Disruption Budget.
4.2.32 - Securing a Cluster
This document covers topics related to protecting a cluster from accidental or malicious access and provides recommendations on overall security.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
To check the version, enter kubectl version.
Controlling access to the Kubernetes API
As Kubernetes is entirely API-driven, controlling and limiting who can access the cluster and what actions they are allowed to perform is the first line of defense.
Use Transport Layer Security (TLS) for all API traffic
Kubernetes expects that all API communication in the cluster is encrypted by default with TLS, and the majority of installation methods will allow the necessary certificates to be created and distributed to the cluster components. Note that some components and installation methods may enable local ports over HTTP and administrators should familiarize themselves with the settings of each component to identify potentially unsecured traffic.
API Authentication
Choose an authentication mechanism for the API servers to use that matches the common access patterns when you install a cluster. For instance, small, single-user clusters may wish to use a simple certificate or static Bearer token approach. Larger clusters may wish to integrate an existing OIDC or LDAP server that allows users to be subdivided into groups.
All API clients must be authenticated, even those that are part of the infrastructure like nodes, proxies, the scheduler, and volume plugins. These clients are typically service accounts or use x509 client certificates, and they are created automatically at cluster startup or are setup as part of the cluster installation.
Consult the authentication reference document for more information.
API Authorization
Once authenticated, every API call is also expected to pass an authorization check. Kubernetes ships an integrated Role-Based Access Control (RBAC) component that matches an incoming user or group to a set of permissions bundled into roles. These permissions combine verbs (get, create, delete) with resources (pods, services, nodes) and can be namespace-scoped or cluster-scoped. A set of out-of-the-box roles are provided that offer reasonable default separation of responsibility depending on what actions a client might want to perform. It is recommended that you use the Node and RBAC authorizers together, in combination with the NodeRestriction admission plugin.
As with authentication, simple and broad roles may be appropriate for smaller clusters, but as more users interact with the cluster, it may become necessary to separate teams into separate namespaces with more limited roles.
With authorization, it is important to understand how updates on one object may cause actions in other places. For instance, a user may not be able to create pods directly, but allowing them to create a deployment, which creates pods on their behalf, will let them create those pods indirectly. Likewise, deleting a node from the API will result in the pods scheduled to that node being terminated and recreated on other nodes. The out-of-the-box roles represent a balance between flexibility and common use cases, but more limited roles should be carefully reviewed to prevent accidental escalation. You can make roles specific to your use case if the out-of-the-box ones don't meet your needs.
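To make the verb/resource pairing concrete, here is a minimal sketch of a narrowly scoped namespaced role (the namespace and role name are hypothetical; rbac.authorization.k8s.io/v1 and the fields shown are the real RBAC types):
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: development   # hypothetical namespace
  name: pod-reader         # hypothetical role name
rules:
- apiGroups: [""]          # "" refers to the core API group
  resources: ["pods"]
  verbs: ["get", "list", "watch"]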
Consult the authorization reference section for more information.
Controlling access to the Kubelet
Kubelets expose HTTPS endpoints which grant powerful control over the node and containers. By default Kubelets allow unauthenticated access to this API.
Production clusters should enable Kubelet authentication and authorization.
Consult the Kubelet authentication/authorization reference for more information.
Controlling the capabilities of a workload or user at runtime
Authorization in Kubernetes is intentionally high level, focused on coarse actions on resources. More powerful controls exist as policies to limit by use case how those objects act on the cluster, themselves, and other resources.
Limiting resource usage on a cluster
Resource quota limits the number or capacity of resources granted to a namespace. This is most often used to limit the amount of CPU, memory, or persistent disk a namespace can allocate, but can also control how many pods, services, or volumes exist in each namespace.
Limit ranges restrict the maximum or minimum size of some of the resources above, to prevent users from requesting unreasonably high or low values for commonly reserved resources like memory, or to provide default limits when none are specified.
Controlling what privileges containers run with
A pod definition contains a security context that allows it to request access to run as a specific Linux user on a node (like root), access to run privileged or access the host network, and other controls that would otherwise allow it to run unfettered on a hosting node. Pod security policies can limit which users or service accounts can provide dangerous security context settings. For example, pod security policies can limit volume mounts, especially hostPath, which are aspects of a pod that should be controlled.
Generally, most application workloads need limited access to host resources so they can successfully run as a root process (uid 0) without access to host information. However, considering the privileges associated with the root user, you should write application containers to run as a non-root user. Similarly, administrators who wish to prevent client applications from escaping their containers should use a restrictive pod security policy.
Preventing containers from loading unwanted kernel modules
The Linux kernel automatically loads kernel modules from disk if needed in certain circumstances, such as when a piece of hardware is attached or a filesystem is mounted. Of particular relevance to Kubernetes, even unprivileged processes can cause certain network-protocol-related kernel modules to be loaded, just by creating a socket of the appropriate type. This may allow an attacker to exploit a security hole in a kernel module that the administrator assumed was not in use.
To prevent specific modules from being automatically loaded, you can uninstall them from
the node, or add rules to block them. On most Linux distributions, you can do that by
creating a file such as /etc/modprobe.d/kubernetes-blacklist.conf with contents like:
# DCCP is unlikely to be needed, has had multiple serious
# vulnerabilities, and is not well-maintained.
blacklist dccp
# SCTP is not used in most Kubernetes clusters, and has also had
# vulnerabilities in the past.
blacklist sctp
To block module loading more generically, you can use a Linux Security Module (such as
SELinux) to completely deny the module_request
permission to containers, preventing the
kernel from loading modules for containers under any circumstances. (Pods would still be
able to use modules that had been loaded manually, or modules that were loaded by the
kernel on behalf of some more-privileged process.)
Restricting network access
The network policies for a namespace allow application authors to restrict which pods in other namespaces may access pods and ports within their namespace. Many of the supported Kubernetes networking providers now respect network policy.
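For example, a minimal default-deny ingress policy, a common baseline to which specific allowances are then added, looks like this (the namespace name is a placeholder):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: my-namespace   # placeholder
spec:
  podSelector: {}           # selects every pod in the namespace
  policyTypes:
  - Ingress                 # no ingress rules are listed, so all ingress traffic is denied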
Quota and limit ranges can also be used to control whether users may request node ports or load-balanced services, which on many clusters can control whether those users' applications are visible outside of the cluster.

Additional protections may be available that control network rules on a per-plugin or per-environment basis, such as per-node firewalls, physically separating cluster nodes to prevent cross talk, or advanced networking policy.
Restricting cloud metadata API access
Cloud platforms (AWS, Azure, GCE, etc.) often expose metadata services locally to instances. By default these APIs are accessible by pods running on an instance and can contain cloud credentials for that node, or provisioning data such as kubelet credentials. These credentials can be used to escalate within the cluster or to other cloud services under the same account.
When running Kubernetes on a cloud platform, limit permissions given to instance credentials, use network policies to restrict pod access to the metadata API, and avoid using provisioning data to deliver secrets.
Controlling which nodes pods may access
By default, there are no restrictions on which nodes may run a pod. Kubernetes offers a rich set of policies for controlling the placement of pods onto nodes, as well as taint-based pod placement and eviction, that are available to end users. For many clusters, use of these policies to separate workloads can be a convention that authors adopt or enforce via tooling.
As an administrator, you can use the beta admission plugin PodNodeSelector to force pods within a namespace to default to, or require, a specific node selector. If end users cannot alter namespaces, this can strongly limit the placement of all of the pods in a specific workload.
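As a sketch, with the PodNodeSelector admission plugin enabled on the API server, an administrator could pin all pods in a namespace to a set of labeled nodes via the annotation the plugin reads (the namespace name and label are placeholders):

kubectl annotate namespace restricted \
  scheduler.alpha.kubernetes.io/node-selector="env=restricted"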
Protecting cluster components from compromise
This section describes some common patterns for protecting clusters from compromise.
Restrict access to etcd
Write access to the etcd backend for the API is equivalent to gaining root on the entire cluster, and read access can be used to escalate fairly quickly. Administrators should always use strong credentials from the API servers to their etcd server, such as mutual auth via TLS client certificates, and it is often recommended to isolate the etcd servers behind a firewall that only the API servers may access.
Enable audit logging
The audit logger is a beta feature that records actions taken by the API for later analysis in the event of a compromise. It is recommended to enable audit logging and archive the audit file on a secure server.
Restrict access to alpha or beta features
Alpha and beta Kubernetes features are in active development and may have limitations or bugs that result in security vulnerabilities. Always assess the value an alpha or beta feature may provide against the possible risk to your security posture. When in doubt, disable features you do not use.
Rotate infrastructure credentials frequently
The shorter the lifetime of a secret or credential the harder it is for an attacker to make use of that credential. Set short lifetimes on certificates and automate their rotation. Use an authentication provider that can control how long issued tokens are available and use short lifetimes where possible. If you use service-account tokens in external integrations, plan to rotate those tokens frequently. For example, once the bootstrap phase is complete, a bootstrap token used for setting up nodes should be revoked or its authorization removed.
Review third party integrations before enabling them
Many third party integrations to Kubernetes may alter the security profile of your cluster. When enabling an integration, always review the permissions that an extension requests before granting it access. For example, many security integrations may request access to view all secrets on your cluster which is effectively making that component a cluster admin. When in doubt, restrict the integration to functioning in a single namespace if possible.
Components that create pods may also be unexpectedly powerful if they can do so inside namespaces like the kube-system namespace, because those pods can gain access to service account secrets or run with elevated permissions if those service accounts are granted access to permissive pod security policies.
Encrypt secrets at rest
In general, the etcd database will contain any information accessible via the Kubernetes API and may grant an attacker significant visibility into the state of your cluster. Always encrypt your backups using a well reviewed backup and encryption solution, and consider using full disk encryption where possible.
Kubernetes supports encryption at rest, a feature
introduced in 1.7, and beta since 1.13. This will encrypt Secret
resources in etcd, preventing
parties that gain access to your etcd backups from viewing the content of those secrets. While
this feature is currently beta, it offers an additional level of defense when backups
are not encrypted or an attacker gains read access to etcd.
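A minimal encryption configuration using the local aescbc provider might look like the following sketch; the key shown is a placeholder, and you would generate your own (for example with head -c 32 /dev/urandom | base64):

apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
- resources:
  - secrets
  providers:
  - aescbc:
      keys:
      - name: key1
        secret: <base64-encoded 32-byte key>   # placeholder key material
  - identity: {}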
Receiving alerts for security updates and reporting vulnerabilities
Join the kubernetes-announce group for emails about security announcements. See the security reporting page for more on how to report vulnerabilities.
4.2.33 - Set Kubelet parameters via a config file
A subset of the Kubelet's configuration parameters may be set via an on-disk config file, as a substitute for command-line flags.
Providing parameters via a config file is the recommended approach because it simplifies node deployment and configuration management.
Create the config file
The subset of the Kubelet's configuration that can be configured via a file is defined by the KubeletConfiguration struct.
The configuration file must be a JSON or YAML representation of the parameters in this struct. Make sure the Kubelet has read permissions on the file.
Here is an example of what this file might look like:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
address: "192.168.0.8"
port: 20250
serializeImagePulls: false
evictionHard:
  memory.available: "200Mi"
In the example, the Kubelet is configured to serve on IP address 192.168.0.8 and port 20250, pull images in parallel, and evict Pods when available memory drops below 200Mi. All other Kubelet configuration values are left at their built-in defaults, unless overridden by flags. Command line flags which target the same value as a config file will override that value.
Start a Kubelet process configured via the config file
If you use kubeadm to initialize your cluster, use the kubelet configuration file when running kubeadm init. See configuring kubelet using kubeadm for details.
Start the Kubelet with the --config flag set to the path of the Kubelet's config file. The Kubelet will then load its config from this file.
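For example (the path is illustrative):

kubelet --config=/var/lib/kubelet/config.yaml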
Note that command line flags which target the same value as a config file will override that value. This helps ensure backwards compatibility with the command-line API.
Note that relative file paths in the Kubelet config file are resolved relative to the location of the Kubelet config file, whereas relative paths in command line flags are resolved relative to the Kubelet's current working directory.
Note that some default values differ between command-line flags and the Kubelet config file.
If --config is provided and the values are not specified via the command line, the defaults for the KubeletConfiguration version apply. In the above example, this version is kubelet.config.k8s.io/v1beta1.
What's next
- Learn more about kubelet configuration by checking the KubeletConfiguration reference.
4.2.34 - Share a Cluster with Namespaces
This page shows how to view, work in, and delete namespaces. The page also shows how to use Kubernetes namespaces to subdivide your cluster.
Before you begin
- Have an existing Kubernetes cluster.
- You have a basic understanding of Kubernetes Pods, Services, and Deployments.
Viewing namespaces
- List the current namespaces in a cluster using:
kubectl get namespaces
NAME STATUS AGE
default Active 11d
kube-system Active 11d
kube-public Active 11d
Kubernetes starts with three initial namespaces:
- default: The default namespace for objects with no other namespace
- kube-system: The namespace for objects created by the Kubernetes system
- kube-public: This namespace is created automatically and is readable by all users (including those not authenticated). This namespace is mostly reserved for cluster usage, in case some resources should be visible and readable publicly throughout the whole cluster. The public aspect of this namespace is only a convention, not a requirement.
You can also get the summary of a specific namespace using:
kubectl get namespaces <name>
Or you can get detailed information with:
kubectl describe namespaces <name>
Name: default
Labels: <none>
Annotations: <none>
Status: Active
No resource quota.
Resource Limits
Type Resource Min Max Default
---- -------- --- --- ---
Container cpu - - 100m
Note that these details show both resource quota (if present) as well as resource limit ranges.
Resource quota tracks aggregate usage of resources in the Namespace and allows cluster operators to define Hard resource usage limits that a Namespace may consume.
A limit range defines min/max constraints on the amount of resources a single entity can consume in a Namespace.
See Admission control: Limit Range
A namespace can be in one of two phases:
- Active: the namespace is in use
- Terminating: the namespace is being deleted, and cannot be used for new objects
For more details, see Namespace in the API reference.
Creating a new namespace
Avoid creating namespaces with the prefix kube-, since it is reserved for Kubernetes system namespaces.
- Create a new YAML file called my-namespace.yaml with the contents:

apiVersion: v1
kind: Namespace
metadata:
  name: <insert-namespace-name-here>
Then run:
kubectl create -f ./my-namespace.yaml
- Alternatively, you can create a namespace using the following command:

kubectl create namespace <insert-namespace-name-here>
The name of your namespace must be a valid DNS label.
There's an optional field finalizers, which allows observables to purge resources whenever the namespace is deleted. Keep in mind that if you specify a nonexistent finalizer, the namespace will be created but will get stuck in the Terminating state if the user tries to delete it. More information on finalizers can be found in the namespace design doc.
Deleting a namespace
Delete a namespace with:

kubectl delete namespaces <insert-some-namespace-name>

This delete is asynchronous, so for a time you will see the namespace in the Terminating state.
Subdividing your cluster using Kubernetes namespaces
- Understand the default namespace
By default, a Kubernetes cluster will instantiate a default namespace when provisioning the cluster to hold the default set of Pods, Services, and Deployments used by the cluster.
Assuming you have a fresh cluster, you can introspect the available namespaces by doing the following:
kubectl get namespaces
NAME      STATUS    AGE
default   Active    13m
- Create new namespaces
For this exercise, we will create two additional Kubernetes namespaces to hold our content.
In a scenario where an organization is using a shared Kubernetes cluster for development and production use cases:
The development team would like to maintain a space in the cluster where they can get a view on the list of Pods, Services, and Deployments they use to build and run their application. In this space, Kubernetes resources come and go, and the restrictions on who can or cannot modify resources are relaxed to enable agile development.
The operations team would like to maintain a space in the cluster where they can enforce strict procedures on who can or cannot manipulate the set of Pods, Services, and Deployments that run the production site.
One pattern this organization could follow is to partition the Kubernetes cluster into two namespaces: development and production.

Let's create two new namespaces to hold our work.
Create the development namespace using kubectl:

kubectl create -f https://k8s.io/examples/admin/namespace-dev.json
And then let's create the production namespace using kubectl:

kubectl create -f https://k8s.io/examples/admin/namespace-prod.json
To be sure things are right, list all of the namespaces in our cluster.
kubectl get namespaces --show-labels
NAME          STATUS    AGE    LABELS
default       Active    32m    <none>
development   Active    29s    name=development
production    Active    23s    name=production
- Create pods in each namespace
A Kubernetes namespace provides the scope for Pods, Services, and Deployments in the cluster.
Users interacting with one namespace do not see the content in another namespace.
To demonstrate this, let's spin up a simple Deployment and Pods in the development namespace.

kubectl create deployment snowflake --image=k8s.gcr.io/serve_hostname -n=development --replicas=2
We have created a deployment whose replica size is 2 that is running the pod called snowflake with a basic container that serves the hostname.

kubectl get deployment -n=development
NAME        READY   UP-TO-DATE   AVAILABLE   AGE
snowflake   2/2     2            2           2m
kubectl get pods -l app=snowflake -n=development
NAME                         READY   STATUS    RESTARTS   AGE
snowflake-3968820950-9dgr8   1/1     Running   0          2m
snowflake-3968820950-vgc4n   1/1     Running   0          2m
And this is great, developers are able to do what they want, and they do not have to worry about affecting content in the production namespace.

Let's switch to the production namespace and show how resources in one namespace are hidden from the other. The production namespace should be empty, and the following commands should return nothing.

kubectl get deployment -n=production
kubectl get pods -n=production
Production likes to run cattle, so let's create some cattle pods.
kubectl create deployment cattle --image=k8s.gcr.io/serve_hostname -n=production
kubectl scale deployment cattle --replicas=5 -n=production
kubectl get deployment -n=production
NAME     READY   UP-TO-DATE   AVAILABLE   AGE
cattle   5/5     5            5           10s
kubectl get pods -l app=cattle -n=production
NAME                      READY   STATUS    RESTARTS   AGE
cattle-2263376956-41xy6   1/1     Running   0          34s
cattle-2263376956-kw466   1/1     Running   0          34s
cattle-2263376956-n4v97   1/1     Running   0          34s
cattle-2263376956-p5p3i   1/1     Running   0          34s
cattle-2263376956-sxpth   1/1     Running   0          34s
At this point, it should be clear that the resources users create in one namespace are hidden from the other namespace.
As the policy support in Kubernetes evolves, we will extend this scenario to show how you can provide different authorization rules for each namespace.
Understanding the motivation for using namespaces
A single cluster should be able to satisfy the needs of multiple users or groups of users (henceforth a 'user community').
Kubernetes namespaces help different projects, teams, or customers to share a Kubernetes cluster.
It does this by providing the following:
- A scope for Names.
- A mechanism to attach authorization and policy to a subsection of the cluster.
Use of multiple namespaces is optional.
Each user community wants to be able to work in isolation from other communities.
Each user community has its own:
- resources (pods, services, replication controllers, etc.)
- policies (who can or cannot perform actions in their community)
- constraints (this community is allowed this much quota, etc.)
A cluster operator may create a Namespace for each unique user community.
The Namespace provides a unique scope for:
- named resources (to avoid basic naming collisions)
- delegated management authority to trusted users
- ability to limit community resource consumption
Use cases include:
- As a cluster operator, I want to support multiple user communities on a single cluster.
- As a cluster operator, I want to delegate authority to partitions of the cluster to trusted users in those communities.
- As a cluster operator, I want to limit the amount of resources each community can consume in order to limit the impact to other communities using the cluster.
- As a cluster user, I want to interact with resources that are pertinent to my user community in isolation of what other user communities are doing on the cluster.
Understanding namespaces and DNS
When you create a Service, it creates a corresponding DNS entry. This entry is of the form <service-name>.<namespace-name>.svc.cluster.local, which means that if a container uses <service-name> it will resolve to the service which is local to the namespace. This is useful for using the same configuration across multiple namespaces such as Development, Staging, and Production. If you want to reach across namespaces, you need to use the fully qualified domain name (FQDN).
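For example, assuming a Service named snowflake exists in both the development and production namespaces (a hypothetical setup), a pod in development could compare the two like this:

nslookup snowflake                                 # resolves to snowflake.development.svc.cluster.local
nslookup snowflake.production.svc.cluster.local    # the FQDN reaches the production copy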
What's next
- Learn more about setting the namespace preference.
- Learn more about setting the namespace for a request
- See namespaces design.
4.2.35 - Upgrade A Cluster
This page provides an overview of the steps you should follow to upgrade a Kubernetes cluster.
The way that you upgrade a cluster depends on how you initially deployed it and on any subsequent changes.
At a high level, the steps you perform are:
- Upgrade the control plane
- Upgrade the nodes in your cluster
- Upgrade clients such as kubectl
- Adjust manifests and other resources based on the API changes that accompany the new Kubernetes version
Before you begin
You must have an existing cluster. This page is about upgrading from Kubernetes 1.22 to Kubernetes 1.23. If your cluster is not currently running Kubernetes 1.22 then please check the documentation for the version of Kubernetes that you plan to upgrade to.
Upgrade approaches
kubeadm
If your cluster was deployed using the kubeadm tool, refer to Upgrading kubeadm clusters for detailed information on how to upgrade the cluster.

Once you have upgraded the cluster, remember to install the latest version of kubectl.
Manual deployments
You should manually update the control plane following this sequence:
- etcd (all instances)
- kube-apiserver (all control plane hosts)
- kube-controller-manager
- kube-scheduler
- cloud controller manager, if you use one
At this point you should install the latest version of kubectl.
For each node in your cluster, drain that node and then either replace it with a new node that uses the 1.23 kubelet, or upgrade the kubelet on that node and bring the node back into service.
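A sketch of that per-node loop using kubectl (the node name is a placeholder):

kubectl drain node-1 --ignore-daemonsets   # evict pods and mark the node unschedulable
# ...upgrade the kubelet on node-1, or replace the machine entirely...
kubectl uncordon node-1                    # allow pods to be scheduled on the node again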
Other deployments
Refer to the documentation for your cluster deployment tool to learn the recommended setup steps for maintenance.
Post-upgrade tasks
Switch your cluster's storage API version
The objects that are serialized into etcd for a cluster's internal representation of the Kubernetes resources active in the cluster are written using a particular version of the API.
When the supported API changes, these objects may need to be rewritten in the newer API. Failure to do this will eventually result in resources that are no longer decodable or usable by the Kubernetes API server.
For each affected object, fetch it using the latest supported API and then write it back also using the latest supported API.
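As a sketch of that read-rewrite loop, here applied to Deployments (the same pattern the secrets re-encryption command later in this document uses):

kubectl get deployments --all-namespaces -o json | kubectl replace -f -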
Update manifests
Upgrading to a new Kubernetes version can provide new APIs.
You can use the kubectl convert command to convert manifests between different API versions. For example:

kubectl convert -f pod.yaml --output-version v1

The kubectl tool replaces the contents of pod.yaml with a manifest that sets kind to Pod (unchanged), but with a revised apiVersion.
4.2.36 - Use Cascading Deletion in a Cluster
This page shows you how to specify the type of cascading deletion to use in your cluster during garbage collection.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
You also need to create a sample Deployment to experiment with the different types of cascading deletion. You will need to recreate the Deployment for each type.
Check owner references on your pods
Check that the ownerReferences
field is present on your pods:
kubectl get pods -l app=nginx --output=yaml
The output has an ownerReferences
field similar to this:
apiVersion: v1
...
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: nginx-deployment-6b474476c4
    uid: 4fdcd81c-bd5d-41f7-97af-3a3b759af9a7
...
Use foreground cascading deletion
By default, Kubernetes uses background cascading deletion to delete dependents of an object. You can switch to foreground cascading deletion using either kubectl or the Kubernetes API, depending on the Kubernetes version your cluster runs. To check the version, enter kubectl version.

You can delete objects using foreground cascading deletion using kubectl or the Kubernetes API.
Using kubectl
Run the following command:
kubectl delete deployment nginx-deployment --cascade=foreground
Using the Kubernetes API
- Start a local proxy session:

kubectl proxy --port=8080

- Use curl to trigger deletion:

curl -X DELETE localhost:8080/apis/apps/v1/namespaces/default/deployments/nginx-deployment \
  -d '{"kind":"DeleteOptions","apiVersion":"v1","propagationPolicy":"Foreground"}' \
  -H "Content-Type: application/json"

The output contains a foregroundDeletion finalizer like this:

"kind": "Deployment",
"apiVersion": "apps/v1",
"metadata": {
  "name": "nginx-deployment",
  "namespace": "default",
  "uid": "d1ce1b02-cae8-4288-8a53-30e84d8fa505",
  "resourceVersion": "1363097",
  "creationTimestamp": "2021-07-08T20:24:37Z",
  "deletionTimestamp": "2021-07-08T20:27:39Z",
  "finalizers": [
    "foregroundDeletion"
  ]
...
Use background cascading deletion
- Create a sample Deployment.
- Use either kubectl or the Kubernetes API to delete the Deployment, depending on the Kubernetes version your cluster runs. To check the version, enter kubectl version.

You can delete objects using background cascading deletion using kubectl or the Kubernetes API.

Kubernetes uses background cascading deletion by default, and does so even if you run the following commands without the --cascade flag or the propagationPolicy argument.
Using kubectl
Run the following command:
kubectl delete deployment nginx-deployment --cascade=background
Using the Kubernetes API
- Start a local proxy session:

kubectl proxy --port=8080

- Use curl to trigger deletion:

curl -X DELETE localhost:8080/apis/apps/v1/namespaces/default/deployments/nginx-deployment \
  -d '{"kind":"DeleteOptions","apiVersion":"v1","propagationPolicy":"Background"}' \
  -H "Content-Type: application/json"

The output is similar to this:

"kind": "Status",
"apiVersion": "v1",
...
"status": "Success",
"details": {
  "name": "nginx-deployment",
  "group": "apps",
  "kind": "deployments",
  "uid": "cc9eefb9-2d49-4445-b1c1-d261c9396456"
}
In versions of kubectl prior to v1.20, the --cascade flag was boolean, and the equivalent command there is kubectl delete deployment nginx-deployment --cascade=true. The Kubernetes API steps are the same as shown above.
Delete owner objects and orphan dependents
By default, when you tell Kubernetes to delete an object, the controller also deletes dependent objects. You can make Kubernetes orphan these dependents using kubectl or the Kubernetes API, depending on the Kubernetes version your cluster runs. To check the version, enter kubectl version.
Using kubectl
Run the following command:
kubectl delete deployment nginx-deployment --cascade=orphan
Using the Kubernetes API
- Start a local proxy session:

kubectl proxy --port=8080

- Use curl to trigger deletion:

curl -X DELETE localhost:8080/apis/apps/v1/namespaces/default/deployments/nginx-deployment \
  -d '{"kind":"DeleteOptions","apiVersion":"v1","propagationPolicy":"Orphan"}' \
  -H "Content-Type: application/json"

The output contains orphan in the finalizers field, similar to this:

"kind": "Deployment",
"apiVersion": "apps/v1",
"namespace": "default",
"uid": "6f577034-42a0-479d-be21-78018c466f1f",
"creationTimestamp": "2021-07-09T16:46:37Z",
"deletionTimestamp": "2021-07-09T16:47:08Z",
"deletionGracePeriodSeconds": 0,
"finalizers": [
  "orphan"
],
...
You can check that the Pods managed by the Deployment are still running:
kubectl get pods -l app=nginx
What's next
- Learn about owners and dependents in Kubernetes.
- Learn about Kubernetes finalizers.
- Learn about garbage collection.
4.2.37 - Using a KMS provider for data encryption
This page shows how to configure a Key Management Service (KMS) provider and plugin to enable secret data encryption.
Before you begin
- You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds. To check the version, enter kubectl version.
- Kubernetes version 1.10.0 or later is required
- etcd v3 or later is required
Kubernetes v1.12 [beta]
The KMS encryption provider uses an envelope encryption scheme to encrypt data in etcd. The data is encrypted using a data encryption key (DEK); a new DEK is generated for each encryption. The DEKs are encrypted with a key encryption key (KEK) that is stored and managed in a remote KMS. The KMS provider uses gRPC to communicate with a specific KMS plugin. The KMS plugin, which is implemented as a gRPC server and deployed on the same host(s) as the Kubernetes master(s), is responsible for all communication with the remote KMS.
Configuring the KMS provider
To configure a KMS provider on the API server, include a provider of type kms in the providers array in the encryption configuration file and set the following properties:
- name: Display name of the KMS plugin.
- endpoint: Listen address of the gRPC server (KMS plugin). The endpoint is a UNIX domain socket.
- cachesize: Number of data encryption keys (DEKs) to be cached in the clear. When cached, DEKs can be used without another call to the KMS; whereas DEKs that are not cached require a call to the KMS to unwrap.
- timeout: How long kube-apiserver should wait for the kms-plugin to respond before returning an error (default is 3 seconds).
See Understanding the encryption at rest configuration.
Implementing a KMS plugin
To implement a KMS plugin, you can develop a new plugin gRPC server or enable a KMS plugin already provided by your cloud provider. You then integrate the plugin with the remote KMS and deploy it on the Kubernetes master.
Enabling the KMS supported by your cloud provider
Refer to your cloud provider for instructions on enabling the cloud provider-specific KMS plugin.
Developing a KMS plugin gRPC server
You can develop a KMS plugin gRPC server using a stub file available for Go. For other languages, you use a proto file to create a stub file that you can use to develop the gRPC server code.
-
Using Go: Use the functions and data structures in the stub file: service.pb.go to develop the gRPC server code
-
Using languages other than Go: Use the protoc compiler with the proto file: service.proto to generate a stub file for the specific language
Then use the functions and data structures in the stub file to develop the server code.
Notes:
- KMS plugin version: v1beta1. In response to the procedure call Version, a compatible KMS plugin should return v1beta1 as VersionResponse.version.
- Message version: v1beta1. All messages from the KMS provider have the version field set to the current version v1beta1.
- Protocol: UNIX domain socket (unix). The gRPC server should listen at a UNIX domain socket.
Integrating a KMS plugin with the remote KMS
The KMS plugin can communicate with the remote KMS using any protocol supported by the KMS. All configuration data, including authentication credentials the KMS plugin uses to communicate with the remote KMS, are stored and managed by the KMS plugin independently. The KMS plugin can encode the ciphertext with additional metadata that may be required before sending it to the KMS for decryption.
Deploying the KMS plugin
Ensure that the KMS plugin runs on the same host(s) as the Kubernetes master(s).
Encrypting your data with the KMS provider
To encrypt the data:
- Create a new encryption configuration file using the appropriate properties for the kms provider:

apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
- resources:
  - secrets
  providers:
  - kms:
      name: myKmsPlugin
      endpoint: unix:///tmp/socketfile.sock
      cachesize: 100
      timeout: 3s
  - identity: {}
- Set the --encryption-provider-config flag on the kube-apiserver to point to the location of the configuration file.
- Restart your API server.
Verifying that the data is encrypted
Data is encrypted when written to etcd. After restarting your kube-apiserver, any newly created or updated secret should be encrypted when stored. To verify, you can use the etcdctl command line program to retrieve the contents of your secret.
- Create a new secret called secret1 in the default namespace:

kubectl create secret generic secret1 -n default --from-literal=mykey=mydata

- Using the etcdctl command line, read that secret out of etcd:

ETCDCTL_API=3 etcdctl get /kubernetes.io/secrets/default/secret1 [...] | hexdump -C

where [...] must be the additional arguments for connecting to the etcd server.

- Verify the stored secret is prefixed with k8s:enc:kms:v1:, which indicates that the kms provider has encrypted the resulting data.
- Verify that the secret is correctly decrypted when retrieved via the API:

kubectl describe secret secret1 -n default

The output should match mykey: mydata.
Ensuring all secrets are encrypted
Because secrets are encrypted on write, performing an update on a secret encrypts that content.
The following command reads all secrets and then updates them to apply server side encryption. If an error occurs due to a conflicting write, retry the command. For larger clusters, you may wish to subdivide the secrets by namespace or script an update.
kubectl get secrets --all-namespaces -o json | kubectl replace -f -
Switching from a local encryption provider to the KMS provider
To switch from a local encryption provider to the kms provider and re-encrypt all of the secrets:

- Add the kms provider as the first entry in the configuration file as shown in the following example.

apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
- resources:
  - secrets
  providers:
  - kms:
      name: myKmsPlugin
      endpoint: unix:///tmp/socketfile.sock
      cachesize: 100
  - aescbc:
      keys:
      - name: key1
        secret: <BASE 64 ENCODED SECRET>
- Restart all kube-apiserver processes.
- Run the following command to force all secrets to be re-encrypted using the kms provider:

kubectl get secrets --all-namespaces -o json | kubectl replace -f -
Disabling encryption at rest
To disable encryption at rest:
- Place the identity provider as the first entry in the configuration file:

apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
- resources:
  - secrets
  providers:
  - identity: {}
  - kms:
      name: myKmsPlugin
      endpoint: unix:///tmp/socketfile.sock
      cachesize: 100
- Restart all kube-apiserver processes.
- Run the following command to force all secrets to be decrypted:

kubectl get secrets --all-namespaces -o json | kubectl replace -f -
4.2.38 - Using CoreDNS for Service Discovery
This page describes the CoreDNS upgrade process and how to install CoreDNS instead of kube-dns.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
Your Kubernetes server must be at or later than version v1.9. To check the version, enter kubectl version.
About CoreDNS
CoreDNS is a flexible, extensible DNS server that can serve as the Kubernetes cluster DNS. Like Kubernetes, the CoreDNS project is hosted by the CNCF.
You can use CoreDNS instead of kube-dns in your cluster by replacing kube-dns in an existing deployment, or by using tools like kubeadm that will deploy and upgrade the cluster for you.
Installing CoreDNS
For manual deployment or replacement of kube-dns, see the documentation at the CoreDNS GitHub project.
Migrating to CoreDNS
Upgrading an existing cluster with kubeadm
In Kubernetes version 1.21, kubeadm removed its support for kube-dns as a DNS application. For kubeadm v1.23, the only supported cluster DNS application is CoreDNS.

You can move to CoreDNS when you use kubeadm to upgrade a cluster that is using kube-dns. In this case, kubeadm generates the CoreDNS configuration ("Corefile") based upon the kube-dns ConfigMap, preserving configurations for stub domains and upstream name servers.
Upgrading CoreDNS
You can check the version of CoreDNS that kubeadm installs for each version of Kubernetes in the page CoreDNS version in Kubernetes.
CoreDNS can be upgraded manually in case you want to only upgrade CoreDNS or use your own custom image. There is a helpful guideline and walkthrough available to ensure a smooth upgrade. Make sure the existing CoreDNS configuration ("Corefile") is retained when upgrading your cluster.
If you are upgrading your cluster using the kubeadm tool, kubeadm can take care of retaining the existing CoreDNS configuration automatically.
Tuning CoreDNS
When resource utilisation is a concern, it may be useful to tune the configuration of CoreDNS. For more details, check out the documentation on scaling CoreDNS.
What's next
You can configure CoreDNS to support many more use cases than kube-dns does by modifying the CoreDNS configuration ("Corefile"). For more information, see the documentation for the kubernetes CoreDNS plugin, or read Custom DNS Entries for Kubernetes in the CoreDNS blog.
4.2.39 - Using NodeLocal DNSCache in Kubernetes clusters
Kubernetes v1.18 [stable]
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
To check the version, enter kubectl version.
Introduction
NodeLocal DNSCache improves Cluster DNS performance by running a DNS caching agent on cluster nodes as a DaemonSet. In today's architecture, Pods in ClusterFirst DNS mode reach out to a kube-dns serviceIP for DNS queries. This is translated to a kube-dns/CoreDNS endpoint via iptables rules added by kube-proxy. With this new architecture, Pods will reach out to the DNS caching agent running on the same node, thereby avoiding iptables DNAT rules and connection tracking. The local caching agent will query the kube-dns service for cache misses of cluster hostnames (cluster.local suffix by default).
Motivation
-
With the current DNS architecture, it is possible that Pods with the highest DNS QPS have to reach out to a different node, if there is no local kube-dns/CoreDNS instance. Having a local cache will help improve the latency in such scenarios.
-
Skipping iptables DNAT and connection tracking will help reduce conntrack races and avoid UDP DNS entries filling up conntrack table.
- Connections from the local caching agent to the kube-dns service can be upgraded to TCP. TCP conntrack entries will be removed on connection close, in contrast with UDP entries that have to time out (the default nf_conntrack_udp_timeout is 30 seconds).
Upgrading DNS queries from UDP to TCP would reduce tail latency attributed to dropped UDP packets and DNS timeouts usually up to 30s (3 retries + 10s timeout). Since the nodelocal cache listens for UDP DNS queries, applications don't need to be changed.
-
Metrics & visibility into dns requests at a node level.
-
Negative caching can be re-enabled, thereby reducing number of queries to kube-dns service.
Architecture Diagram
This is the path followed by DNS Queries after NodeLocal DNSCache is enabled:
Configuration
This feature can be enabled using the following steps:
- Prepare a manifest similar to the sample nodelocaldns.yaml and save it as nodelocaldns.yaml.
- If using IPv6, the CoreDNS configuration file needs to enclose all the IPv6 addresses in square brackets if used in IP:Port format. If you are using the sample manifest from the previous point, this requires you to modify configuration line L70 like this: health [__PILLAR__LOCAL__DNS__]:8080
- Substitute the variables in the manifest with the right values:

  kubedns=`kubectl get svc kube-dns -n kube-system -o jsonpath={.spec.clusterIP}`
  domain=<cluster-domain>
  localdns=<node-local-address>

  <cluster-domain> is "cluster.local" by default. <node-local-address> is the local listen IP address chosen for NodeLocal DNSCache.

  - If kube-proxy is running in IPTABLES mode:

    sed -i "s/__PILLAR__LOCAL__DNS__/$localdns/g; s/__PILLAR__DNS__DOMAIN__/$domain/g; s/__PILLAR__DNS__SERVER__/$kubedns/g" nodelocaldns.yaml

    __PILLAR__CLUSTER__DNS__ and __PILLAR__UPSTREAM__SERVERS__ will be populated by the node-local-dns pods. In this mode, node-local-dns pods listen on both the kube-dns service IP as well as <node-local-address>, so pods can look up DNS records using either IP address.

  - If kube-proxy is running in IPVS mode:

    sed -i "s/__PILLAR__LOCAL__DNS__/$localdns/g; s/__PILLAR__DNS__DOMAIN__/$domain/g; s/,__PILLAR__DNS__SERVER__//g; s/__PILLAR__CLUSTER__DNS__/$kubedns/g" nodelocaldns.yaml

    In this mode, node-local-dns pods listen only on <node-local-address>. The node-local-dns interface cannot bind the kube-dns cluster IP since the interface used for IPVS loadbalancing already uses this address. __PILLAR__UPSTREAM__SERVERS__ will be populated by the node-local-dns pods.
- Run kubectl create -f nodelocaldns.yaml
- If using kube-proxy in IPVS mode, the --cluster-dns flag to kubelet needs to be modified to use the <node-local-address> that NodeLocal DNSCache is listening on. Otherwise, there is no need to modify the value of the --cluster-dns flag, since NodeLocal DNSCache listens on both the kube-dns service IP as well as <node-local-address>.
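Putting the substitution steps together for a kube-proxy iptables-mode cluster might look like this; 169.254.20.10 is a commonly chosen link-local listen address, but any otherwise-unused local address works:

kubedns=$(kubectl get svc kube-dns -n kube-system -o jsonpath={.spec.clusterIP})
domain=cluster.local
localdns=169.254.20.10
sed -i "s/__PILLAR__LOCAL__DNS__/$localdns/g; s/__PILLAR__DNS__DOMAIN__/$domain/g; s/__PILLAR__DNS__SERVER__/$kubedns/g" nodelocaldns.yaml
kubectl create -f nodelocaldns.yaml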
Once enabled, node-local-dns Pods will run in the kube-system namespace on each of the cluster nodes. This Pod runs CoreDNS in cache mode, so all CoreDNS metrics exposed by the different plugins will be available on a per-node basis.
You can disable this feature by removing the DaemonSet, using kubectl delete -f <manifest>. You should also revert any changes you made to the kubelet configuration.
StubDomains and Upstream server Configuration
StubDomains and upstream servers specified in the kube-dns ConfigMap in the kube-system namespace are automatically picked up by node-local-dns pods. The ConfigMap contents need to follow the format shown in the example.

The node-local-dns ConfigMap can also be modified directly with the stubDomain configuration in the Corefile format. Some cloud providers might not allow modifying the node-local-dns ConfigMap directly. In those cases, the kube-dns ConfigMap can be updated.
Setting memory limits
node-local-dns pods use memory for storing cache entries and processing queries. Since they do not watch Kubernetes objects, the cluster size or the number of Services/Endpoints do not directly affect memory usage. Memory usage is influenced by the DNS query pattern. From CoreDNS docs,
The default cache size is 10000 entries, which uses about 30 MB when completely filled.
This would be the memory usage for each server block (if the cache gets completely filled). Memory usage can be reduced by specifying smaller cache sizes.
The number of concurrent queries is linked to the memory demand, because each extra goroutine used for handling a query requires an amount of memory. You can set an upper limit using the max_concurrent option in the forward plugin.
If a node-local-dns pod attempts to use more memory than is available (because of total system resources, or because of a configured resource limit), the operating system may shut down that pod's container. If this happens, the container that is terminated (“OOMKilled”) does not clean up the custom packet filtering rules that it previously added during startup. The node-local-dns container should get restarted (since managed as part of a DaemonSet), but this will lead to a brief DNS downtime each time that the container fails: the packet filtering rules direct DNS queries to a local Pod that is unhealthy.
You can determine a suitable memory limit by running node-local-dns pods without a limit and measuring the peak usage. You can also set up and use a VerticalPodAutoscaler in recommender mode, and then check its recommendations.
4.2.40 - Using sysctls in a Kubernetes Cluster
Kubernetes v1.21 [stable]
This document describes how to configure and use kernel parameters within a Kubernetes cluster using the sysctl interface.
The kubelet supports the use of either / or . as separators for sysctl names. For example, you can represent the same sysctl name as kernel.shm_rmid_forced using a period as the separator, or as kernel/shm_rmid_forced using a slash as a separator. For more sysctl parameter conversion method details, please refer to the page sysctl.d(5) from the Linux man-pages project. The Setting Sysctls for a Pod and PodSecurityPolicy features do not yet support setting sysctls with slashes.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
For some steps, you also need to be able to reconfigure the command line options for the kubelets running on your cluster.
Listing all Sysctl Parameters
In Linux, the sysctl interface allows an administrator to modify kernel
parameters at runtime. Parameters are available via the /proc/sys/
virtual
process file system. The parameters cover various subsystems such as:
- kernel (common prefix: kernel.)
- networking (common prefix: net.)
- virtual memory (common prefix: vm.)
- MDADM (common prefix: dev.)
- More subsystems are described in Kernel docs.
To get a list of all parameters, you can run
sudo sysctl -a
Enabling Unsafe Sysctls
Sysctls are grouped into safe and unsafe sysctls. In addition to proper namespacing, a safe sysctl must be properly isolated between pods on the same node. This means that setting a safe sysctl for one pod
- must not have any influence on any other pod on the node
- must not allow harm to the node's health
- must not allow gaining CPU or memory resources outside of the resource limits of a pod
By far, most of the namespaced sysctls are not necessarily considered safe. The following sysctls are supported in the safe set:
- kernel.shm_rmid_forced,
- net.ipv4.ip_local_port_range,
- net.ipv4.tcp_syncookies,
- net.ipv4.ping_group_range (since Kubernetes 1.18),
- net.ipv4.ip_unprivileged_port_start (since Kubernetes 1.22).

net.ipv4.tcp_syncookies is not namespaced on Linux kernel version 4.4 or lower.
This list will be extended in future Kubernetes versions when the kubelet supports better isolation mechanisms.
All safe sysctls are enabled by default.
All unsafe sysctls are disabled by default and must be allowed manually by the cluster admin on a per-node basis. Pods with disabled unsafe sysctls will be scheduled, but will fail to launch.
With the warning above in mind, the cluster admin can allow certain unsafe sysctls for very special situations such as high-performance or real-time application tuning. Unsafe sysctls are enabled on a node-by-node basis with a flag of the kubelet; for example:
kubelet --allowed-unsafe-sysctls \
'kernel.msg*,net.core.somaxconn' ...
For Minikube, this can be done via the extra-config
flag:
minikube start --extra-config="kubelet.allowed-unsafe-sysctls=kernel.msg*,net.core.somaxconn"...
Only namespaced sysctls can be enabled this way.
Setting Sysctls for a Pod
A number of sysctls are namespaced in today's Linux kernels. This means that they can be set independently for each pod on a node. Only namespaced sysctls are configurable via the pod securityContext within Kubernetes.
The following sysctls are known to be namespaced. This list could change in future versions of the Linux kernel.
- kernel.shm*
- kernel.msg*
- kernel.sem
- fs.mqueue.*
- The parameters under net.* that can be set in container networking namespace. However, there are exceptions (e.g., net.netfilter.nf_conntrack_max and net.netfilter.nf_conntrack_expect_max can be set in container networking namespace but they are unnamespaced).
Sysctls with no namespace are called node-level sysctls. If you need to set them, you must manually configure them on each node's operating system, or by using a DaemonSet with privileged containers.
Use the pod securityContext to configure namespaced sysctls. The securityContext applies to all containers in the same pod.
This example uses the pod securityContext to set a safe sysctl kernel.shm_rmid_forced and two unsafe sysctls, net.core.somaxconn and kernel.msgmax. There is no distinction between safe and unsafe sysctls in the specification.
apiVersion: v1
kind: Pod
metadata:
  name: sysctl-example
spec:
  securityContext:
    sysctls:
    - name: kernel.shm_rmid_forced
      value: "0"
    - name: net.core.somaxconn
      value: "1024"
    - name: kernel.msgmax
      value: "65536"
...
It is good practice to consider nodes with special sysctl settings as tainted within a cluster, and only schedule pods onto them which need those sysctl settings. It is suggested to use the Kubernetes taints and tolerations feature to implement this.

A pod with the unsafe sysctls will fail to launch on any node which has not enabled those two unsafe sysctls explicitly. As with node-level sysctls, it is recommended to use the taints and tolerations feature to schedule those pods onto the right nodes.
PodSecurityPolicy
Kubernetes v1.21 [deprecated]
You can further control which sysctls can be set in pods by specifying lists of sysctls or sysctl patterns in the forbiddenSysctls and/or allowedUnsafeSysctls fields of the PodSecurityPolicy. A sysctl pattern ends with a * character, such as kernel.*. A * character on its own matches all sysctls.
By default, all safe sysctls are allowed.
Both forbiddenSysctls and allowedUnsafeSysctls are lists of plain sysctl names or sysctl patterns (which end with *). The string * matches all sysctls.
The forbiddenSysctls field excludes specific sysctls. You can forbid a combination of safe and unsafe sysctls in the list. To forbid setting any sysctls, use * on its own.
If you specify any unsafe sysctl in the allowedUnsafeSysctls field and it is not present in the forbiddenSysctls field, that sysctl can be used in Pods using this PodSecurityPolicy. To allow all unsafe sysctls in the PodSecurityPolicy to be set, use * on its own.
Do not configure these two fields such that there is overlap, meaning that a given sysctl is both allowed and forbidden.

If you allow unsafe sysctls via the allowedUnsafeSysctls field in a PodSecurityPolicy, any pod using such a sysctl will fail to start if the sysctl is not also allowed via the --allowed-unsafe-sysctls kubelet flag on that node.
This example allows unsafe sysctls prefixed with kernel.msg to be set and disallows setting of the kernel.shm_rmid_forced sysctl.
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: sysctl-psp
spec:
  allowedUnsafeSysctls:
  - kernel.msg*
  forbiddenSysctls:
  - kernel.shm_rmid_forced
...
4.2.41 - Utilizing the NUMA-aware Memory Manager
Kubernetes v1.22 [beta]
The Kubernetes Memory Manager enables guaranteed memory (and hugepages) allocation for pods in the Guaranteed QoS class.
The Memory Manager employs hint generation protocol to yield the most suitable NUMA affinity for a pod. The Memory Manager feeds the central manager (Topology Manager) with these affinity hints. Based on both the hints and Topology Manager policy, the pod is rejected or admitted to the node.
Moreover, the Memory Manager ensures that the memory which a pod requests is allocated from a minimum number of NUMA nodes.
The Memory Manager is only pertinent to Linux based hosts.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
Your Kubernetes server must be at or later than version v1.21. To check the version, enter kubectl version.
To align memory resources with other requested resources in a Pod spec:
- the CPU Manager should be enabled and proper CPU Manager policy should be configured on a Node. See control CPU Management Policies;
- the Topology Manager should be enabled and proper Topology Manager policy should be configured on a Node. See control Topology Management Policies.
Starting from v1.22, the Memory Manager is enabled by default through the MemoryManager feature gate. Before v1.22, the kubelet must be started with the following flag in order to enable the Memory Manager feature:

--feature-gates=MemoryManager=true
How the Memory Manager Operates
The Memory Manager currently offers guaranteed memory (and hugepages) allocation for Pods in the Guaranteed QoS class. To immediately put the Memory Manager into operation, follow the guidelines in the section Memory Manager configuration, and subsequently, prepare and deploy a Guaranteed pod as illustrated in the section Placing a Pod in the Guaranteed QoS class.
The Memory Manager is a Hint Provider, and it provides topology hints for the Topology Manager which then aligns the requested resources according to these topology hints. It also enforces cgroups (i.e. cpuset.mems) for pods.
The complete flow diagram concerning pod admission and the deployment process is illustrated in Memory Manager KEP: Design Overview.
During this process, the Memory Manager updates its internal counters stored in Node Map and Memory Maps to manage guaranteed memory allocation.
The Memory Manager updates the Node Map during startup and runtime as follows.
Startup
This occurs once a node administrator employs --reserved-memory (section Reserved memory flag). In this case, the Node Map becomes updated to reflect this reservation, as illustrated in Memory Manager KEP: Memory Maps at start-up (with examples).

The administrator must provide the --reserved-memory flag when the Static policy is configured.
Runtime
Reference Memory Manager KEP: Memory Maps at runtime (with examples) illustrates how a successful pod deployment affects the Node Map, and it also relates to how potential Out-of-Memory (OOM) situations are handled further by Kubernetes or operating system.
An important topic in the context of Memory Manager operation is the management of NUMA groups. Each time a pod's memory request exceeds the capacity of a single NUMA node, the Memory Manager attempts to create a group that comprises several NUMA nodes and features extended memory capacity. The problem has been solved as elaborated in Memory Manager KEP: How to enable the guaranteed memory allocation over many NUMA nodes?. Also, Memory Manager KEP: Simulation - how the Memory Manager works? (by examples) illustrates how the management of groups occurs.
Memory Manager configuration
Other Managers should be pre-configured first. Next, the Memory Manager feature should be enabled and run with the Static policy (section Static policy). Optionally, some amount of memory can be reserved for system or kubelet processes to increase node stability (section Reserved memory flag).
Policies
Memory Manager supports two policies. You can select a policy via a kubelet flag --memory-manager-policy:
- None (default)
- Static
None policy
This is the default policy and does not affect the memory allocation in any way. It acts the same as if the Memory Manager is not present at all.
The None policy returns a default topology hint. This special hint denotes that the Hint Provider (the Memory Manager in this case) has no preference for NUMA affinity with any resource.
Static policy
In the case of a Guaranteed pod, the Static Memory Manager policy returns topology hints relating to the set of NUMA nodes where the memory can be guaranteed, and reserves the memory through updating the internal NodeMap object.

In the case of a BestEffort or Burstable pod, the Static Memory Manager policy sends back the default topology hint, as there is no request for guaranteed memory, and does not reserve the memory in the internal NodeMap object.
Reserved memory flag
The Node Allocatable mechanism is commonly used by node administrators to reserve K8S node system resources for the kubelet or operating system processes in order to enhance the node stability. A dedicated set of flags can be used for this purpose to set the total amount of reserved memory for a node. This pre-configured value is subsequently utilized to calculate the real amount of node's "allocatable" memory available to pods.
The Kubernetes scheduler incorporates "allocatable" to optimise pod scheduling process.
The foregoing flags include --kube-reserved, --system-reserved and --eviction-threshold. The sum of their values will account for the total amount of reserved memory.

A new --reserved-memory flag was added to the Memory Manager to allow for this total reserved memory to be split (by a node administrator) and accordingly reserved across many NUMA nodes.
The flag specifies a comma-separated list of memory reservations of different memory types per NUMA node. Memory reservations across multiple NUMA nodes can be specified using semicolon as separator. This parameter is only useful in the context of the Memory Manager feature. The Memory Manager will not use this reserved memory for the allocation of container workloads.
For example, if you have a NUMA node "NUMA0" with 10Gi of memory available, and the --reserved-memory was specified to reserve 1Gi of memory at "NUMA0", the Memory Manager assumes that only 9Gi is available for containers.
You can omit this parameter; however, you should be aware that the quantity of reserved memory from all NUMA nodes should be equal to the quantity of memory specified by the Node Allocatable feature. If at least one node allocatable parameter is non-zero, you will need to specify --reserved-memory for at least one NUMA node. In fact, the eviction-hard threshold value is equal to 100Mi by default, so if the Static policy is used, --reserved-memory is obligatory.
Also, avoid the following configurations:
- duplicates, i.e. the same NUMA node or memory type, but with a different value;
- setting a zero limit for any memory type;
- NUMA node IDs that do not exist in the machine hardware;
- memory type names other than memory or hugepages-<size> (hugepages of that particular <size> should also exist).
Syntax:
--reserved-memory N:memory-type1=value1,memory-type2=value2,...
- N (integer) - NUMA node index, e.g. 0
- memory-type (string) - represents memory type:
  - memory - conventional memory
  - hugepages-2Mi or hugepages-1Gi - hugepages
- value (string) - the quantity of reserved memory, e.g. 1Gi
Example usage:
--reserved-memory 0:memory=1Gi,hugepages-1Gi=2Gi
or
--reserved-memory 0:memory=1Gi --reserved-memory 1:memory=2Gi
or
--reserved-memory '0:memory=1Gi;1:memory=2Gi'
When you specify values for the --reserved-memory flag, you must comply with the settings that you provided previously via the Node Allocatable feature flags. That is, the following rule must be obeyed for each memory type:
sum(reserved-memory(i)) = kube-reserved + system-reserved + eviction-threshold,
where i is the index of a NUMA node.
If you do not follow the formula above, the Memory Manager will show an error on startup.
In other words, the example above illustrates that for the conventional memory (type=memory), we reserve 3Gi in total, i.e.:
sum(reserved-memory(i)) = reserved-memory(0) + reserved-memory(1) = 1Gi + 2Gi = 3Gi
An example of kubelet command-line arguments relevant to the node Allocatable configuration:
--kube-reserved=cpu=500m,memory=50Mi
--system-reserved=cpu=123m,memory=333Mi
--eviction-hard=memory.available<500Mi
Note that the default hard eviction threshold is 100MiB, not zero; remember to increase the quantity of memory that you reserve via --reserved-memory by that hard eviction threshold. Otherwise, the kubelet will not start the Memory Manager and will display an error.
Here is an example of a correct configuration:
--feature-gates=MemoryManager=true
--kube-reserved=cpu=4,memory=4Gi
--system-reserved=cpu=1,memory=1Gi
--memory-manager-policy=Static
--reserved-memory '0:memory=3Gi;1:memory=2148Mi'
Let us validate the configuration above:
kube-reserved + system-reserved + eviction-hard(default) = reserved-memory(0) + reserved-memory(1)
4GiB + 1GiB + 100MiB = 3GiB + 2148MiB
5120MiB + 100MiB = 3072MiB + 2148MiB
5220MiB = 5220MiB
(which is correct)
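If you manage kubelet settings through a configuration file instead of command-line flags, the same setup can be expressed there. The following is a minimal sketch assuming the KubeletConfiguration field names of this API version (verify against your kubelet version):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  MemoryManager: true
memoryManagerPolicy: Static
kubeReserved:
  cpu: "4"
  memory: 4Gi
systemReserved:
  cpu: "1"
  memory: 1Gi
reservedMemory:
- numaNode: 0
  limits:
    memory: 3Gi
- numaNode: 1
  limits:
    memory: 2148Mi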
Placing a Pod in the Guaranteed QoS class
If the selected policy is anything other than None, the Memory Manager identifies pods that are in the Guaranteed QoS class. The Memory Manager provides specific topology hints to the Topology Manager for each Guaranteed pod. For pods in a QoS class other than Guaranteed, the Memory Manager provides default topology hints to the Topology Manager.
The following excerpts from pod manifests assign a pod to the Guaranteed QoS class.
A pod with integer CPU(s) runs in the Guaranteed QoS class when requests are equal to limits:
spec:
containers:
- name: nginx
image: nginx
resources:
limits:
memory: "200Mi"
cpu: "2"
example.com/device: "1"
requests:
memory: "200Mi"
cpu: "2"
example.com/device: "1"
Also, a pod sharing CPU(s) runs in the Guaranteed QoS class when requests are equal to limits:
spec:
containers:
- name: nginx
image: nginx
resources:
limits:
memory: "200Mi"
cpu: "300m"
example.com/device: "1"
requests:
memory: "200Mi"
cpu: "300m"
example.com/device: "1"
Notice that both CPU and memory requests must be specified for a Pod to be assigned to the Guaranteed QoS class.
Troubleshooting
The following means can be used to troubleshoot why a pod could not be deployed or was rejected at a node:
- pod status - indicates topology affinity errors
- system logs - include valuable information for debugging, e.g., about generated hints
- state file - the dump of internal state of the Memory Manager (includes Node Map and Memory Maps)
- starting from v1.22, the device plugin resource API can be used to retrieve information about the memory reserved for containers
Pod status (TopologyAffinityError)
This error typically occurs in the following situations:
- a node does not have enough resources available to satisfy the pod's request
- the pod's request is rejected due to particular Topology Manager policy constraints
The error appears in the status of a pod:
kubectl get pods
NAME READY STATUS RESTARTS AGE
guaranteed 0/1 TopologyAffinityError 0 113s
Use kubectl describe pod <id> or kubectl get events to obtain a detailed error message:
Warning TopologyAffinityError 10m kubelet, dell8 Resources cannot be allocated with Topology locality
System logs
Search system logs with respect to a particular pod.
The set of hints that Memory Manager generated for the pod can be found in the logs. Also, the set of hints generated by CPU Manager should be present in the logs.
Topology Manager merges these hints to calculate a single best hint. The best hint should be also present in the logs.
The best hint indicates where to allocate all the resources. Topology Manager tests this hint against its current policy, and based on the verdict, it either admits the pod to the node or rejects it.
Also, search the logs for occurrences associated with the Memory Manager, e.g. to find out information about cgroups and cpuset.mems updates.
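For example, if the kubelet on the node runs as a systemd service and logs to journald (an assumption; adjust to your logging setup), you might filter its logs along these lines:
journalctl -u kubelet | grep -iE 'memorymanager|topologymanager|hint'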
Examine the memory manager state on a node
Let us first deploy a sample Guaranteed pod whose specification is as follows:
apiVersion: v1
kind: Pod
metadata:
name: guaranteed
spec:
containers:
- name: guaranteed
image: consumer
imagePullPolicy: Never
resources:
limits:
cpu: "2"
memory: 150Gi
requests:
cpu: "2"
memory: 150Gi
command: ["sleep","infinity"]
Next, let us log into the node where it was deployed and examine the state file in /var/lib/kubelet/memory_manager_state:
{
"policyName":"Static",
"machineState":{
"0":{
"numberOfAssignments":1,
"memoryMap":{
"hugepages-1Gi":{
"total":0,
"systemReserved":0,
"allocatable":0,
"reserved":0,
"free":0
},
"memory":{
"total":134987354112,
"systemReserved":3221225472,
"allocatable":131766128640,
"reserved":131766128640,
"free":0
}
},
"nodes":[
0,
1
]
},
"1":{
"numberOfAssignments":1,
"memoryMap":{
"hugepages-1Gi":{
"total":0,
"systemReserved":0,
"allocatable":0,
"reserved":0,
"free":0
},
"memory":{
"total":135286722560,
"systemReserved":2252341248,
"allocatable":133034381312,
"reserved":29295144960,
"free":103739236352
}
},
"nodes":[
0,
1
]
}
},
"entries":{
"fa9bdd38-6df9-4cf9-aa67-8c4814da37a8":{
"guaranteed":[
{
"numaAffinity":[
0,
1
],
"type":"memory",
"size":161061273600
}
]
}
},
"checksum":4142013182
}
It can be deduced from the state file that the pod was pinned to both NUMA nodes, i.e.:
"numaAffinity":[
0,
1
],
The term "pinned" means that the pod's memory consumption is constrained (through the cgroups configuration) to these NUMA nodes.
This automatically implies that the Memory Manager instantiated a new group comprising these two NUMA nodes, i.e. the NUMA nodes with indices 0 and 1.
Notice that the management of groups is handled in a relatively complex manner; further elaboration is provided in the Memory Manager KEP (see the sections linked under What's next).
In order to analyse the memory resources available in a group, the corresponding entries from the NUMA nodes belonging to that group must be added up.
For example, the total amount of free "conventional" memory in the group can be computed by adding up the free memory available at every NUMA node in the group, i.e., in the "memory" section of NUMA node 0 ("free":0) and NUMA node 1 ("free":103739236352). So, the total amount of free "conventional" memory in this group is equal to 0 + 103739236352 bytes.
The line "systemReserved":3221225472 indicates that the administrator of this node reserved 3221225472 bytes (i.e. 3Gi) to serve the kubelet and system processes at NUMA node 0, by using the --reserved-memory flag.
Device plugin resource API
By employing the device plugin resource API, the information about memory reserved for each container can be retrieved; it is contained in the protobuf ContainerMemory message. This information can be retrieved only for pods in the Guaranteed QoS class.
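As a sketch of querying that API from the node with grpcurl (assuming grpcurl is installed and that you have the api.proto file for the kubelet pod-resources v1 API available, since the endpoint does not serve gRPC reflection):
grpcurl -plaintext -unix \
  -proto api.proto \
  /var/lib/kubelet/pod-resources/kubelet.sock \
  v1.PodResourcesLister/List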
What's next
- Memory Manager KEP: Design Overview
- Memory Manager KEP: Memory Maps at start-up (with examples)
- Memory Manager KEP: Memory Maps at runtime (with examples)
- Memory Manager KEP: Simulation - how the Memory Manager works? (by examples)
- Memory Manager KEP: The Concept of Node Map and Memory Maps
- Memory Manager KEP: How to enable the guaranteed memory allocation over many NUMA nodes?
4.3 - Configure Pods and Containers
4.3.1 - Assign Memory Resources to Containers and Pods
This page shows how to assign a memory request and a memory limit to a Container. A Container is guaranteed to have as much memory as it requests, but is not allowed to use more memory than its limit.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
To check the version, enter kubectl version.
Each node in your cluster must have at least 300 MiB of memory.
A few of the steps on this page require you to run the metrics-server service in your cluster. If you have the metrics-server running, you can skip those steps.
If you are running Minikube, run the following command to enable the metrics-server:
minikube addons enable metrics-server
To see whether the metrics-server is running, or another provider of the resource metrics API (metrics.k8s.io), run the following command:
kubectl get apiservices
If the resource metrics API is available, the output includes a reference to metrics.k8s.io.
NAME
v1beta1.metrics.k8s.io
Create a namespace
Create a namespace so that the resources you create in this exercise are isolated from the rest of your cluster.
kubectl create namespace mem-example
Specify a memory request and a memory limit
To specify a memory request for a Container, include the resources:requests field in the Container's resource manifest. To specify a memory limit, include resources:limits.
In this exercise, you create a Pod that has one Container. The Container has a memory request of 100 MiB and a memory limit of 200 MiB. Here's the configuration file for the Pod:
apiVersion: v1
kind: Pod
metadata:
name: memory-demo
namespace: mem-example
spec:
containers:
- name: memory-demo-ctr
image: polinux/stress
resources:
requests:
memory: "100Mi"
limits:
memory: "200Mi"
command: ["stress"]
args: ["--vm", "1", "--vm-bytes", "150M", "--vm-hang", "1"]
The args section in the configuration file provides arguments for the Container when it starts. The "--vm-bytes", "150M" arguments tell the Container to attempt to allocate 150 MiB of memory.
Create the Pod:
kubectl apply -f https://k8s.io/examples/pods/resource/memory-request-limit.yaml --namespace=mem-example
Verify that the Pod Container is running:
kubectl get pod memory-demo --namespace=mem-example
View detailed information about the Pod:
kubectl get pod memory-demo --output=yaml --namespace=mem-example
The output shows that the one Container in the Pod has a memory request of 100 MiB and a memory limit of 200 MiB.
...
resources:
requests:
memory: 100Mi
limits:
memory: 200Mi
...
Run kubectl top to fetch the metrics for the pod:
kubectl top pod memory-demo --namespace=mem-example
The output shows that the Pod is using about 162,900,000 bytes of memory, which is about 150 MiB. This is greater than the Pod's 100 MiB request, but within the Pod's 200 MiB limit.
NAME CPU(cores) MEMORY(bytes)
memory-demo <something> 162856960
Delete your Pod:
kubectl delete pod memory-demo --namespace=mem-example
Exceed a Container's memory limit
A Container can exceed its memory request if the Node has memory available. But a Container is not allowed to use more than its memory limit. If a Container allocates more memory than its limit, the Container becomes a candidate for termination. If the Container continues to consume memory beyond its limit, the Container is terminated. If a terminated Container can be restarted, the kubelet restarts it, as with any other type of runtime failure.
In this exercise, you create a Pod that attempts to allocate more memory than its limit. Here is the configuration file for a Pod that has one Container with a memory request of 50 MiB and a memory limit of 100 MiB:
apiVersion: v1
kind: Pod
metadata:
name: memory-demo-2
namespace: mem-example
spec:
containers:
- name: memory-demo-2-ctr
image: polinux/stress
resources:
requests:
memory: "50Mi"
limits:
memory: "100Mi"
command: ["stress"]
args: ["--vm", "1", "--vm-bytes", "250M", "--vm-hang", "1"]
In the args section of the configuration file, you can see that the Container will attempt to allocate 250 MiB of memory, which is well above the 100 MiB limit.
Create the Pod:
kubectl apply -f https://k8s.io/examples/pods/resource/memory-request-limit-2.yaml --namespace=mem-example
View detailed information about the Pod:
kubectl get pod memory-demo-2 --namespace=mem-example
At this point, the Container might be running or killed. Repeat the preceding command until the Container is killed:
NAME READY STATUS RESTARTS AGE
memory-demo-2 0/1 OOMKilled 1 24s
Get a more detailed view of the Container status:
kubectl get pod memory-demo-2 --output=yaml --namespace=mem-example
The output shows that the Container was killed because it is out of memory (OOM):
lastState:
terminated:
containerID: 65183c1877aaec2e8427bc95609cc52677a454b56fcb24340dbd22917c23b10f
exitCode: 137
finishedAt: 2017-06-20T20:52:19Z
reason: OOMKilled
startedAt: null
The Container in this exercise can be restarted, so the kubelet restarts it. Repeat this command several times to see that the Container is repeatedly killed and restarted:
kubectl get pod memory-demo-2 --namespace=mem-example
The output shows that the Container is killed, restarted, killed again, restarted again, and so on:
kubectl get pod memory-demo-2 --namespace=mem-example
NAME READY STATUS RESTARTS AGE
memory-demo-2 0/1 OOMKilled 1 37s
kubectl get pod memory-demo-2 --namespace=mem-example
NAME READY STATUS RESTARTS AGE
memory-demo-2 1/1 Running 2 40s
View detailed information about the Pod history:
kubectl describe pod memory-demo-2 --namespace=mem-example
The output shows that the Container starts and fails repeatedly:
... Normal Created Created container with id 66a3a20aa7980e61be4922780bf9d24d1a1d8b7395c09861225b0eba1b1f8511
... Warning BackOff Back-off restarting failed container
View detailed information about your cluster's Nodes:
kubectl describe nodes
The output includes a record of the Container being killed because of an out-of-memory condition:
Warning OOMKilling Memory cgroup out of memory: Kill process 4481 (stress) score 1994 or sacrifice child
Delete your Pod:
kubectl delete pod memory-demo-2 --namespace=mem-example
Specify a memory request that is too big for your Nodes
Memory requests and limits are associated with Containers, but it is useful to think of a Pod as having a memory request and limit. The memory request for the Pod is the sum of the memory requests for all the Containers in the Pod. Likewise, the memory limit for the Pod is the sum of the limits of all the Containers in the Pod.
Pod scheduling is based on requests. A Pod is scheduled to run on a Node only if the Node has enough available memory to satisfy the Pod's memory request.
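As an illustrative sketch (the Pod and container names are hypothetical), the following Pod has an effective memory request of 300 MiB for scheduling purposes, the sum of its two containers' requests:
apiVersion: v1
kind: Pod
metadata:
  name: sum-demo            # hypothetical name
spec:
  containers:
  - name: ctr-1
    image: nginx
    resources:
      requests:
        memory: "100Mi"     # contributes 100Mi to the Pod's request
  - name: ctr-2
    image: nginx
    resources:
      requests:
        memory: "200Mi"     # contributes 200Mi to the Pod's request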
In this exercise, you create a Pod that has a memory request so big that it exceeds the capacity of any Node in your cluster. Here is the configuration file for a Pod that has one Container with a request for 1000 GiB of memory, which likely exceeds the capacity of any Node in your cluster.
apiVersion: v1
kind: Pod
metadata:
name: memory-demo-3
namespace: mem-example
spec:
containers:
- name: memory-demo-3-ctr
image: polinux/stress
resources:
requests:
memory: "1000Gi"
limits:
memory: "1000Gi"
command: ["stress"]
args: ["--vm", "1", "--vm-bytes", "150M", "--vm-hang", "1"]
Create the Pod:
kubectl apply -f https://k8s.io/examples/pods/resource/memory-request-limit-3.yaml --namespace=mem-example
View the Pod status:
kubectl get pod memory-demo-3 --namespace=mem-example
The output shows that the Pod status is Pending. That is, the Pod is not scheduled to run on any Node, and it will remain in the Pending state indefinitely:
kubectl get pod memory-demo-3 --namespace=mem-example
NAME READY STATUS RESTARTS AGE
memory-demo-3 0/1 Pending 0 25s
View detailed information about the Pod, including events:
kubectl describe pod memory-demo-3 --namespace=mem-example
The output shows that the Container cannot be scheduled because of insufficient memory on the Nodes:
Events:
... Reason Message
------ -------
... FailedScheduling No nodes are available that match all of the following predicates:: Insufficient memory (3).
Memory units
The memory resource is measured in bytes. You can express memory as a plain integer or a fixed-point integer with one of these suffixes: E, P, T, G, M, K, Ei, Pi, Ti, Gi, Mi, Ki. For example, the following represent approximately the same value:
128974848, 129e6, 129M, 123Mi
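To see the approximation: 123Mi = 123 × 2^20 bytes = 128974848 bytes, while 129M = 129 × 10^6 bytes = 129000000 bytes (and 129e6 is the same quantity in scientific notation).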
Delete your Pod:
kubectl delete pod memory-demo-3 --namespace=mem-example
If you do not specify a memory limit
If you do not specify a memory limit for a Container, one of the following situations applies:
- The Container has no upper bound on the amount of memory it uses. The Container could use all of the memory available on the Node where it is running, which in turn could invoke the OOM Killer. Further, in case of an OOM kill, a container with no resource limits has a greater chance of being killed.
- The Container is running in a namespace that has a default memory limit, and the Container is automatically assigned the default limit. Cluster administrators can use a LimitRange to specify a default value for the memory limit, as shown in the sketch below.
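For reference, here is a minimal LimitRange sketch that a cluster administrator might apply to set such defaults (the name and values are illustrative):
apiVersion: v1
kind: LimitRange
metadata:
  name: mem-limit-range     # illustrative name
  namespace: mem-example
spec:
  limits:
  - default:
      memory: 512Mi         # default limit for containers that specify none
    defaultRequest:
      memory: 256Mi         # default request for containers that specify none
    type: Container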
Motivation for memory requests and limits
By configuring memory requests and limits for the Containers that run in your cluster, you can make efficient use of the memory resources available on your cluster's Nodes. By keeping a Pod's memory request low, you give the Pod a good chance of being scheduled. By having a memory limit that is greater than the memory request, you accomplish two things:
- The Pod can have bursts of activity where it makes use of memory that happens to be available.
- The amount of memory a Pod can use during a burst is limited to some reasonable amount.
Clean up
Delete your namespace. This deletes all the Pods that you created for this task:
kubectl delete namespace mem-example
What's next
For app developers
For cluster administrators
4.3.2 - Assign CPU Resources to Containers and Pods
This page shows how to assign a CPU request and a CPU limit to a container. Containers cannot use more CPU than the configured limit. Provided the system has CPU time free, a container is guaranteed to be allocated as much CPU as it requests.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
To check the version, enter kubectl version.
Your cluster must have at least 1 CPU available for use to run the task examples.
A few of the steps on this page require you to run the metrics-server service in your cluster. If you have the metrics-server running, you can skip those steps.
If you are running Minikube, run the following command to enable metrics-server:
minikube addons enable metrics-server
To see whether metrics-server (or another provider of the resource metrics API, metrics.k8s.io) is running, type the following command:
kubectl get apiservices
If the resource metrics API is available, the output will include a reference to metrics.k8s.io.
NAME
v1beta1.metrics.k8s.io
Create a namespace
Create a Namespace so that the resources you create in this exercise are isolated from the rest of your cluster.
kubectl create namespace cpu-example
Specify a CPU request and a CPU limit
To specify a CPU request for a container, include the resources:requests field in the Container resource manifest. To specify a CPU limit, include resources:limits.
In this exercise, you create a Pod that has one container. The container has a request of 0.5 CPU and a limit of 1 CPU. Here is the configuration file for the Pod:
apiVersion: v1
kind: Pod
metadata:
name: cpu-demo
namespace: cpu-example
spec:
containers:
- name: cpu-demo-ctr
image: vish/stress
resources:
limits:
cpu: "1"
requests:
cpu: "0.5"
args:
- -cpus
- "2"
The args section of the configuration file provides arguments for the container when it starts. The -cpus "2" argument tells the Container to attempt to use 2 CPUs.
Create the Pod:
kubectl apply -f https://k8s.io/examples/pods/resource/cpu-request-limit.yaml --namespace=cpu-example
Verify that the Pod is running:
kubectl get pod cpu-demo --namespace=cpu-example
View detailed information about the Pod:
kubectl get pod cpu-demo --output=yaml --namespace=cpu-example
The output shows that the one container in the Pod has a CPU request of 500 milliCPU and a CPU limit of 1 CPU.
resources:
limits:
cpu: "1"
requests:
cpu: 500m
Use kubectl top to fetch the metrics for the pod:
kubectl top pod cpu-demo --namespace=cpu-example
This example output shows that the Pod is using 974 milliCPU, which is slightly less than the limit of 1 CPU specified in the Pod configuration.
NAME CPU(cores) MEMORY(bytes)
cpu-demo 974m <something>
Recall that by setting -cpus "2", you configured the Container to attempt to use 2 CPUs, but the Container is only being allowed to use about 1 CPU. The container's CPU use is being throttled, because the container is attempting to use more CPU resources than its limit.
CPU units
The CPU resource is measured in CPU units. One CPU, in Kubernetes, is equivalent to:
- 1 AWS vCPU
- 1 GCP Core
- 1 Azure vCore
- 1 Hyperthread on a bare-metal Intel processor with Hyperthreading
Fractional values are allowed. A Container that requests 0.5 CPU is guaranteed half as much CPU as a Container that requests 1 CPU. You can use the suffix m to mean milli. For example 100m CPU, 100 milliCPU, and 0.1 CPU are all the same. Precision finer than 1m is not allowed.
CPU is always requested as an absolute quantity, never as a relative quantity; 0.1 is the same amount of CPU on a single-core, dual-core, or 48-core machine.
Delete your Pod:
kubectl delete pod cpu-demo --namespace=cpu-example
Specify a CPU request that is too big for your Nodes
CPU requests and limits are associated with Containers, but it is useful to think of a Pod as having a CPU request and limit. The CPU request for a Pod is the sum of the CPU requests for all the Containers in the Pod. Likewise, the CPU limit for a Pod is the sum of the CPU limits for all the Containers in the Pod.
Pod scheduling is based on requests. A Pod is scheduled to run on a Node only if the Node has enough CPU resources available to satisfy the Pod CPU request.
In this exercise, you create a Pod that has a CPU request so big that it exceeds the capacity of any Node in your cluster. Here is the configuration file for a Pod that has one Container. The Container requests 100 CPU, which is likely to exceed the capacity of any Node in your cluster.
apiVersion: v1
kind: Pod
metadata:
name: cpu-demo-2
namespace: cpu-example
spec:
containers:
- name: cpu-demo-ctr-2
image: vish/stress
resources:
limits:
cpu: "100"
requests:
cpu: "100"
args:
- -cpus
- "2"
Create the Pod:
kubectl apply -f https://k8s.io/examples/pods/resource/cpu-request-limit-2.yaml --namespace=cpu-example
View the Pod status:
kubectl get pod cpu-demo-2 --namespace=cpu-example
The output shows that the Pod status is Pending. That is, the Pod has not been scheduled to run on any Node, and it will remain in the Pending state indefinitely:
NAME READY STATUS RESTARTS AGE
cpu-demo-2 0/1 Pending 0 7m
View detailed information about the Pod, including events:
kubectl describe pod cpu-demo-2 --namespace=cpu-example
The output shows that the Container cannot be scheduled because of insufficient CPU resources on the Nodes:
Events:
Reason Message
------ -------
FailedScheduling No nodes are available that match all of the following predicates:: Insufficient cpu (3).
Delete your Pod:
kubectl delete pod cpu-demo-2 --namespace=cpu-example
If you do not specify a CPU limit
If you do not specify a CPU limit for a Container, then one of these situations applies:
- The Container has no upper bound on the CPU resources it can use. The Container could use all of the CPU resources available on the Node where it is running.
- The Container is running in a namespace that has a default CPU limit, and the Container is automatically assigned the default limit. Cluster administrators can use a LimitRange to specify a default value for the CPU limit.
If you specify a CPU limit but do not specify a CPU request
If you specify a CPU limit for a Container but do not specify a CPU request, Kubernetes automatically assigns a CPU request that matches the limit. Similarly, if a Container specifies its own memory limit, but does not specify a memory request, Kubernetes automatically assigns a memory request that matches the limit.
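For example, applying a container spec like the following (a minimal sketch) results in Kubernetes defaulting the CPU request to match the limit:
resources:
  limits:
    cpu: "500m"
# no requests section: Kubernetes sets requests.cpu to 500m automatically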
Motivation for CPU requests and limits
By configuring the CPU requests and limits of the Containers that run in your cluster, you can make efficient use of the CPU resources available on your cluster Nodes. By keeping a Pod CPU request low, you give the Pod a good chance of being scheduled. By having a CPU limit that is greater than the CPU request, you accomplish two things:
- The Pod can have bursts of activity where it makes use of CPU resources that happen to be available.
- The amount of CPU resources a Pod can use during a burst is limited to some reasonable amount.
Clean up
Delete your namespace:
kubectl delete namespace cpu-example
What's next
For app developers
For cluster administrators
4.3.3 - Configure GMSA for Windows Pods and containers
Kubernetes v1.18 [stable]
This page shows how to configure Group Managed Service Accounts (GMSA) for Pods and containers that will run on Windows nodes. Group Managed Service Accounts are a specific type of Active Directory account that provides automatic password management, simplified service principal name (SPN) management, and the ability to delegate the management to other administrators across multiple servers.
In Kubernetes, GMSA credential specs are configured at a Kubernetes cluster-wide scope as Custom Resources. Windows Pods, as well as individual containers within a Pod, can be configured to use a GMSA for domain based functions (e.g. Kerberos authentication) when interacting with other Windows services.
Before you begin
You need to have a Kubernetes cluster and the kubectl command-line tool must be configured to communicate with your cluster. The cluster is expected to have Windows worker nodes. This section covers a set of initial steps required once for each cluster:
Install the GMSACredentialSpec CRD
A CustomResourceDefinition (CRD) for GMSA credential spec resources needs to be configured on the cluster to define the custom resource type GMSACredentialSpec. Download the GMSA CRD YAML and save it as gmsa-crd.yaml. Next, install the CRD with kubectl apply -f gmsa-crd.yaml
Install webhooks to validate GMSA users
Two webhooks need to be configured on the Kubernetes cluster to populate and validate GMSA credential spec references at the Pod or container level:
- A mutating webhook that expands references to GMSAs (by name from a Pod specification) into the full credential spec in JSON form within the Pod spec.
- A validating webhook that ensures all references to GMSAs are authorized to be used by the Pod service account.
Installing the above webhooks and associated objects requires the steps below:
- Create a certificate key pair (that will be used to allow the webhook container to communicate to the cluster)
- Install a secret with the certificate from above.
- Create a deployment for the core webhook logic.
- Create the validating and mutating webhook configurations referring to the deployment.
A script can be used to deploy and configure the GMSA webhooks and associated objects mentioned above. The script can be run with a --dry-run=server option to allow you to review the changes that would be made to your cluster.
The YAML template used by the script may also be used to deploy the webhooks and associated objects manually (with appropriate substitutions for the parameters)
Configure GMSAs and Windows nodes in Active Directory
Before Pods in Kubernetes can be configured to use GMSAs, the desired GMSAs need to be provisioned in Active Directory as described in the Windows GMSA documentation. Windows worker nodes (that are part of the Kubernetes cluster) need to be configured in Active Directory to access the secret credentials associated with the desired GMSA, as described in the Windows GMSA documentation.
Create GMSA credential spec resources
With the GMSACredentialSpec CRD installed (as described earlier), custom resources containing GMSA credential specs can be configured. The GMSA credential spec does not contain secret or sensitive data. It is information that a container runtime can use to describe the desired GMSA of a container to Windows. GMSA credential specs can be generated in YAML format with a utility PowerShell script.
Following are the steps for generating a GMSA credential spec YAML manually in JSON format and then converting it:
- Import the CredentialSpec module: ipmo CredentialSpec.psm1
- Create a credential spec in JSON format using New-CredentialSpec. To create a GMSA credential spec named WebApp1, invoke New-CredentialSpec -Name WebApp1 -AccountName WebApp1 -Domain $(Get-ADDomain -Current LocalComputer)
- Use Get-CredentialSpec to show the path of the JSON file.
- Convert the credspec file from JSON to YAML format and apply the necessary header fields apiVersion, kind, metadata and credspec to make it a GMSACredentialSpec custom resource that can be configured in Kubernetes.
The following YAML configuration describes a GMSA credential spec named gmsa-WebApp1
:
apiVersion: windows.k8s.io/v1
kind: GMSACredentialSpec
metadata:
name: gmsa-WebApp1 #This is an arbitrary name but it will be used as a reference
credspec:
ActiveDirectoryConfig:
GroupManagedServiceAccounts:
- Name: WebApp1 #Username of the GMSA account
Scope: CONTOSO #NETBIOS Domain Name
- Name: WebApp1 #Username of the GMSA account
Scope: contoso.com #DNS Domain Name
CmsPlugins:
- ActiveDirectory
DomainJoinConfig:
DnsName: contoso.com #DNS Domain Name
DnsTreeName: contoso.com #DNS Domain Name Root
Guid: 244818ae-87ac-4fcd-92ec-e79e5252348a #GUID
MachineAccountName: WebApp1 #Username of the GMSA account
NetBiosName: CONTOSO #NETBIOS Domain Name
Sid: S-1-5-21-2126449477-2524075714-3094792973 #SID of GMSA
The above credential spec resource may be saved as gmsa-Webapp1-credspec.yaml and applied to the cluster using: kubectl apply -f gmsa-Webapp1-credspec.yaml
Configure cluster role to enable RBAC on specific GMSA credential specs
A cluster role needs to be defined for each GMSA credential spec resource. This authorizes the use verb on a specific GMSA resource by a subject, which is typically a service account. The following example shows a cluster role that authorizes usage of the gmsa-WebApp1 credential spec from above. Save the file as gmsa-webapp1-role.yaml and apply it using kubectl apply -f gmsa-webapp1-role.yaml
#Create the Role to read the credspec
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: webapp1-role
rules:
- apiGroups: ["windows.k8s.io"]
resources: ["gmsacredentialspecs"]
verbs: ["use"]
resourceNames: ["gmsa-WebApp1"]
Assign role to service accounts to use specific GMSA credspecs
A service account (that Pods will be configured with) needs to be bound to the cluster role created above. This authorizes the service account to use the desired GMSA credential spec resource. The following shows the default service account being bound to the cluster role webapp1-role to use the gmsa-WebApp1 credential spec resource created above.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: allow-default-svc-account-read-on-gmsa-WebApp1
namespace: default
subjects:
- kind: ServiceAccount
name: default
namespace: default
roleRef:
kind: ClusterRole
name: webapp1-role
apiGroup: rbac.authorization.k8s.io
Configure GMSA credential spec reference in Pod spec
The Pod spec field securityContext.windowsOptions.gmsaCredentialSpecName is used to specify references to desired GMSA credential spec custom resources in Pod specs. This configures all containers in the Pod spec to use the specified GMSA. A sample Pod spec with the field populated to refer to gmsa-WebApp1:
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
run: with-creds
name: with-creds
namespace: default
spec:
replicas: 1
selector:
matchLabels:
run: with-creds
template:
metadata:
labels:
run: with-creds
spec:
securityContext:
windowsOptions:
gmsaCredentialSpecName: gmsa-WebApp1
containers:
- image: mcr.microsoft.com/windows/servercore/iis:windowsservercore-ltsc2019
imagePullPolicy: Always
name: iis
nodeSelector:
kubernetes.io/os: windows
Individual containers in a Pod spec can also specify the desired GMSA credspec using a per-container securityContext.windowsOptions.gmsaCredentialSpecName
field. For example:
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
run: with-creds
name: with-creds
namespace: default
spec:
replicas: 1
selector:
matchLabels:
run: with-creds
template:
metadata:
labels:
run: with-creds
spec:
containers:
- image: mcr.microsoft.com/windows/servercore/iis:windowsservercore-ltsc2019
imagePullPolicy: Always
name: iis
securityContext:
windowsOptions:
gmsaCredentialSpecName: gmsa-WebApp1
nodeSelector:
kubernetes.io/os: windows
As Pod specs with GMSA fields populated (as described above) are applied in a cluster, the following sequence of events takes place:
- The mutating webhook resolves and expands all references to GMSA credential spec resources to the contents of the GMSA credential spec.
- The validating webhook ensures the service account associated with the Pod is authorized for the use verb on the specified GMSA credential spec.
- The container runtime configures each Windows container with the specified GMSA credential spec so that the container can assume the identity of the GMSA in Active Directory and access services in the domain using that identity.
Authenticating to network shares using hostname or FQDN
If you are experiencing issues connecting to SMB shares from Pods using the hostname or FQDN, but are able to access the shares via their IPv4 address, make sure the following registry key is set on the Windows nodes:
reg add "HKLM\SYSTEM\CurrentControlSet\Services\hns\State" /v EnableCompartmentNamespace /t REG_DWORD /d 1
Running Pods will then need to be recreated to pick up the behavior changes. More information on how this registry key is used can be found here.
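For example, if the Pods are managed by the Deployment shown earlier on this page, one way to recreate them is a rollout restart (a sketch; adjust the Deployment name and namespace to yours):
kubectl rollout restart deployment/with-creds --namespace default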
Troubleshooting
If you are having difficulties getting GMSA to work in your environment, there are a few troubleshooting steps you can take.
First, make sure the credspec has been passed to the Pod. To do this you will need to exec into one of your Pods and check the output of the nltest.exe /parentdomain command.
In the example below the Pod did not get the credspec correctly:
kubectl exec -it iis-auth-7776966999-n5nzr -- powershell.exe
nltest.exe /parentdomain
results in the following error:
Getting parent domain failed: Status = 1722 0x6ba RPC_S_SERVER_UNAVAILABLE
If your Pod did get the credspec correctly, then next check communication with the domain. First, from inside of your Pod, quickly do an nslookup to find the root of your domain.
This will tell us 3 things:
- The Pod can reach the DC
- The DC can reach the Pod
- DNS is working correctly.
If the DNS and communication test passes, next you will need to check if the Pod has established secure channel communication with the domain. To do this, again, exec into your Pod and run the nltest.exe /query command.
nltest.exe /query
Results in the following output:
I_NetLogonControl failed: Status = 1722 0x6ba RPC_S_SERVER_UNAVAILABLE
This tells us that for some reason, the Pod was unable to logon to the domain using the account specified in the credspec. You can try to repair the secure channel by running the following:
nltest /sc_reset:domain.example
If the command is successful you will see an output similar to this:
Flags: 30 HAS_IP HAS_TIMESERV
Trusted DC Name \\dc10.domain.example
Trusted DC Connection Status Status = 0 0x0 NERR_Success
The command completed successfully
If the above corrects the error, you can automate the step by adding the following lifecycle hook to your Pod spec. If it did not correct the error, you will need to examine your credspec again and confirm that it is correct and complete.
image: registry.domain.example/iis-auth:1809v1
lifecycle:
postStart:
exec:
command: ["powershell.exe","-command","do { Restart-Service -Name netlogon } while ( $($Result = (nltest.exe /query); if ($Result -like '*0x0 NERR_Success*') {return $true} else {return $false}) -eq $false)"]
imagePullPolicy: IfNotPresent
If you add the lifecycle section shown above to your Pod spec, the Pod will execute the commands listed to restart the netlogon service until the nltest.exe /query command exits without error.
4.3.4 - Configure RunAsUserName for Windows pods and containers
Kubernetes v1.18 [stable]
This page shows how to use the runAsUserName setting for Pods and containers that will run on Windows nodes. This is roughly equivalent to the Linux-specific runAsUser setting, allowing you to run applications in a container as a different username than the default.
Before you begin
You need to have a Kubernetes cluster and the kubectl command-line tool must be configured to communicate with your cluster. The cluster is expected to have Windows worker nodes where pods with containers running Windows workloads will get scheduled.
Set the Username for a Pod
To specify the username with which to execute the Pod's container processes, include the securityContext field (PodSecurityContext) in the Pod specification, and within it, the windowsOptions (WindowsSecurityContextOptions) field containing the runAsUserName field.
The Windows security context options that you specify for a Pod apply to all Containers and init Containers in the Pod.
Here is a configuration file for a Windows Pod that has the runAsUserName field set:
apiVersion: v1
kind: Pod
metadata:
name: run-as-username-pod-demo
spec:
securityContext:
windowsOptions:
runAsUserName: "ContainerUser"
containers:
- name: run-as-username-demo
image: mcr.microsoft.com/windows/servercore:ltsc2019
command: ["ping", "-t", "localhost"]
nodeSelector:
kubernetes.io/os: windows
Create the Pod:
kubectl apply -f https://k8s.io/examples/windows/run-as-username-pod.yaml
Verify that the Pod's Container is running:
kubectl get pod run-as-username-pod-demo
Get a shell to the running Container:
kubectl exec -it run-as-username-pod-demo -- powershell
Check that the shell is running under the correct username:
echo $env:USERNAME
The output should be:
ContainerUser
Set the Username for a Container
To specify the username with which to execute a Container's processes, include the securityContext field (SecurityContext) in the Container manifest, and within it, the windowsOptions (WindowsSecurityContextOptions) field containing the runAsUserName field.
The Windows security context options that you specify for a Container apply only to that individual Container, and they override the settings made at the Pod level.
Here is the configuration file for a Pod that has one Container, and the runAsUserName field is set at the Pod level and the Container level:
apiVersion: v1
kind: Pod
metadata:
name: run-as-username-container-demo
spec:
securityContext:
windowsOptions:
runAsUserName: "ContainerUser"
containers:
- name: run-as-username-demo
image: mcr.microsoft.com/windows/servercore:ltsc2019
command: ["ping", "-t", "localhost"]
securityContext:
windowsOptions:
runAsUserName: "ContainerAdministrator"
nodeSelector:
kubernetes.io/os: windows
Create the Pod:
kubectl apply -f https://k8s.io/examples/windows/run-as-username-container.yaml
Verify that the Pod's Container is running:
kubectl get pod run-as-username-container-demo
Get a shell to the running Container:
kubectl exec -it run-as-username-container-demo -- powershell
Check that the shell is running under the correct username (the one set at the Container level):
echo $env:USERNAME
The output should be:
ContainerAdministrator
Windows Username limitations
In order to use this feature, the value set in the runAsUserName field must be a valid username. It must have the following format: DOMAIN\USER, where DOMAIN\ is optional. Windows user names are case insensitive. Additionally, there are some restrictions regarding the DOMAIN and USER:
- The runAsUserName field cannot be empty, and it cannot contain control characters (ASCII values: 0x00-0x1F, 0x7F).
- The DOMAIN must be either a NetBios name or a DNS name, each with their own restrictions:
  - NetBios names: maximum 15 characters, cannot start with . (dot), and cannot contain the following characters: \ / : * ? " < > |
  - DNS names: maximum 255 characters, contain only alphanumeric characters, dots, and dashes, and cannot start or end with a . (dot) or - (dash).
- The USER must have at most 20 characters, cannot contain only dots or spaces, and cannot contain the following characters: " / \ [ ] : ; | = , + * ? < > @
Examples of acceptable values for the runAsUserName field: ContainerAdministrator, ContainerUser, NT AUTHORITY\NETWORK SERVICE, NT AUTHORITY\LOCAL SERVICE.
For more information about these limitations, check here and here.
What's next
4.3.5 - Create a Windows HostProcess Pod
Kubernetes v1.23 [beta]
Windows HostProcess containers enable you to run containerized workloads on a Windows host. These containers operate as normal processes but have access to the host network namespace, storage, and devices when given the appropriate user privileges. HostProcess containers can be used to deploy network plugins, storage configurations, device plugins, kube-proxy, and other components to Windows nodes without the need for dedicated proxies or the direct installation of host services.
Administrative tasks such as installation of security patches, event log collection, and more can be performed without requiring cluster operators to log onto each Windows node. HostProcess containers can run as any user that is available on the host or is in the domain of the host machine, allowing administrators to restrict resource access through user permissions. While neither filesystem nor process isolation is supported, a new volume is created on the host upon starting the container to give it a clean and consolidated workspace. HostProcess containers can also be built on top of existing Windows base images and do not inherit the same compatibility requirements as Windows server containers, meaning that the version of the base images does not need to match that of the host. It is, however, recommended that you use the same base image version as your Windows Server container workloads to ensure you do not have any unused images taking up space on the node. HostProcess containers also support volume mounts within the container volume.
When should I use a Windows HostProcess container?
- When you need to perform tasks which require the networking namespace of the host. HostProcess containers have access to the host's network interfaces and IP addresses.
- You need access to resources on the host such as the filesystem, event logs, etc.
- Installation of specific device drivers or Windows services.
- Consolidation of administrative tasks and security policies. This reduces the degree of privileges needed by Windows nodes.
Before you begin
This task guide is specific to Kubernetes v1.23. If you are not running Kubernetes v1.23, check the documentation for that version of Kubernetes.
In Kubernetes 1.23, the HostProcess container feature is enabled by default. The kubelet will communicate with containerd directly by passing the hostprocess flag via CRI. You can use the latest version of containerd (v1.6+) to run HostProcess containers. How to install containerd.
To disable HostProcess containers you need to pass the following feature gate flag to the kubelet and kube-apiserver:
--feature-gates=WindowsHostProcessContainers=false
See Features Gates documentation for more details.
Limitations
These limitations are relevant for Kubernetes v1.23:
- HostProcess containers require containerd 1.6 or higher container runtime.
- HostProcess pods can only contain HostProcess containers. This is a current limitation of the Windows OS; non-privileged Windows containers cannot share a vNIC with the host IP namespace.
- HostProcess containers run as a process on the host and do not have any degree of isolation other than resource constraints imposed on the HostProcess user account. Neither filesystem nor Hyper-V isolation is supported for HostProcess containers.
- Volume mounts are supported and are mounted under the container volume. See Volume Mounts
- A limited set of host user accounts are available for HostProcess containers by default. See Choosing a User Account.
- Resource limits (disk, memory, cpu count) are supported in the same fashion as processes on the host.
- Named pipe mounts and Unix domain sockets are not supported and should instead be accessed via their path on the host (e.g. \\.\pipe\*)
HostProcess Pod configuration requirements
Enabling a Windows HostProcess pod requires setting the right configurations in the pod security configuration. Of the policies defined in the Pod Security Standards, HostProcess pods are disallowed by the baseline and restricted policies. It is therefore recommended that HostProcess pods run in alignment with the privileged profile.
When running under the privileged policy, here are the configurations which need to be set to enable the creation of a HostProcess pod:
Control | Policy |
---|---|
securityContext.windowsOptions.hostProcess | Windows pods offer the ability to run HostProcess containers which enables privileged access to the Windows node. Allowed values: true |
hostNetwork | Will be in host network by default initially. Support to set network to a different compartment may be desirable in the future. Allowed values: true |
securityContext.windowsOptions.runAsUserName | Specification of which user the HostProcess container should run as is required for the pod spec. Allowed values: NT AUTHORITY\SYSTEM, NT AUTHORITY\Local service, NT AUTHORITY\NetworkService |
runAsNonRoot | Because HostProcess containers have privileged access to the host, the runAsNonRoot field cannot be set to true. Allowed values: undefined/nil, false |
Example manifest (excerpt)
spec:
securityContext:
windowsOptions:
hostProcess: true
runAsUserName: "NT AUTHORITY\\Local service"
hostNetwork: true
containers:
- name: test
image: image1:latest
command:
- ping
- -t
- 127.0.0.1
nodeSelector:
"kubernetes.io/os": windows
Volume mounts
HostProcess containers support the ability to mount volumes within the container volume space.
Applications running inside the container can access volume mounts directly via relative or
absolute paths. An environment variable $CONTAINER_SANDBOX_MOUNT_POINT
is set upon container
creation and provides the absolute host path to the container volume. Relative paths are based
upon the .spec.containers.volumeMounts.mountPath
configuration.
Example
To access service account tokens the following path structures are supported within the container:
.\var\run\secrets\kubernetes.io\serviceaccount\
$CONTAINER_SANDBOX_MOUNT_POINT\var\run\secrets\kubernetes.io\serviceaccount\
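As a hedged sketch of how a container command might use this variable (the container name is a placeholder and the image is the placeholder from the example manifest above):
containers:
- name: read-token          # placeholder name
  image: image1:latest      # placeholder image
  command:
  - powershell.exe
  - -Command
  # Read the service account token via the sandbox mount point variable
  - 'Get-Content $env:CONTAINER_SANDBOX_MOUNT_POINT\var\run\secrets\kubernetes.io\serviceaccount\token'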
Resource limits
Resource limits (disk, memory, cpu count) are applied to the job and are job wide. For example, with a limit of 10MB set, the memory allocated for any HostProcess job object will be capped at 10MB. This is the same behavior as other Windows container types. These limits would be specified the same way they are currently for whatever orchestrator or runtime is being used. The only difference is in the disk resource usage calculation used for resource tracking due to the difference in how HostProcess containers are bootstrapped.
Choosing a user account
HostProcess containers support the ability to run as one of three supported Windows service accounts: LocalSystem, LocalService, and NetworkService.
You should select an appropriate Windows service account for each HostProcess container, aiming to limit the degree of privileges so as to avoid accidental (or even malicious) damage to the host. The LocalSystem service account has the highest level of privilege of the three and should be used only if absolutely necessary. Where possible, use the LocalService service account as it is the least privileged of the three options.
4.3.6 - Configure Quality of Service for Pods
This page shows how to configure Pods so that they will be assigned particular Quality of Service (QoS) classes. Kubernetes uses QoS classes to make decisions about scheduling and evicting Pods.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
To check the version, enter kubectl version.
QoS classes
When Kubernetes creates a Pod it assigns one of these QoS classes to the Pod:
- Guaranteed
- Burstable
- BestEffort
Create a namespace
Create a namespace so that the resources you create in this exercise are isolated from the rest of your cluster.
kubectl create namespace qos-example
Create a Pod that gets assigned a QoS class of Guaranteed
For a Pod to be given a QoS class of Guaranteed:
- Every Container in the Pod must have a memory limit and a memory request.
- For every Container in the Pod, the memory limit must equal the memory request.
- Every Container in the Pod must have a CPU limit and a CPU request.
- For every Container in the Pod, the CPU limit must equal the CPU request.
These restrictions apply to init containers and app containers equally.
Here is the configuration file for a Pod that has one Container. The Container has a memory limit and a memory request, both equal to 200 MiB. The Container has a CPU limit and a CPU request, both equal to 700 milliCPU:
apiVersion: v1
kind: Pod
metadata:
name: qos-demo
namespace: qos-example
spec:
containers:
- name: qos-demo-ctr
image: nginx
resources:
limits:
memory: "200Mi"
cpu: "700m"
requests:
memory: "200Mi"
cpu: "700m"
Create the Pod:
kubectl apply -f https://k8s.io/examples/pods/qos/qos-pod.yaml --namespace=qos-example
View detailed information about the Pod:
kubectl get pod qos-demo --namespace=qos-example --output=yaml
The output shows that Kubernetes gave the Pod a QoS class of Guaranteed. The output also verifies that the Pod Container has a memory request that matches its memory limit, and it has a CPU request that matches its CPU limit.
spec:
containers:
...
resources:
limits:
cpu: 700m
memory: 200Mi
requests:
cpu: 700m
memory: 200Mi
...
status:
qosClass: Guaranteed
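You can also extract only the QoS class rather than the full YAML; for example:
kubectl get pod qos-demo --namespace=qos-example --output=jsonpath='{.status.qosClass}'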
Delete your Pod:
kubectl delete pod qos-demo --namespace=qos-example
Create a Pod that gets assigned a QoS class of Burstable
A Pod is given a QoS class of Burstable if:
- The Pod does not meet the criteria for QoS class Guaranteed.
- At least one Container in the Pod has a memory or CPU request.
Here is the configuration file for a Pod that has one Container. The Container has a memory limit of 200 MiB and a memory request of 100 MiB.
apiVersion: v1
kind: Pod
metadata:
name: qos-demo-2
namespace: qos-example
spec:
containers:
- name: qos-demo-2-ctr
image: nginx
resources:
limits:
memory: "200Mi"
requests:
memory: "100Mi"
Create the Pod:
kubectl apply -f https://k8s.io/examples/pods/qos/qos-pod-2.yaml --namespace=qos-example
View detailed information about the Pod:
kubectl get pod qos-demo-2 --namespace=qos-example --output=yaml
The output shows that Kubernetes gave the Pod a QoS class of Burstable.
spec:
containers:
- image: nginx
imagePullPolicy: Always
name: qos-demo-2-ctr
resources:
limits:
memory: 200Mi
requests:
memory: 100Mi
...
status:
qosClass: Burstable
Delete your Pod:
kubectl delete pod qos-demo-2 --namespace=qos-example
Create a Pod that gets assigned a QoS class of BestEffort
For a Pod to be given a QoS class of BestEffort, the Containers in the Pod must not have any memory or CPU limits or requests.
Here is the configuration file for a Pod that has one Container. The Container has no memory or CPU limits or requests:
apiVersion: v1
kind: Pod
metadata:
name: qos-demo-3
namespace: qos-example
spec:
containers:
- name: qos-demo-3-ctr
image: nginx
Create the Pod:
kubectl apply -f https://k8s.io/examples/pods/qos/qos-pod-3.yaml --namespace=qos-example
View detailed information about the Pod:
kubectl get pod qos-demo-3 --namespace=qos-example --output=yaml
The output shows that Kubernetes gave the Pod a QoS class of BestEffort.
spec:
containers:
...
resources: {}
...
status:
qosClass: BestEffort
Delete your Pod:
kubectl delete pod qos-demo-3 --namespace=qos-example
Create a Pod that has two Containers
Here is the configuration file for a Pod that has two Containers. One container specifies a memory request of 200 MiB. The other Container does not specify any requests or limits.
apiVersion: v1
kind: Pod
metadata:
name: qos-demo-4
namespace: qos-example
spec:
containers:
- name: qos-demo-4-ctr-1
image: nginx
resources:
requests:
memory: "200Mi"
- name: qos-demo-4-ctr-2
image: redis
Notice that this Pod meets the criteria for QoS class Burstable. That is, it does not meet the criteria for QoS class Guaranteed, and one of its Containers has a memory request.
Create the Pod:
kubectl apply -f https://k8s.io/examples/pods/qos/qos-pod-4.yaml --namespace=qos-example
View detailed information about the Pod:
kubectl get pod qos-demo-4 --namespace=qos-example --output=yaml
The output shows that Kubernetes gave the Pod a QoS class of Burstable:
spec:
containers:
...
name: qos-demo-4-ctr-1
resources:
requests:
memory: 200Mi
...
name: qos-demo-4-ctr-2
resources: {}
...
status:
qosClass: Burstable
Delete your Pod:
kubectl delete pod qos-demo-4 --namespace=qos-example
Clean up
Delete your namespace:
kubectl delete namespace qos-example
What's next
For app developers
For cluster administrators
4.3.7 - Assign Extended Resources to a Container
Kubernetes v1.23 [stable]
This page shows how to assign extended resources to a Container.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
To check the version, enter kubectl version.
Before you do this exercise, do the exercise in Advertise Extended Resources for a Node. That will configure one of your Nodes to advertise a dongle resource.
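In outline, that task advertises the dongle resource by patching the Node status through the Kubernetes API, roughly along these lines (the node name is a placeholder, and the exact procedure is in the linked task):
kubectl proxy &
curl --header "Content-Type: application/json-patch+json" \
  --request PATCH \
  --data '[{"op": "add", "path": "/status/capacity/example.com~1dongle", "value": "4"}]' \
  http://localhost:8001/api/v1/nodes/<your-node-name>/status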
Assign an extended resource to a Pod
To request an extended resource, include the resources:requests field in your Container manifest. Extended resources are fully qualified with any domain outside of *.kubernetes.io/. Valid extended resource names have the form example.com/foo, where example.com is replaced with your organization's domain and foo is a descriptive resource name.
Here is the configuration file for a Pod that has one Container:
apiVersion: v1
kind: Pod
metadata:
name: extended-resource-demo
spec:
containers:
- name: extended-resource-demo-ctr
image: nginx
resources:
requests:
example.com/dongle: 3
limits:
example.com/dongle: 3
In the configuration file, you can see that the Container requests 3 dongles.
Create a Pod:
kubectl apply -f https://k8s.io/examples/pods/resource/extended-resource-pod.yaml
Verify that the Pod is running:
kubectl get pod extended-resource-demo
Describe the Pod:
kubectl describe pod extended-resource-demo
The output shows dongle requests:
Limits:
example.com/dongle: 3
Requests:
example.com/dongle: 3
Attempt to create a second Pod
Here is the configuration file for a Pod that has one Container. The Container requests two dongles.
apiVersion: v1
kind: Pod
metadata:
name: extended-resource-demo-2
spec:
containers:
- name: extended-resource-demo-2-ctr
image: nginx
resources:
requests:
example.com/dongle: 2
limits:
example.com/dongle: 2
Kubernetes will not be able to satisfy the request for two dongles, because the first Pod used three of the four available dongles.
Attempt to create a Pod:
kubectl apply -f https://k8s.io/examples/pods/resource/extended-resource-pod-2.yaml
Describe the Pod:
kubectl describe pod extended-resource-demo-2
The output shows that the Pod cannot be scheduled, because there is no Node that has 2 dongles available:
Conditions:
Type Status
PodScheduled False
...
Events:
...
... Warning FailedScheduling pod (extended-resource-demo-2) failed to fit in any node
fit failure summary on nodes : Insufficient example.com/dongle (1)
View the Pod status:
kubectl get pod extended-resource-demo-2
The output shows that the Pod was created, but not scheduled to run on a Node. It has a status of Pending:
NAME READY STATUS RESTARTS AGE
extended-resource-demo-2 0/1 Pending 0 6m
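If you want to confirm the shortfall, you can inspect the Node's advertised capacity; a quick check (the grep filter is just one convenient way to spot the extended resource):
kubectl describe nodes | grep dongle
With the first Pod holding 3 of the 4 advertised dongles, only 1 remains, which is why a request for 2 cannot be satisfied.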
Clean up
Delete the Pods that you created for this exercise:
kubectl delete pod extended-resource-demo
kubectl delete pod extended-resource-demo-2
What's next
For application developers
For cluster administrators
4.3.8 - Configure a Pod to Use a Volume for Storage
This page shows how to configure a Pod to use a Volume for storage.
A Container's file system lives only as long as the Container does. So when a Container terminates and restarts, filesystem changes are lost. For more consistent storage that is independent of the Container, you can use a Volume. This is especially important for stateful applications, such as key-value stores (such as Redis) and databases.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
To check the version, enterkubectl version
.
Configure a volume for a Pod
In this exercise, you create a Pod that runs one Container. This Pod has a Volume of type emptyDir that lasts for the life of the Pod, even if the Container terminates and restarts. Here is the configuration file for the Pod:
apiVersion: v1
kind: Pod
metadata:
name: redis
spec:
containers:
- name: redis
image: redis
volumeMounts:
- name: redis-storage
mountPath: /data/redis
volumes:
- name: redis-storage
emptyDir: {}
- Create the Pod:
kubectl apply -f https://k8s.io/examples/pods/storage/redis.yaml
- Verify that the Pod's Container is running, and then watch for changes to the Pod:
kubectl get pod redis --watch
The output looks like this:
NAME    READY   STATUS    RESTARTS   AGE
redis   1/1     Running   0          13s
- In another terminal, get a shell to the running Container:
kubectl exec -it redis -- /bin/bash
- In your shell, go to /data/redis, and then create a file:
root@redis:/data# cd /data/redis/
root@redis:/data/redis# echo Hello > test-file
- In your shell, list the running processes:
root@redis:/data/redis# apt-get update
root@redis:/data/redis# apt-get install procps
root@redis:/data/redis# ps aux
The output is similar to this:
USER   PID  %CPU %MEM    VSZ   RSS TTY  STAT START TIME COMMAND
redis    1  0.1  0.1  33308  3828 ?    Ssl  00:46 0:00 redis-server *:6379
root    12  0.0  0.0  20228  3020 ?    Ss   00:47 0:00 /bin/bash
root    15  0.0  0.0  17500  2072 ?    R+   00:48 0:00 ps aux
- In your shell, kill the Redis process:
root@redis:/data/redis# kill <pid>
where <pid> is the Redis process ID (PID).
- In your original terminal, watch for changes to the Redis Pod. Eventually, you will see something like this:
NAME    READY   STATUS      RESTARTS   AGE
redis   1/1     Running     0          13s
redis   0/1     Completed   0          6m
redis   1/1     Running     1          6m
At this point, the Container has terminated and restarted. This is because the Redis Pod has a restartPolicy of Always.
- Get a shell into the restarted Container:
kubectl exec -it redis -- /bin/bash
- In your shell, go to /data/redis, and verify that test-file is still there:
root@redis:/data/redis# cd /data/redis/
root@redis:/data/redis# ls
test-file
- Delete the Pod that you created for this exercise:
kubectl delete pod redis
What's next
- See Volume.
- See Pod.
- In addition to the local disk storage provided by emptyDir, Kubernetes supports many different network-attached storage solutions, including PD on GCE and EBS on EC2, which are preferred for critical data and will handle details such as mounting and unmounting the devices on the nodes. See Volumes for more details.
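Related to that last point: emptyDir itself can be backed by RAM rather than node disk by setting the medium field; a minimal sketch (the sizeLimit value is illustrative):
volumes:
- name: cache-volume
  emptyDir:
    # Memory-backed scratch space; counts against the container's memory use
    medium: Memory
    sizeLimit: 1Gi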
4.3.9 - Configure a Pod to Use a PersistentVolume for Storage
This page shows you how to configure a Pod to use a PersistentVolumeClaim for storage. Here is a summary of the process:
- You, as cluster administrator, create a PersistentVolume backed by physical storage. You do not associate the volume with any Pod.
- You, now taking the role of a developer / cluster user, create a PersistentVolumeClaim that is automatically bound to a suitable PersistentVolume.
- You create a Pod that uses the above PersistentVolumeClaim for storage.
Before you begin
- You need to have a Kubernetes cluster that has only one Node, and the kubectl command-line tool must be configured to communicate with your cluster. If you do not already have a single-node cluster, you can create one by using Minikube.
- Familiarize yourself with the material in Persistent Volumes.
Create an index.html file on your Node
Open a shell to the single Node in your cluster. How you open a shell depends on how you set up your cluster. For example, if you are using Minikube, you can open a shell to your Node by entering minikube ssh.
In your shell on that Node, create a /mnt/data directory:
# This assumes that your Node uses "sudo" to run commands
# as the superuser
sudo mkdir /mnt/data
In the /mnt/data directory, create an index.html file:
# This again assumes that your Node uses "sudo" to run commands
# as the superuser
sudo sh -c "echo 'Hello from Kubernetes storage' > /mnt/data/index.html"
Note: If your Node uses a tool for superuser access other than sudo, you can usually make this work if you replace sudo with the name of the other tool.
Test that the index.html file exists:
cat /mnt/data/index.html
The output should be:
Hello from Kubernetes storage
You can now close the shell to your Node.
Create a PersistentVolume
In this exercise, you create a hostPath PersistentVolume. Kubernetes supports hostPath for development and testing on a single-node cluster. A hostPath PersistentVolume uses a file or directory on the Node to emulate network-attached storage.
In a production cluster, you would not use hostPath. Instead a cluster administrator would provision a network resource like a Google Compute Engine persistent disk, an NFS share, or an Amazon Elastic Block Store volume. Cluster administrators can also use StorageClasses to set up dynamic provisioning.
Here is the configuration file for the hostPath PersistentVolume:
apiVersion: v1
kind: PersistentVolume
metadata:
name: task-pv-volume
labels:
type: local
spec:
storageClassName: manual
capacity:
storage: 10Gi
accessModes:
- ReadWriteOnce
hostPath:
path: "/mnt/data"
The configuration file specifies that the volume is at /mnt/data on the cluster's Node. The configuration also specifies a size of 10 gibibytes and an access mode of ReadWriteOnce, which means the volume can be mounted as read-write by a single Node. It defines the StorageClass name manual for the PersistentVolume, which will be used to bind PersistentVolumeClaim requests to this PersistentVolume.
Create the PersistentVolume:
kubectl apply -f https://k8s.io/examples/pods/storage/pv-volume.yaml
View information about the PersistentVolume:
kubectl get pv task-pv-volume
The output shows that the PersistentVolume has a STATUS of Available. This means it has not yet been bound to a PersistentVolumeClaim.
NAME CAPACITY ACCESSMODES RECLAIMPOLICY STATUS CLAIM STORAGECLASS REASON AGE
task-pv-volume 10Gi RWO Retain Available manual 4s
Create a PersistentVolumeClaim
The next step is to create a PersistentVolumeClaim. Pods use PersistentVolumeClaims to request physical storage. In this exercise, you create a PersistentVolumeClaim that requests a volume of at least three gibibytes that can provide read-write access for at least one Node.
Here is the configuration file for the PersistentVolumeClaim:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: task-pv-claim
spec:
storageClassName: manual
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 3Gi
Create the PersistentVolumeClaim:
kubectl apply -f https://k8s.io/examples/pods/storage/pv-claim.yaml
After you create the PersistentVolumeClaim, the Kubernetes control plane looks for a PersistentVolume that satisfies the claim's requirements. If the control plane finds a suitable PersistentVolume with the same StorageClass, it binds the claim to the volume.
Look again at the PersistentVolume:
kubectl get pv task-pv-volume
Now the output shows a STATUS of Bound.
NAME CAPACITY ACCESSMODES RECLAIMPOLICY STATUS CLAIM STORAGECLASS REASON AGE
task-pv-volume 10Gi RWO Retain Bound default/task-pv-claim manual 2m
Look at the PersistentVolumeClaim:
kubectl get pvc task-pv-claim
The output shows that the PersistentVolumeClaim is bound to your PersistentVolume, task-pv-volume.
NAME STATUS VOLUME CAPACITY ACCESSMODES STORAGECLASS AGE
task-pv-claim Bound task-pv-volume 10Gi RWO manual 30s
Create a Pod
The next step is to create a Pod that uses your PersistentVolumeClaim as a volume.
Here is the configuration file for the Pod:
apiVersion: v1
kind: Pod
metadata:
name: task-pv-pod
spec:
volumes:
- name: task-pv-storage
persistentVolumeClaim:
claimName: task-pv-claim
containers:
- name: task-pv-container
image: nginx
ports:
- containerPort: 80
name: "http-server"
volumeMounts:
- mountPath: "/usr/share/nginx/html"
name: task-pv-storage
Notice that the Pod's configuration file specifies a PersistentVolumeClaim, but it does not specify a PersistentVolume. From the Pod's point of view, the claim is a volume.
Create the Pod:
kubectl apply -f https://k8s.io/examples/pods/storage/pv-pod.yaml
Verify that the container in the Pod is running:
kubectl get pod task-pv-pod
Get a shell to the container running in your Pod:
kubectl exec -it task-pv-pod -- /bin/bash
In your shell, verify that nginx is serving the index.html file from the hostPath volume:
# Be sure to run these 3 commands inside the root shell that comes from
# running "kubectl exec" in the previous step
apt update
apt install curl
curl http://localhost/
The output shows the text that you wrote to the index.html file on the hostPath volume:
Hello from Kubernetes storage
If you see that message, you have successfully configured a Pod to use storage from a PersistentVolumeClaim.
Clean up
Delete the Pod, the PersistentVolumeClaim and the PersistentVolume:
kubectl delete pod task-pv-pod
kubectl delete pvc task-pv-claim
kubectl delete pv task-pv-volume
If you don't already have a shell open to the Node in your cluster, open a new shell the same way that you did earlier.
In the shell on your Node, remove the file and directory that you created:
# This assumes that your Node uses "sudo" to run commands
# as the superuser
sudo rm /mnt/data/index.html
sudo rmdir /mnt/data
You can now close the shell to your Node.
Mounting the same persistentVolume in two places
apiVersion: v1
kind: Pod
metadata:
name: test
spec:
containers:
- name: test
image: nginx
volumeMounts:
# a mount for site-data
- name: config
mountPath: /usr/share/nginx/html
subPath: html
# another mount for nginx config
- name: config
mountPath: /etc/nginx/nginx.conf
subPath: nginx.conf
volumes:
- name: config
persistentVolumeClaim:
claimName: test-nfs-claim
You can perform 2 volume mounts on your nginx container:
- /usr/share/nginx/html for the static website
- /etc/nginx/nginx.conf for the default config
Access control
Storage configured with a group ID (GID) allows writing only by Pods using the same GID. Mismatched or missing GIDs cause permission denied errors. To reduce the need for coordination with users, an administrator can annotate a PersistentVolume with a GID. Then the GID is automatically added to any Pod that uses the PersistentVolume.
Use the pv.beta.kubernetes.io/gid annotation as follows:
apiVersion: v1
kind: PersistentVolume
metadata:
name: pv1
annotations:
pv.beta.kubernetes.io/gid: "1234"
When a Pod consumes a PersistentVolume that has a GID annotation, the annotated GID is applied to all containers in the Pod in the same way that GIDs specified in the Pod's security context are. Every GID, whether it originates from a PersistentVolume annotation or the Pod's specification, is applied to the first process run in each container.
What's next
- Learn more about PersistentVolumes.
- Read the Persistent Storage design document.
4.3.10 - Configure a Pod to Use a Projected Volume for Storage
This page shows how to use a projected Volume to mount several existing volume sources into the same directory. Currently, secret, configMap, downwardAPI, and serviceAccountToken volumes can be projected.
Note: serviceAccountToken is not a volume type.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
To check the version, enter kubectl version.
Configure a projected volume for a pod
In this exercise, you create username and password Secrets from local files. You then create a Pod that runs one container, using a projected
Volume to mount the Secrets into the same shared directory.
Here is the configuration file for the Pod:
apiVersion: v1
kind: Pod
metadata:
name: test-projected-volume
spec:
containers:
- name: test-projected-volume
image: busybox:1.28
args:
- sleep
- "86400"
volumeMounts:
- name: all-in-one
mountPath: "/projected-volume"
readOnly: true
volumes:
- name: all-in-one
projected:
sources:
- secret:
name: user
- secret:
name: pass
- Create the Secrets:
# Create files containing the username and password:
echo -n "admin" > ./username.txt
echo -n "1f2d1e2e67df" > ./password.txt
# Package these files into secrets:
kubectl create secret generic user --from-file=./username.txt
kubectl create secret generic pass --from-file=./password.txt
- Create the Pod:
kubectl apply -f https://k8s.io/examples/pods/storage/projected.yaml
- Verify that the Pod's container is running, and then watch for changes to the Pod:
kubectl get --watch pod test-projected-volume
The output looks like this:
NAME                    READY   STATUS    RESTARTS   AGE
test-projected-volume   1/1     Running   0          14s
- In another terminal, get a shell to the running container:
kubectl exec -it test-projected-volume -- /bin/sh
- In your shell, verify that the projected-volume directory contains your projected sources:
ls /projected-volume/
Clean up
Delete the Pod and the Secrets:
kubectl delete pod test-projected-volume
kubectl delete secret user pass
What's next
- Learn more about projected volumes.
- Read the all-in-one volume design document.
4.3.11 - Configure a Security Context for a Pod or Container
A security context defines privilege and access control settings for a Pod or Container. Security context settings include, but are not limited to:
- Discretionary Access Control: Permission to access an object, like a file, is based on user ID (UID) and group ID (GID).
- Security Enhanced Linux (SELinux): Objects are assigned security labels.
- Running as privileged or unprivileged.
- Linux Capabilities: Give a process some privileges, but not all the privileges of the root user.
- AppArmor: Use program profiles to restrict the capabilities of individual programs.
- Seccomp: Filter a process's system calls.
- allowPrivilegeEscalation: Controls whether a process can gain more privileges than its parent process. This bool directly controls whether the no_new_privs flag gets set on the container process. allowPrivilegeEscalation is always true when the container is run as privileged, or has CAP_SYS_ADMIN.
- readOnlyRootFilesystem: Mounts the container's root filesystem as read-only.
The above bullets are not a complete set of security context settings -- please see SecurityContext for a comprehensive list.
For more information about security mechanisms in Linux, see Overview of Linux Kernel Security Features
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
To check the version, enter kubectl version.
Set the security context for a Pod
To specify security settings for a Pod, include the securityContext field in the Pod specification. The securityContext field is a PodSecurityContext object. The security settings that you specify for a Pod apply to all Containers in the Pod.
Here is a configuration file for a Pod that has a securityContext and an emptyDir volume:
apiVersion: v1
kind: Pod
metadata:
name: security-context-demo
spec:
securityContext:
runAsUser: 1000
runAsGroup: 3000
fsGroup: 2000
volumes:
- name: sec-ctx-vol
emptyDir: {}
containers:
- name: sec-ctx-demo
image: busybox:1.28
command: [ "sh", "-c", "sleep 1h" ]
volumeMounts:
- name: sec-ctx-vol
mountPath: /data/demo
securityContext:
allowPrivilegeEscalation: false
In the configuration file, the runAsUser field specifies that for any Containers in the Pod, all processes run with user ID 1000. The runAsGroup field specifies the primary group ID of 3000 for all processes within any containers of the Pod. If this field is omitted, the primary group ID of the containers will be root (0). Any files created will also be owned by user 1000 and group 3000 when runAsGroup is specified. Since the fsGroup field is specified, all processes of the container are also part of the supplementary group ID 2000. The owner of the volume /data/demo and any files created in that volume will be group ID 2000.
Create the Pod:
kubectl apply -f https://k8s.io/examples/pods/security/security-context.yaml
Verify that the Pod's Container is running:
kubectl get pod security-context-demo
Get a shell to the running Container:
kubectl exec -it security-context-demo -- sh
In your shell, list the running processes:
ps
The output shows that the processes are running as user 1000, which is the value of runAsUser:
PID USER TIME COMMAND
1 1000 0:00 sleep 1h
6 1000 0:00 sh
...
In your shell, navigate to /data, and list the one directory:
cd /data
ls -l
The output shows that the /data/demo directory has group ID 2000, which is the value of fsGroup.
drwxrwsrwx 2 root 2000 4096 Jun 6 20:08 demo
In your shell, navigate to /data/demo, and create a file:
cd demo
echo hello > testfile
List the file in the /data/demo directory:
ls -l
The output shows that testfile has group ID 2000, which is the value of fsGroup.
-rw-r--r-- 1 1000 2000 6 Jun 6 20:08 testfile
Run the following command:
id
The output is similar to this:
uid=1000 gid=3000 groups=2000
From the output, you can see that gid is 3000, which is the same as the runAsGroup field. If runAsGroup were omitted, the gid would remain 0 (root), and the process would be able to interact with files that are owned by the root (0) group and groups that have the required group permissions for the root (0) group.
Exit your shell:
exit
Configure volume permission and ownership change policy for Pods
Kubernetes v1.23 [stable]
By default, Kubernetes recursively changes ownership and permissions for the contents of each
volume to match the fsGroup
specified in a Pod's securityContext
when that volume is
mounted.
For large volumes, checking and changing ownership and permissions can take a lot of time,
slowing Pod startup. You can use the fsGroupChangePolicy
field inside a securityContext
to control the way that Kubernetes checks and manages ownership and permissions
for a volume.
fsGroupChangePolicy - fsGroupChangePolicy
defines behavior for changing ownership
and permission of the volume before being exposed inside a Pod.
This field only applies to volume types that support fsGroup
controlled ownership and permissions.
This field has two possible values:
- OnRootMismatch: Only change permissions and ownership if permission and ownership of root directory does not match with expected permissions of the volume. This could help shorten the time it takes to change ownership and permission of a volume.
- Always: Always change permission and ownership of the volume when volume is mounted.
For example:
securityContext:
runAsUser: 1000
runAsGroup: 3000
fsGroup: 2000
fsGroupChangePolicy: "OnRootMismatch"
Delegating volume permission and ownership change to CSI driver
Kubernetes v1.23 [beta]
If you deploy a Container Storage Interface (CSI) driver which supports the VOLUME_MOUNT_GROUP NodeServiceCapability, the process of setting file ownership and permissions based on the fsGroup specified in the securityContext will be performed by the CSI driver instead of Kubernetes, provided that the DelegateFSGroupToCSIDriver Kubernetes feature gate is enabled. In this case, since Kubernetes doesn't perform any ownership and permission change, fsGroupChangePolicy does not take effect, and as specified by CSI, the driver is expected to mount the volume with the provided fsGroup, resulting in a volume that is readable/writable by the fsGroup.
Please refer to the KEP and the description of the VolumeCapability.MountVolume.volume_mount_group field in the CSI spec for more information.
Set the security context for a Container
To specify security settings for a Container, include the securityContext field in the Container manifest. The securityContext field is a SecurityContext object.
Security settings that you specify for a Container apply only to
the individual Container, and they override settings made at the Pod level when
there is overlap. Container settings do not affect the Pod's Volumes.
Here is the configuration file for a Pod that has one Container. Both the Pod and the Container have a securityContext field:
apiVersion: v1
kind: Pod
metadata:
name: security-context-demo-2
spec:
securityContext:
runAsUser: 1000
containers:
- name: sec-ctx-demo-2
image: gcr.io/google-samples/node-hello:1.0
securityContext:
runAsUser: 2000
allowPrivilegeEscalation: false
Create the Pod:
kubectl apply -f https://k8s.io/examples/pods/security/security-context-2.yaml
Verify that the Pod's Container is running:
kubectl get pod security-context-demo-2
Get a shell into the running Container:
kubectl exec -it security-context-demo-2 -- sh
In your shell, list the running processes:
ps aux
The output shows that the processes are running as user 2000. This is the value of runAsUser specified for the Container. It overrides the value 1000 that is specified for the Pod.
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
2000 1 0.0 0.0 4336 764 ? Ss 20:36 0:00 /bin/sh -c node server.js
2000 8 0.1 0.5 772124 22604 ? Sl 20:36 0:00 node server.js
...
Exit your shell:
exit
Set capabilities for a Container
With Linux capabilities, you can grant certain privileges to a process without granting all the privileges of the root user. To add or remove Linux capabilities for a Container, include the capabilities field in the securityContext section of the Container manifest.
First, see what happens when you don't include a capabilities field. Here is a configuration file that does not add or remove any Container capabilities:
apiVersion: v1
kind: Pod
metadata:
name: security-context-demo-3
spec:
containers:
- name: sec-ctx-3
image: gcr.io/google-samples/node-hello:1.0
Create the Pod:
kubectl apply -f https://k8s.io/examples/pods/security/security-context-3.yaml
Verify that the Pod's Container is running:
kubectl get pod security-context-demo-3
Get a shell into the running Container:
kubectl exec -it security-context-demo-3 -- sh
In your shell, list the running processes:
ps aux
The output shows the process IDs (PIDs) for the Container:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 4336 796 ? Ss 18:17 0:00 /bin/sh -c node server.js
root 5 0.1 0.5 772124 22700 ? Sl 18:17 0:00 node server.js
In your shell, view the status for process 1:
cd /proc/1
cat status
The output shows the capabilities bitmap for the process:
...
CapPrm: 00000000a80425fb
CapEff: 00000000a80425fb
...
Make a note of the capabilities bitmap, and then exit your shell:
exit
Next, run a Container that is the same as the preceding container, except that it has additional capabilities set.
Here is the configuration file for a Pod that runs one Container. The configuration
adds the CAP_NET_ADMIN
and CAP_SYS_TIME
capabilities:
apiVersion: v1
kind: Pod
metadata:
name: security-context-demo-4
spec:
containers:
- name: sec-ctx-4
image: gcr.io/google-samples/node-hello:1.0
securityContext:
capabilities:
add: ["NET_ADMIN", "SYS_TIME"]
Create the Pod:
kubectl apply -f https://k8s.io/examples/pods/security/security-context-4.yaml
Get a shell into the running Container:
kubectl exec -it security-context-demo-4 -- sh
In your shell, view the capabilities for process 1:
cd /proc/1
cat status
The output shows the capabilities bitmap for the process:
...
CapPrm: 00000000aa0435fb
CapEff: 00000000aa0435fb
...
Compare the capabilities of the two Containers:
00000000a80425fb
00000000aa0435fb
In the capability bitmap of the first container, bits 12 and 25 are clear. In the second container, bits 12 and 25 are set. Bit 12 is CAP_NET_ADMIN, and bit 25 is CAP_SYS_TIME. See capability.h for definitions of the capability constants.
Note: Linux capability constants have the form CAP_XXX. But when you list capabilities in your container manifest, you must omit the CAP_ portion of the constant. For example, to add CAP_SYS_TIME, include SYS_TIME in your list of capabilities.
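The capabilities field can also drop privileges. A common hardening pattern (a sketch, not part of this exercise) drops every capability and adds back only what the workload needs:
securityContext:
  capabilities:
    # Start from no capabilities, then re-grant the single one required
    drop: ["ALL"]
    add: ["NET_BIND_SERVICE"]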
Set the Seccomp Profile for a Container
To set the Seccomp profile for a Container, include the seccompProfile field in the securityContext section of your Pod or Container manifest. The seccompProfile field is a SeccompProfile object consisting of type and localhostProfile.
Valid options for type include RuntimeDefault, Unconfined, and Localhost. localhostProfile must only be set if type: Localhost. It indicates the path of the pre-configured profile on the node, relative to the kubelet's configured Seccomp profile location (configured with the --root-dir flag).
Here is an example that sets the Seccomp profile to the node's container runtime default profile:
...
securityContext:
seccompProfile:
type: RuntimeDefault
Here is an example that sets the Seccomp profile to a pre-configured file at <kubelet-root-dir>/seccomp/my-profiles/profile-allow.json:
...
securityContext:
seccompProfile:
type: Localhost
localhostProfile: my-profiles/profile-allow.json
Assign SELinux labels to a Container
To assign SELinux labels to a Container, include the seLinuxOptions field in the securityContext section of your Pod or Container manifest. The seLinuxOptions field is an SELinuxOptions object. Here's an example that applies an SELinux level:
...
securityContext:
seLinuxOptions:
level: "s0:c123,c456"
Discussion
The security context for a Pod applies to the Pod's Containers and also to the Pod's Volumes when applicable. Specifically, fsGroup and seLinuxOptions are applied to Volumes as follows:
- fsGroup: Volumes that support ownership management are modified to be owned and writable by the GID specified in fsGroup. See the Ownership Management design document for more details.
- seLinuxOptions: Volumes that support SELinux labeling are relabeled to be accessible by the label specified under seLinuxOptions. Usually you only need to set the level section. This sets the Multi-Category Security (MCS) label given to all Containers in the Pod as well as the Volumes.
Clean up
Delete the Pod:
kubectl delete pod security-context-demo
kubectl delete pod security-context-demo-2
kubectl delete pod security-context-demo-3
kubectl delete pod security-context-demo-4
What's next
4.3.12 - Configure Service Accounts for Pods
A service account provides an identity for processes that run in a Pod.
When you (a human) access the cluster (for example, using kubectl), you are authenticated by the apiserver as a particular User Account (currently this is usually admin, unless your cluster administrator has customized your cluster). Processes in containers inside pods can also contact the apiserver. When they do, they are authenticated as a particular Service Account (for example, default).
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
To check the version, enter kubectl version.
Use the Default Service Account to access the API server.
When you create a pod, if you do not specify a service account, it is automatically assigned the default service account in the same namespace. If you get the raw json or yaml for a pod you have created (for example, kubectl get pods/<podname> -o yaml), you can see the spec.serviceAccountName field has been automatically set.
You can access the API from inside a pod using automatically mounted service account credentials, as described in Accessing the Cluster. The API permissions of the service account depend on the authorization plugin and policy in use.
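For example, the mounted credentials live under /var/run/secrets/kubernetes.io/serviceaccount, and you can use them to call the API server directly; a minimal sketch, run from a shell inside the pod:
# Read the automatically mounted service account token
TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
# Call the API server through its in-cluster DNS name, trusting the mounted CA bundle
curl --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
  --header "Authorization: Bearer $TOKEN" \
  https://kubernetes.default.svc/api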
In version 1.6+, you can opt out of automounting API credentials for a service account by setting automountServiceAccountToken: false on the service account:
apiVersion: v1
kind: ServiceAccount
metadata:
name: build-robot
automountServiceAccountToken: false
...
In version 1.6+, you can also opt out of automounting API credentials for a particular pod:
apiVersion: v1
kind: Pod
metadata:
name: my-pod
spec:
serviceAccountName: build-robot
automountServiceAccountToken: false
...
The pod spec takes precedence over the service account if both specify an automountServiceAccountToken value.
Use Multiple Service Accounts.
Every namespace has a default service account resource called default. You can list this and any other serviceAccount resources in the namespace with this command:
kubectl get serviceaccounts
The output is similar to this:
NAME SECRETS AGE
default 1 1d
You can create additional ServiceAccount objects like this:
kubectl apply -f - <<EOF
apiVersion: v1
kind: ServiceAccount
metadata:
name: build-robot
EOF
The name of a ServiceAccount object must be a valid DNS subdomain name.
If you get a complete dump of the service account object, like this:
kubectl get serviceaccounts/build-robot -o yaml
The output is similar to this:
apiVersion: v1
kind: ServiceAccount
metadata:
creationTimestamp: 2015-06-16T00:12:59Z
name: build-robot
namespace: default
resourceVersion: "272500"
uid: 721ab723-13bc-11e5-aec2-42010af0021e
secrets:
- name: build-robot-token-bvbk5
then you will see that a token has automatically been created and is referenced by the service account.
You may use authorization plugins to set permissions on service accounts.
To use a non-default service account, set the spec.serviceAccountName field of a pod to the name of the service account you wish to use, as in the sketch below. The service account has to exist at the time the pod is created, or it will be rejected. You cannot update the service account of an already created pod.
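A minimal sketch of such a pod, reusing the build-robot service account created earlier (the pod and container names are illustrative):
apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
spec:
  # Run as build-robot instead of the namespace's default service account
  serviceAccountName: build-robot
  containers:
  - name: main
    image: nginx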
You can clean up the service account from this example like this:
kubectl delete serviceaccount/build-robot
Manually create a service account API token.
Suppose we have an existing service account named "build-robot" as mentioned above, and we create a new secret manually.
kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
name: build-robot-secret
annotations:
kubernetes.io/service-account.name: build-robot
type: kubernetes.io/service-account-token
EOF
Now you can confirm that the newly built secret is populated with an API token for the "build-robot" service account.
Any tokens for non-existent service accounts will be cleaned up by the token controller.
kubectl describe secrets/build-robot-secret
The output is similar to this:
Name: build-robot-secret
Namespace: default
Labels: <none>
Annotations: kubernetes.io/service-account.name: build-robot
kubernetes.io/service-account.uid: da68f9c6-9d26-11e7-b84e-002dc52800da
Type: kubernetes.io/service-account-token
Data
====
ca.crt: 1338 bytes
namespace: 7 bytes
token: ...
Note: The content of token is elided here.
Add ImagePullSecrets to a service account
Create an imagePullSecret
- Create an imagePullSecret, as described in Specifying ImagePullSecrets on a Pod.
kubectl create secret docker-registry myregistrykey --docker-server=DUMMY_SERVER \
    --docker-username=DUMMY_USERNAME --docker-password=DUMMY_DOCKER_PASSWORD \
    --docker-email=DUMMY_DOCKER_EMAIL
- Verify it has been created.
kubectl get secrets myregistrykey
The output is similar to this:
NAME            TYPE                              DATA   AGE
myregistrykey   kubernetes.io/.dockerconfigjson   1      1d
Add image pull secret to service account
Next, modify the default service account for the namespace to use this secret as an imagePullSecret.
kubectl patch serviceaccount default -p '{"imagePullSecrets": [{"name": "myregistrykey"}]}'
You can instead use kubectl edit, or manually edit the YAML manifests as shown below:
kubectl get serviceaccounts default -o yaml > ./sa.yaml
The output of the sa.yaml file is similar to this:
apiVersion: v1
kind: ServiceAccount
metadata:
creationTimestamp: 2015-08-07T22:02:39Z
name: default
namespace: default
resourceVersion: "243024"
uid: 052fb0f4-3d50-11e5-b066-42010af0d7b6
secrets:
- name: default-token-uudge
Using your editor of choice (for example vi), open the sa.yaml file, delete the line with key resourceVersion, add lines with imagePullSecrets:, and save.
The output of the sa.yaml file is similar to this:
apiVersion: v1
kind: ServiceAccount
metadata:
creationTimestamp: 2015-08-07T22:02:39Z
name: default
namespace: default
uid: 052fb0f4-3d50-11e5-b066-42010af0d7b6
secrets:
- name: default-token-uudge
imagePullSecrets:
- name: myregistrykey
Finally, replace the serviceaccount with the new updated sa.yaml file:
kubectl replace serviceaccount default -f ./sa.yaml
Verify imagePullSecrets was added to pod spec
Now, when a new Pod is created in the current namespace and using the default ServiceAccount, the new Pod has its spec.imagePullSecrets field set automatically:
kubectl run nginx --image=nginx --restart=Never
kubectl get pod nginx -o=jsonpath='{.spec.imagePullSecrets[0].name}{"\n"}'
The output is:
myregistrykey
Service Account Token Volume Projection
Kubernetes v1.20 [stable]
To enable and use token request projection, you must specify each of the following command line arguments to kube-apiserver:
- --service-account-issuer: It can be used as the Identifier of the service account token issuer. You can specify the --service-account-issuer argument multiple times; this can be useful to enable a non-disruptive change of the issuer. When this flag is specified multiple times, the first is used to generate tokens and all are used to determine which issuers are accepted. You must be running Kubernetes v1.22 or later to be able to specify --service-account-issuer multiple times.
- --service-account-key-file: File containing PEM-encoded x509 RSA or ECDSA private or public keys, used to verify ServiceAccount tokens. The specified file can contain multiple keys, and the flag can be specified multiple times with different files. If specified multiple times, tokens signed by any of the specified keys are considered valid by the Kubernetes API server.
- --service-account-signing-key-file: Path to the file that contains the current private key of the service account token issuer. The issuer signs issued ID tokens with this private key.
- --api-audiences (can be omitted): The service account token authenticator validates that tokens used against the API are bound to at least one of these audiences. If api-audiences is specified multiple times, tokens for any of the specified audiences are considered valid by the Kubernetes API server. If the --service-account-issuer flag is configured and this flag is not, this field defaults to a single element list containing the issuer URL.
The kubelet can also project a service account token into a Pod. You can specify desired properties of the token, such as the audience and the validity duration. These properties are not configurable on the default service account token. The service account token will also become invalid against the API when the Pod or the ServiceAccount is deleted.
This behavior is configured on a PodSpec using a ProjectedVolume type called ServiceAccountToken. To provide a pod with a token with an audience of "vault" and a validity duration of two hours, you would configure the following in your PodSpec:
apiVersion: v1
kind: Pod
metadata:
name: nginx
spec:
containers:
- image: nginx
name: nginx
volumeMounts:
- mountPath: /var/run/secrets/tokens
name: vault-token
serviceAccountName: build-robot
volumes:
- name: vault-token
projected:
sources:
- serviceAccountToken:
path: vault-token
expirationSeconds: 7200
audience: vault
Create the Pod:
kubectl create -f https://k8s.io/examples/pods/pod-projected-svc-token.yaml
The kubelet will request and store the token on behalf of the pod, make the token available to the pod at a configurable file path, and refresh the token as it approaches expiration. The kubelet proactively rotates the token if it is older than 80% of its total TTL, or if the token is older than 24 hours.
The application is responsible for reloading the token when it rotates. Periodic reloading (e.g. once every 5 minutes) is sufficient for most use cases.
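From inside the pod, the projected token is an ordinary file at the mount path configured above; a quick way to confirm it (the pod name and path match the example manifest):
kubectl exec nginx -- cat /var/run/secrets/tokens/vault-token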
Service Account Issuer Discovery
Kubernetes v1.21 [stable]
The Service Account Issuer Discovery feature is enabled when the Service Account Token Projection feature is enabled, as described above.
The issuer URL must comply with the OIDC Discovery Spec. In practice, this means it must use the https scheme, and should serve an OpenID provider configuration at {service-account-issuer}/.well-known/openid-configuration.
If the URL does not comply, the ServiceAccountIssuerDiscovery endpoints will not be registered, even if the feature is enabled.
The Service Account Issuer Discovery feature enables federation of Kubernetes service account tokens issued by a cluster (the identity provider) with external systems (relying parties).
When enabled, the Kubernetes API server provides an OpenID Provider Configuration document at /.well-known/openid-configuration and the associated JSON Web Key Set (JWKS) at /openid/v1/jwks. The OpenID Provider Configuration is sometimes referred to as the discovery document.
Clusters include a default RBAC ClusterRole called system:service-account-issuer-discovery. A default RBAC ClusterRoleBinding assigns this role to the system:serviceaccounts group, which all service accounts implicitly belong to. This allows pods running on the cluster to access the service account discovery document via their mounted service account token. Administrators may, additionally, choose to bind the role to system:authenticated or system:unauthenticated depending on their security requirements and which external systems they intend to federate with.
Note: The responses served at /.well-known/openid-configuration and /openid/v1/jwks are designed to be OIDC compatible, but not strictly OIDC compliant. Those documents contain only the parameters necessary to perform validation of Kubernetes service account tokens.
The JWKS response contains public keys that a relying party can use to validate the Kubernetes service account tokens. Relying parties first query for the OpenID Provider Configuration, and use the jwks_uri field in the response to find the JWKS.
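If your credentials are permitted to read them, you can fetch both documents straight through the API server; a quick check:
# Fetch the OpenID Provider Configuration (the discovery document)
kubectl get --raw /.well-known/openid-configuration
# Fetch the associated key set
kubectl get --raw /openid/v1/jwks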
In many cases, Kubernetes API servers are not available on the public internet, but public endpoints that serve cached responses from the API server can be made available by users or service providers. In these cases, it is possible to override the jwks_uri in the OpenID Provider Configuration so that it points to the public endpoint, rather than the API server's address, by passing the --service-account-jwks-uri flag to the API server. Like the issuer URL, the JWKS URI is required to use the https scheme.
What's next
See also:
4.3.13 - Pull an Image from a Private Registry
This page shows how to create a Pod that uses a Secret to pull an image from a private container image registry or repository. There are many private registries in use. This task uses Docker Hub as an example registry.
Before you begin
- You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
- To do this exercise, you need the docker command line tool, and a Docker ID for which you know the password.
- If you are using a different private container registry, you need the command line tool for that registry and any login information for the registry.
Log in to Docker Hub
On your laptop, you must authenticate with a registry in order to pull a private image.
Use the docker tool to log in to Docker Hub. See the log in section of Docker ID accounts for more information.
docker login
When prompted, enter your Docker ID, and then the credential you want to use (access token, or the password for your Docker ID).
The login process creates or updates a config.json file that holds an authorization token. Review how Kubernetes interprets this file.
View the config.json file:
cat ~/.docker/config.json
The output contains a section similar to this:
{
"auths": {
"https://index.docker.io/v1/": {
"auth": "c3R...zE2"
}
}
}
Note: If you use a Docker credentials store, you won't see that auth entry but a credsStore entry with the name of the store as value.
Create a Secret based on existing credentials
A Kubernetes cluster uses the Secret of kubernetes.io/dockerconfigjson type to authenticate with a container registry to pull a private image.
If you already ran docker login, you can copy that credential into Kubernetes:
kubectl create secret generic regcred \
--from-file=.dockerconfigjson=<path/to/.docker/config.json> \
--type=kubernetes.io/dockerconfigjson
If you need more control (for example, to set a namespace or a label on the new secret) then you can customise the Secret before storing it. Be sure to:
- set the name of the data item to .dockerconfigjson
- base64 encode the Docker configuration file and then paste that string, unbroken, as the value for field data[".dockerconfigjson"] (see the sketch after this list)
- set type to kubernetes.io/dockerconfigjson
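To produce that unbroken base64 string from your local Docker configuration, something like the following works (a sketch; -w 0 is the GNU coreutils option that disables line wrapping, so other platforms may need a different flag):
base64 -w 0 ~/.docker/config.json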
Example:
apiVersion: v1
kind: Secret
metadata:
name: myregistrykey
namespace: awesomeapps
data:
.dockerconfigjson: UmVhbGx5IHJlYWxseSByZWVlZWVlZWVlZWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWxsbGxsbGxsbGxsbGxsbGxsbGxsbGxsbGxsbGxsbGx5eXl5eXl5eXl5eXl5eXl5eXl5eSBsbGxsbGxsbGxsbGxsbG9vb29vb29vb29vb29vb29vb29vb29vb29vb25ubm5ubm5ubm5ubm5ubm5ubm5ubm5ubmdnZ2dnZ2dnZ2dnZ2dnZ2dnZ2cgYXV0aCBrZXlzCg==
type: kubernetes.io/dockerconfigjson
If you get the error message error: no objects passed to create, it may mean the base64 encoded string is invalid. If you get an error message like Secret "myregistrykey" is invalid: data[.dockerconfigjson]: invalid value ..., it means the base64 encoded string in the data was successfully decoded, but could not be parsed as a .docker/config.json file.
Create a Secret by providing credentials on the command line
Create this Secret, naming it regcred:
kubectl create secret docker-registry regcred --docker-server=<your-registry-server> --docker-username=<your-name> --docker-password=<your-pword> --docker-email=<your-email>
where:
- <your-registry-server> is your Private Docker Registry FQDN. Use https://index.docker.io/v1/ for DockerHub.
- <your-name> is your Docker username.
- <your-pword> is your Docker password.
- <your-email> is your Docker email.
You have successfully set your Docker credentials in the cluster as a Secret called regcred.
Note: Typing secrets on the command line may store them in your shell history unprotected, and those secrets might also be visible to other users on your PC during the time that kubectl is running.
Inspecting the Secret regcred
To understand the contents of the regcred Secret you created, start by viewing the Secret in YAML format:
kubectl get secret regcred --output=yaml
The output is similar to this:
apiVersion: v1
kind: Secret
metadata:
...
name: regcred
...
data:
.dockerconfigjson: eyJodHRwczovL2luZGV4L ... J0QUl6RTIifX0=
type: kubernetes.io/dockerconfigjson
The value of the .dockerconfigjson field is a base64 representation of your Docker credentials. To understand what is in the .dockerconfigjson field, convert the secret data to a readable format:
kubectl get secret regcred --output="jsonpath={.data.\.dockerconfigjson}" | base64 --decode
The output is similar to this:
{"auths":{"your.private.registry.example.com":{"username":"janedoe","password":"xxxxxxxxxxx","email":"jdoe@example.com","auth":"c3R...zE2"}}}
To understand what is in the auth field, convert the base64-encoded data to a readable format:
echo "c3R...zE2" | base64 --decode
The output, username and password concatenated with a :, is similar to this:
janedoe:xxxxxxxxxxx
Notice that the Secret data contains the authorization token similar to your local ~/.docker/config.json file. You have successfully set your Docker credentials as a Secret called regcred in the cluster.
Create a Pod that uses your Secret
Here is a manifest for an example Pod that needs access to your Docker credentials in regcred:
apiVersion: v1
kind: Pod
metadata:
name: private-reg
spec:
containers:
- name: private-reg-container
image: <your-private-image>
imagePullSecrets:
- name: regcred
Download the above file onto your computer:
curl -L -o my-private-reg-pod.yaml https://k8s.io/examples/pods/private-reg-pod.yaml
In the file my-private-reg-pod.yaml, replace <your-private-image> with the path to an image in a private registry such as:
your.private.registry.example.com/janedoe/jdoe-private:v1
To pull the image from the private registry, Kubernetes needs credentials. The imagePullSecrets field in the configuration file specifies that Kubernetes should get the credentials from a Secret named regcred.
Create a Pod that uses your Secret, and verify that the Pod is running:
kubectl apply -f my-private-reg-pod.yaml
kubectl get pod private-reg
What's next
- Learn more about Secrets
- or read the API reference for Secret
- Learn more about using a private registry.
- Learn more about adding image pull secrets to a service account.
- See kubectl create secret docker-registry.
- See the imagePullSecrets field within the container definitions of a Pod.
4.3.14 - Configure Liveness, Readiness and Startup Probes
This page shows how to configure liveness, readiness and startup probes for containers.
The kubelet uses liveness probes to know when to restart a container. For example, liveness probes could catch a deadlock, where an application is running, but unable to make progress. Restarting a container in such a state can help to make the application more available despite bugs.
The kubelet uses readiness probes to know when a container is ready to start accepting traffic. A Pod is considered ready when all of its containers are ready. One use of this signal is to control which Pods are used as backends for Services. When a Pod is not ready, it is removed from Service load balancers.
The kubelet uses startup probes to know when a container application has started. If such a probe is configured, it disables liveness and readiness checks until it succeeds, making sure those probes don't interfere with the application startup. This can be used to adopt liveness checks on slow starting containers, avoiding them getting killed by the kubelet before they are up and running.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
Define a liveness command
Many applications running for long periods of time eventually transition to broken states, and cannot recover except by being restarted. Kubernetes provides liveness probes to detect and remedy such situations.
In this exercise, you create a Pod that runs a container based on the k8s.gcr.io/busybox image. Here is the configuration file for the Pod:
apiVersion: v1
kind: Pod
metadata:
labels:
test: liveness
name: liveness-exec
spec:
containers:
- name: liveness
image: k8s.gcr.io/busybox
args:
- /bin/sh
- -c
- touch /tmp/healthy; sleep 30; rm -rf /tmp/healthy; sleep 600
livenessProbe:
exec:
command:
- cat
- /tmp/healthy
initialDelaySeconds: 5
periodSeconds: 5
In the configuration file, you can see that the Pod has a single Container. The periodSeconds field specifies that the kubelet should perform a liveness probe every 5 seconds. The initialDelaySeconds field tells the kubelet that it should wait 5 seconds before performing the first probe. To perform a probe, the kubelet executes the command cat /tmp/healthy in the target container. If the command succeeds, it returns 0, and the kubelet considers the container to be alive and healthy. If the command returns a non-zero value, the kubelet kills the container and restarts it.
When the container starts, it executes this command:
/bin/sh -c "touch /tmp/healthy; sleep 30; rm -rf /tmp/healthy; sleep 600"
For the first 30 seconds of the container's life, there is a /tmp/healthy file. So during the first 30 seconds, the command cat /tmp/healthy returns a success code. After 30 seconds, cat /tmp/healthy returns a failure code.
Create the Pod:
kubectl apply -f https://k8s.io/examples/pods/probe/exec-liveness.yaml
Within 30 seconds, view the Pod events:
kubectl describe pod liveness-exec
The output indicates that no liveness probes have failed yet:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 11s default-scheduler Successfully assigned default/liveness-exec to node01
Normal Pulling 9s kubelet, node01 Pulling image "k8s.gcr.io/busybox"
Normal Pulled 7s kubelet, node01 Successfully pulled image "k8s.gcr.io/busybox"
Normal Created 7s kubelet, node01 Created container liveness
Normal Started 7s kubelet, node01 Started container liveness
After 35 seconds, view the Pod events again:
kubectl describe pod liveness-exec
At the bottom of the output, there are messages indicating that the liveness probes have failed, and the containers have been killed and recreated.
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 57s default-scheduler Successfully assigned default/liveness-exec to node01
Normal Pulling 55s kubelet, node01 Pulling image "k8s.gcr.io/busybox"
Normal Pulled 53s kubelet, node01 Successfully pulled image "k8s.gcr.io/busybox"
Normal Created 53s kubelet, node01 Created container liveness
Normal Started 53s kubelet, node01 Started container liveness
Warning Unhealthy 10s (x3 over 20s) kubelet, node01 Liveness probe failed: cat: can't open '/tmp/healthy': No such file or directory
Normal Killing 10s kubelet, node01 Container liveness failed liveness probe, will be restarted
Wait another 30 seconds, and verify that the container has been restarted:
kubectl get pod liveness-exec
The output shows that RESTARTS has been incremented:
NAME READY STATUS RESTARTS AGE
liveness-exec 1/1 Running 1 1m
Define a liveness HTTP request
Another kind of liveness probe uses an HTTP GET request. Here is the configuration file for a Pod that runs a container based on the k8s.gcr.io/liveness image.
apiVersion: v1
kind: Pod
metadata:
labels:
test: liveness
name: liveness-http
spec:
containers:
- name: liveness
image: k8s.gcr.io/liveness
args:
- /server
livenessProbe:
httpGet:
path: /healthz
port: 8080
httpHeaders:
- name: Custom-Header
value: Awesome
initialDelaySeconds: 3
periodSeconds: 3
In the configuration file, you can see that the Pod has a single container.
The periodSeconds field specifies that the kubelet should perform a liveness probe every 3 seconds. The initialDelaySeconds field tells the kubelet that it should wait 3 seconds before performing the first probe. To perform a probe, the kubelet sends an HTTP GET request to the server that is running in the container and listening on port 8080. If the handler for the server's /healthz path returns a success code, the kubelet considers the container to be alive and healthy. If the handler returns a failure code, the kubelet kills the container and restarts it.
Any code greater than or equal to 200 and less than 400 indicates success. Any other code indicates failure.
You can see the source code for the server in server.go.
For the first 10 seconds that the container is alive, the /healthz handler returns a status of 200. After that, the handler returns a status of 500.
http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
duration := time.Now().Sub(started)
if duration.Seconds() > 10 {
w.WriteHeader(500)
w.Write([]byte(fmt.Sprintf("error: %v", duration.Seconds())))
} else {
w.WriteHeader(200)
w.Write([]byte("ok"))
}
})
The kubelet starts performing health checks 3 seconds after the container starts. So the first couple of health checks will succeed. But after 10 seconds, the health checks will fail, and the kubelet will kill and restart the container.
To try the HTTP liveness check, create a Pod:
kubectl apply -f https://k8s.io/examples/pods/probe/http-liveness.yaml
After 10 seconds, view Pod events to verify that liveness probes have failed and the container has been restarted:
kubectl describe pod liveness-http
Note: In releases prior to v1.13 (including v1.13), if the environment variable http_proxy (or HTTP_PROXY) is set on the node where a Pod is running, the HTTP liveness probe uses that proxy. In releases after v1.13, local HTTP proxy environment variable settings do not affect the HTTP liveness probe.
Define a TCP liveness probe
A third type of liveness probe uses a TCP socket. With this configuration, the kubelet will attempt to open a socket to your container on the specified port. If it can establish a connection, the container is considered healthy; if it can't, it is considered a failure.
apiVersion: v1
kind: Pod
metadata:
name: goproxy
labels:
app: goproxy
spec:
containers:
- name: goproxy
image: k8s.gcr.io/goproxy:0.1
ports:
- containerPort: 8080
readinessProbe:
tcpSocket:
port: 8080
initialDelaySeconds: 5
periodSeconds: 10
livenessProbe:
tcpSocket:
port: 8080
initialDelaySeconds: 15
periodSeconds: 20
As you can see, configuration for a TCP check is quite similar to an HTTP check.
This example uses both readiness and liveness probes. The kubelet will send the first readiness probe 5 seconds after the container starts. This will attempt to connect to the goproxy container on port 8080. If the probe succeeds, the Pod will be marked as ready. The kubelet will continue to run this check every 10 seconds.
In addition to the readiness probe, this configuration includes a liveness probe. The kubelet will run the first liveness probe 15 seconds after the container starts. Similar to the readiness probe, this will attempt to connect to the goproxy container on port 8080. If the liveness probe fails, the container will be restarted.
To try the TCP liveness check, create a Pod:
kubectl apply -f https://k8s.io/examples/pods/probe/tcp-liveness-readiness.yaml
After 15 seconds, view Pod events to verify that the liveness probes have run:
kubectl describe pod goproxy
Define a gRPC liveness probe
Kubernetes v1.23 [alpha]
If your application implements the gRPC Health Checking Protocol, the kubelet can be configured to use it for application liveness checks. You must enable the GRPCContainerProbe feature gate in order to configure checks that rely on gRPC.
Here is an example manifest:
apiVersion: v1
kind: Pod
metadata:
name: etcd-with-grpc
spec:
containers:
- name: etcd
image: k8s.gcr.io/etcd:3.5.1-0
command: [ "/usr/local/bin/etcd", "--data-dir", "/var/lib/etcd", "--listen-client-urls", "http://0.0.0.0:2379", "--advertise-client-urls", "http://127.0.0.1:2379", "--log-level", "debug"]
ports:
- containerPort: 2379
livenessProbe:
grpc:
port: 2379
initialDelaySeconds: 10
To use a gRPC probe, port must be configured. If the health endpoint is
configured on a non-default service, you must also specify the service.
Configuration problems (for example: incorrect port or service, unimplemented health checking protocol) are considered a probe failure, similar to HTTP and TCP probes.
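For instance, a probe for a health endpoint registered under a non-default service name might look like the following sketch (the service name here is hypothetical, not part of the example above):
livenessProbe:
  grpc:
    port: 2379
    service: my-liveness-service   # hypothetical service name registered with the gRPC health server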
To try the gRPC liveness check, create a Pod using the command below. In this example, the etcd pod is configured to use the gRPC liveness probe.
kubectl apply -f https://k8s.io/examples/pods/probe/grpc-liveness.yaml
After 15 seconds, view Pod events to verify that the liveness check has not failed:
kubectl describe pod etcd-with-grpc
Before Kubernetes 1.23, gRPC health probes were often implemented using grpc-health-probe, as described in the blog post Health checking gRPC servers on Kubernetes. The behavior of the built-in gRPC probes is similar to that of grpc-health-probe. When migrating from grpc-health-probe to built-in probes, remember the following differences:
- Built-in probes run against the pod IP address, unlike grpc-health-probe, which often runs against 127.0.0.1. Be sure to configure your gRPC endpoint to listen on the Pod's IP address.
- Built-in probes do not support any authentication parameters (like -tls).
- There are no error codes for built-in probes. All errors are considered probe failures.
- If the ExecProbeTimeout feature gate is set to false, grpc-health-probe does not respect the timeoutSeconds setting (which defaults to 1s), while a built-in probe would fail on timeout.
Use a named port
You can use a named port for HTTP and TCP probes. (gRPC probes do not support named ports.)
For example:
ports:
- name: liveness-port
containerPort: 8080
hostPort: 8080
livenessProbe:
httpGet:
path: /healthz
port: liveness-port
Protect slow starting containers with startup probes
Sometimes, you have to deal with legacy applications that might require
an additional startup time on their first initialization.
In such cases, it can be tricky to set up liveness probe parameters without
compromising the fast response to deadlocks that motivated such a probe.
The trick is to set up a startup probe with the same command, HTTP or TCP
check, with a failureThreshold * periodSeconds long enough to cover the
worst-case startup time.
So, the previous example would become:
ports:
- name: liveness-port
containerPort: 8080
hostPort: 8080
livenessProbe:
httpGet:
path: /healthz
port: liveness-port
failureThreshold: 1
periodSeconds: 10
startupProbe:
httpGet:
path: /healthz
port: liveness-port
failureThreshold: 30
periodSeconds: 10
Thanks to the startup probe, the application will have a maximum of 5 minutes
(30 * 10 = 300s) to finish its startup.
Once the startup probe has succeeded once, the liveness probe takes over to
provide a fast response to container deadlocks.
If the startup probe never succeeds, the container is killed after 300s and
subject to the pod's restartPolicy.
Define readiness probes
Sometimes, applications are temporarily unable to serve traffic. For example, an application might need to load large data or configuration files during startup, or depend on external services after startup. In such cases, you don't want to kill the application, but you don't want to send it requests either. Kubernetes provides readiness probes to detect and mitigate these situations. A pod with containers reporting that they are not ready does not receive traffic through Kubernetes Services.
Readiness probes are configured similarly to liveness probes. The only difference
is that you use the readinessProbe field instead of the livenessProbe field.
readinessProbe:
exec:
command:
- cat
- /tmp/healthy
initialDelaySeconds: 5
periodSeconds: 5
Configuration for HTTP and TCP readiness probes also remains identical to liveness probes.
Readiness and liveness probes can be used in parallel for the same container. Using both can ensure that traffic does not reach a container that is not ready for it, and that containers are restarted when they fail.
Configure Probes
Probes have a number of fields that you can use to more precisely control the behavior of liveness and readiness checks:
- initialDelaySeconds: Number of seconds after the container has started before liveness or readiness probes are initiated. Defaults to 0 seconds. Minimum value is 0.
- periodSeconds: How often (in seconds) to perform the probe. Defaults to 10 seconds. Minimum value is 1.
- timeoutSeconds: Number of seconds after which the probe times out. Defaults to 1 second. Minimum value is 1.
- successThreshold: Minimum consecutive successes for the probe to be considered successful after having failed. Defaults to 1. Must be 1 for liveness and startup probes. Minimum value is 1.
- failureThreshold: When a probe fails, Kubernetes will try failureThreshold times before giving up. Giving up in the case of a liveness probe means restarting the container. In the case of a readiness probe, the Pod will be marked Unready. Defaults to 3. Minimum value is 1.
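As a sketch of how these fields combine, here is a probe using all of them; the endpoint and the specific values are hypothetical illustrations, not recommendations:
livenessProbe:
  httpGet:
    path: /healthz          # hypothetical health endpoint
    port: 8080
  initialDelaySeconds: 5    # wait 5 seconds after the container starts
  periodSeconds: 10         # probe every 10 seconds
  timeoutSeconds: 2         # fail an attempt that takes longer than 2 seconds
  successThreshold: 1       # must be 1 for liveness and startup probes
  failureThreshold: 3       # restart the container after 3 consecutive failures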
Before Kubernetes 1.20, the field timeoutSeconds
was not respected for exec probes:
probes continued running indefinitely, even past their configured deadline,
until a result was returned.
This defect was corrected in Kubernetes v1.20. You may have been relying on the previous behavior,
even without realizing it, as the default timeout is 1 second.
As a cluster administrator, you can disable the feature gate ExecProbeTimeout
(set it to false) on each kubelet to restore the behavior from older versions,
then remove that override once all the exec probes in the cluster have a
timeoutSeconds value set.
If you have pods that are impacted by the default 1 second timeout,
you should update their probe timeout so that you're ready for the
eventual removal of that feature gate.
With the fix of the defect, for exec probes on Kubernetes 1.20+ with the dockershim container runtime, the process inside the container may keep running even after the probe returns a failure because of the timeout.
HTTP probes
HTTP probes have additional fields that can be set on httpGet:
- host: Host name to connect to, defaults to the pod IP. You probably want to set "Host" in httpHeaders instead.
- scheme: Scheme to use for connecting to the host (HTTP or HTTPS). Defaults to HTTP.
- path: Path to access on the HTTP server. Defaults to /.
- httpHeaders: Custom headers to set in the request. HTTP allows repeated headers.
- port: Name or number of the port to access on the container. Number must be in the range 1 to 65535.
For an HTTP probe, the kubelet sends an HTTP request to the specified path and
port to perform the check. The kubelet sends the probe to the pod's IP address,
unless the address is overridden by the optional host field in httpGet. If the
scheme field is set to HTTPS, the kubelet sends an HTTPS request, skipping the
certificate verification. In most scenarios, you do not want to set the host
field.
Here's one scenario where you would set it. Suppose the container listens on
127.0.0.1 and the Pod's hostNetwork field is true. Then host, under httpGet,
should be set to 127.0.0.1. If your pod relies on virtual hosts, which is
probably the more common case, you should not use host, but rather set the
Host header in httpHeaders.
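For example, a probe aimed at such a virtual-host setup might look like this sketch (the host name is hypothetical):
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
    httpHeaders:
    - name: Host
      value: my-vhost.example.com   # hypothetical virtual host served by the container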
For an HTTP probe, the kubelet sends two request headers in addition to the
mandatory Host header: User-Agent and Accept. The default values for these
headers are kube-probe/1.23 (where 1.23 is the version of the kubelet) and
*/*, respectively.
You can override the default headers by defining .httpHeaders for the probe; for example:
livenessProbe:
httpGet:
httpHeaders:
- name: Accept
value: application/json
startupProbe:
httpGet:
httpHeaders:
- name: User-Agent
value: MyUserAgent
You can also remove these two headers by defining them with an empty value.
livenessProbe:
httpGet:
httpHeaders:
- name: Accept
value: ""
startupProbe:
httpGet:
httpHeaders:
- name: User-Agent
value: ""
TCP probes
For a TCP probe, the kubelet makes the probe connection at the node, not in the
pod, which means that you cannot use a service name in the host parameter
since the kubelet is unable to resolve it.
Probe-level terminationGracePeriodSeconds
Kubernetes v1.22 [beta]
Prior to release 1.21, the pod-level terminationGracePeriodSeconds
was used
for terminating a container that failed its liveness or startup probe. This
coupling was unintended and may have resulted in failed containers taking an
unusually long time to restart when a pod-level terminationGracePeriodSeconds
was set.
In 1.21 and beyond, when the feature gate ProbeTerminationGracePeriod is
enabled, users can specify a probe-level terminationGracePeriodSeconds as
part of the probe specification. When the feature gate is enabled, and both a
pod- and probe-level terminationGracePeriodSeconds are set, the kubelet will
use the probe-level value.
As of Kubernetes 1.22, the ProbeTerminationGracePeriod
feature gate is only
available on the API Server. The kubelet always honors the probe-level
terminationGracePeriodSeconds
field if it is present on a Pod.
If you have existing Pods where the terminationGracePeriodSeconds
field is set and
you no longer wish to use per-probe termination grace periods, you must delete
those existing Pods.
When you (or the control plane, or some other component) create replacement
Pods, and the feature gate ProbeTerminationGracePeriod
is disabled, then the
API server ignores the Pod-level terminationGracePeriodSeconds
field, even if
a Pod or pod template specifies it.
For example,
spec:
terminationGracePeriodSeconds: 3600 # pod-level
containers:
- name: test
image: ...
ports:
- name: liveness-port
containerPort: 8080
hostPort: 8080
livenessProbe:
httpGet:
path: /healthz
port: liveness-port
failureThreshold: 1
periodSeconds: 60
# Override pod-level terminationGracePeriodSeconds #
terminationGracePeriodSeconds: 60
Probe-level terminationGracePeriodSeconds
cannot be set for readiness probes.
It will be rejected by the API server.
What's next
- Learn more about Container Probes.
You can also read the API references for:
- Pod, and specifically: container(s) and probe(s).
4.3.15 - Assign Pods to Nodes
This page shows how to assign a Kubernetes Pod to a particular node in a Kubernetes cluster.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
To check the version, enter kubectl version.
Add a label to a node
- List the nodes in your cluster, along with their labels:
kubectl get nodes --show-labels
The output is similar to this:
NAME      STATUS   ROLES    AGE   VERSION   LABELS
worker0   Ready    <none>   1d    v1.13.0   ...,kubernetes.io/hostname=worker0
worker1   Ready    <none>   1d    v1.13.0   ...,kubernetes.io/hostname=worker1
worker2   Ready    <none>   1d    v1.13.0   ...,kubernetes.io/hostname=worker2
- Choose one of your nodes, and add a label to it:
kubectl label nodes <your-node-name> disktype=ssd
where <your-node-name> is the name of your chosen node.
- Verify that your chosen node has a disktype=ssd label:
kubectl get nodes --show-labels
The output is similar to this:
NAME      STATUS   ROLES    AGE   VERSION   LABELS
worker0   Ready    <none>   1d    v1.13.0   ...,disktype=ssd,kubernetes.io/hostname=worker0
worker1   Ready    <none>   1d    v1.13.0   ...,kubernetes.io/hostname=worker1
worker2   Ready    <none>   1d    v1.13.0   ...,kubernetes.io/hostname=worker2
In the preceding output, you can see that the worker0 node has a disktype=ssd label.
Create a pod that gets scheduled to your chosen node
This pod configuration file describes a pod that has a node selector, disktype: ssd. This means that the pod will get scheduled on a node that has a disktype=ssd label.
apiVersion: v1
kind: Pod
metadata:
name: nginx
labels:
env: test
spec:
containers:
- name: nginx
image: nginx
imagePullPolicy: IfNotPresent
nodeSelector:
disktype: ssd
- Use the configuration file to create a pod that will get scheduled on your chosen node:
kubectl apply -f https://k8s.io/examples/pods/pod-nginx.yaml
- Verify that the pod is running on your chosen node:
kubectl get pods --output=wide
The output is similar to this:
NAME    READY   STATUS    RESTARTS   AGE   IP           NODE
nginx   1/1     Running   0          13s   10.200.0.4   worker0
Create a pod that gets scheduled to a specific node
You can also schedule a pod to one specific node by setting nodeName.
apiVersion: v1
kind: Pod
metadata:
name: nginx
spec:
nodeName: foo-node # schedule pod to specific node
containers:
- name: nginx
image: nginx
imagePullPolicy: IfNotPresent
Use the configuration file to create a pod that will get scheduled on foo-node only.
What's next
- Learn more about labels and selectors.
- Learn more about nodes.
4.3.16 - Assign Pods to Nodes using Node Affinity
This page shows how to assign a Kubernetes Pod to a particular node using Node Affinity in a Kubernetes cluster.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
Your Kubernetes server must be at or later than version v1.10. To check the version, enter kubectl version.
Add a label to a node
- List the nodes in your cluster, along with their labels:
kubectl get nodes --show-labels
The output is similar to this:
NAME      STATUS   ROLES    AGE   VERSION   LABELS
worker0   Ready    <none>   1d    v1.13.0   ...,kubernetes.io/hostname=worker0
worker1   Ready    <none>   1d    v1.13.0   ...,kubernetes.io/hostname=worker1
worker2   Ready    <none>   1d    v1.13.0   ...,kubernetes.io/hostname=worker2
- Choose one of your nodes, and add a label to it:
kubectl label nodes <your-node-name> disktype=ssd
where <your-node-name> is the name of your chosen node.
- Verify that your chosen node has a disktype=ssd label:
kubectl get nodes --show-labels
The output is similar to this:
NAME      STATUS   ROLES    AGE   VERSION   LABELS
worker0   Ready    <none>   1d    v1.13.0   ...,disktype=ssd,kubernetes.io/hostname=worker0
worker1   Ready    <none>   1d    v1.13.0   ...,kubernetes.io/hostname=worker1
worker2   Ready    <none>   1d    v1.13.0   ...,kubernetes.io/hostname=worker2
In the preceding output, you can see that the worker0 node has a disktype=ssd label.
Schedule a Pod using required node affinity
This manifest describes a Pod that has a requiredDuringSchedulingIgnoredDuringExecution node affinity, disktype: ssd.
This means that the pod will get scheduled only on a node that has a disktype=ssd label.
apiVersion: v1
kind: Pod
metadata:
name: nginx
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: disktype
operator: In
values:
- ssd
containers:
- name: nginx
image: nginx
imagePullPolicy: IfNotPresent
- Apply the manifest to create a Pod that is scheduled onto your chosen node:
kubectl apply -f https://k8s.io/examples/pods/pod-nginx-required-affinity.yaml
- Verify that the pod is running on your chosen node:
kubectl get pods --output=wide
The output is similar to this:
NAME    READY   STATUS    RESTARTS   AGE   IP           NODE
nginx   1/1     Running   0          13s   10.200.0.4   worker0
Schedule a Pod using preferred node affinity
This manifest describes a Pod that has a preferredDuringSchedulingIgnoredDuringExecution node affinity, disktype: ssd.
This means that the pod will prefer a node that has a disktype=ssd label.
apiVersion: v1
kind: Pod
metadata:
name: nginx
spec:
affinity:
nodeAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 1
preference:
matchExpressions:
- key: disktype
operator: In
values:
- ssd
containers:
- name: nginx
image: nginx
imagePullPolicy: IfNotPresent
- Apply the manifest to create a Pod that is scheduled onto your chosen node:
kubectl apply -f https://k8s.io/examples/pods/pod-nginx-preferred-affinity.yaml
- Verify that the pod is running on your chosen node:
kubectl get pods --output=wide
The output is similar to this:
NAME    READY   STATUS    RESTARTS   AGE   IP           NODE
nginx   1/1     Running   0          13s   10.200.0.4   worker0
What's next
Learn more about Node Affinity.
4.3.17 - Configure Pod Initialization
This page shows how to use an Init Container to initialize a Pod before an application Container runs.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
To check the version, enter kubectl version.
Create a Pod that has an Init Container
In this exercise you create a Pod that has one application Container and one Init Container. The init container runs to completion before the application container starts.
Here is the configuration file for the Pod:
apiVersion: v1
kind: Pod
metadata:
name: init-demo
spec:
containers:
- name: nginx
image: nginx
ports:
- containerPort: 80
volumeMounts:
- name: workdir
mountPath: /usr/share/nginx/html
# These containers are run during pod initialization
initContainers:
- name: install
image: busybox:1.28
command:
- wget
- "-O"
- "/work-dir/index.html"
- http://info.cern.ch
volumeMounts:
- name: workdir
mountPath: "/work-dir"
dnsPolicy: Default
volumes:
- name: workdir
emptyDir: {}
In the configuration file, you can see that the Pod has a Volume that the init container and the application container share.
The init container mounts the shared Volume at /work-dir, and the application
container mounts the shared Volume at /usr/share/nginx/html. The init container
runs the following command and then terminates:
wget -O /work-dir/index.html http://info.cern.ch
Notice that the init container writes the index.html file in the root directory of the nginx server.
Create the Pod:
kubectl apply -f https://k8s.io/examples/pods/init-containers.yaml
Verify that the nginx container is running:
kubectl get pod init-demo
The output shows that the nginx container is running:
NAME READY STATUS RESTARTS AGE
init-demo 1/1 Running 0 1m
Get a shell into the nginx container running in the init-demo Pod:
kubectl exec -it init-demo -- /bin/bash
In your shell, send a GET request to the nginx server:
root@nginx:~# apt-get update
root@nginx:~# apt-get install curl
root@nginx:~# curl localhost
The output shows that nginx is serving the web page that was written by the init container:
<html><head></head><body><header>
<title>http://info.cern.ch</title>
</header>
<h1>http://info.cern.ch - home of the first website</h1>
...
<li><a href="http://info.cern.ch/hypertext/WWW/TheProject.html">Browse the first website</a></li>
...
What's next
- Learn more about communicating between Containers running in the same Pod.
- Learn more about Init Containers.
- Learn more about Volumes.
- Learn more about Debugging Init Containers
4.3.18 - Attach Handlers to Container Lifecycle Events
This page shows how to attach handlers to Container lifecycle events. Kubernetes supports the postStart and preStop events. Kubernetes sends the postStart event immediately after a Container is started, and it sends the preStop event immediately before the Container is terminated. A Container may specify one handler per event.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
To check the version, enter kubectl version.
Define postStart and preStop handlers
In this exercise, you create a Pod that has one Container. The Container has handlers for the postStart and preStop events.
Here is the configuration file for the Pod:
apiVersion: v1
kind: Pod
metadata:
name: lifecycle-demo
spec:
containers:
- name: lifecycle-demo-container
image: nginx
lifecycle:
postStart:
exec:
command: ["/bin/sh", "-c", "echo Hello from the postStart handler > /usr/share/message"]
preStop:
exec:
command: ["/bin/sh","-c","nginx -s quit; while killall -0 nginx; do sleep 1; done"]
In the configuration file, you can see that the postStart command writes a message
file to the Container's /usr/share
directory. The preStop command shuts down
nginx gracefully. This is helpful if the Container is being terminated because of a failure.
Create the Pod:
kubectl apply -f https://k8s.io/examples/pods/lifecycle-events.yaml
Verify that the Container in the Pod is running:
kubectl get pod lifecycle-demo
Get a shell into the Container running in your Pod:
kubectl exec -it lifecycle-demo -- /bin/bash
In your shell, verify that the postStart handler created the message file:
root@lifecycle-demo:/# cat /usr/share/message
The output shows the text written by the postStart handler:
Hello from the postStart handler
Discussion
Kubernetes sends the postStart event immediately after the Container is created. There is no guarantee, however, that the postStart handler is called before the Container's entrypoint is called. The postStart handler runs asynchronously relative to the Container's code, but Kubernetes' management of the container blocks until the postStart handler completes. The Container's status is not set to RUNNING until the postStart handler completes.
Kubernetes sends the preStop event immediately before the Container is terminated. Kubernetes' management of the Container blocks until the preStop handler completes, unless the Pod's grace period expires. For more details, see Pod Lifecycle.
What's next
- Learn more about Container lifecycle hooks.
- Learn more about the lifecycle of a Pod.
Reference
4.3.19 - Configure a Pod to Use a ConfigMap
Many applications rely on configuration which is used during application initialization or at runtime. Most of the time, you need to adjust the values assigned to configuration parameters. ConfigMaps are the Kubernetes way to inject configuration data into application Pods. ConfigMaps allow you to decouple configuration artifacts from image content to keep containerized applications portable. This page provides a series of usage examples demonstrating how to create ConfigMaps and configure Pods using data stored in ConfigMaps.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
To check the version, enter kubectl version.
Create a ConfigMap
You can use either kubectl create configmap or a ConfigMap generator in kustomization.yaml to create a ConfigMap. Note that kubectl has supported kustomization.yaml since 1.14.
Create a ConfigMap Using kubectl create configmap
Use the kubectl create configmap
command to create ConfigMaps from directories, files, or literal values:
kubectl create configmap <map-name> <data-source>
where <map-name> is the name you want to assign to the ConfigMap and <data-source> is the directory, file, or literal value to draw the data from. The name of a ConfigMap object must be a valid DNS subdomain name.
When you are creating a ConfigMap based on a file, the key in the <data-source> defaults to the basename of the file, and the value defaults to the file content.
You can use kubectl describe or kubectl get to retrieve information about a ConfigMap.
Create ConfigMaps from directories
You can use kubectl create configmap
to create a ConfigMap from multiple files in the same directory. When you are creating a ConfigMap based on a directory, kubectl identifies files whose basename is a valid key in the directory and packages each of those files into the new ConfigMap. Any directory entries except regular files are ignored (e.g. subdirectories, symlinks, devices, pipes, etc).
For example:
# Create the local directory
mkdir -p configure-pod-container/configmap/
# Download the sample files into `configure-pod-container/configmap/` directory
wget https://kubernetes.io/examples/configmap/game.properties -O configure-pod-container/configmap/game.properties
wget https://kubernetes.io/examples/configmap/ui.properties -O configure-pod-container/configmap/ui.properties
# Create the configmap
kubectl create configmap game-config --from-file=configure-pod-container/configmap/
The above command packages each file, in this case game.properties and ui.properties, in the configure-pod-container/configmap/ directory into the game-config ConfigMap. You can display details of the ConfigMap using the following command:
kubectl describe configmaps game-config
The output is similar to this:
Name: game-config
Namespace: default
Labels: <none>
Annotations: <none>
Data
====
game.properties:
----
enemies=aliens
lives=3
enemies.cheat=true
enemies.cheat.level=noGoodRotten
secret.code.passphrase=UUDDLRLRBABAS
secret.code.allowed=true
secret.code.lives=30
ui.properties:
----
color.good=purple
color.bad=yellow
allow.textmode=true
how.nice.to.look=fairlyNice
The game.properties and ui.properties files in the configure-pod-container/configmap/ directory are represented in the data section of the ConfigMap.
kubectl get configmaps game-config -o yaml
The output is similar to this:
apiVersion: v1
kind: ConfigMap
metadata:
creationTimestamp: 2016-02-18T18:52:05Z
name: game-config
namespace: default
resourceVersion: "516"
uid: b4952dc3-d670-11e5-8cd0-68f728db1985
data:
game.properties: |
enemies=aliens
lives=3
enemies.cheat=true
enemies.cheat.level=noGoodRotten
secret.code.passphrase=UUDDLRLRBABAS
secret.code.allowed=true
secret.code.lives=30
ui.properties: |
color.good=purple
color.bad=yellow
allow.textmode=true
how.nice.to.look=fairlyNice
Create ConfigMaps from files
You can use kubectl create configmap
to create a ConfigMap from an individual file, or from multiple files.
For example,
kubectl create configmap game-config-2 --from-file=configure-pod-container/configmap/game.properties
would produce the following ConfigMap:
kubectl describe configmaps game-config-2
where the output is similar to this:
Name: game-config-2
Namespace: default
Labels: <none>
Annotations: <none>
Data
====
game.properties:
----
enemies=aliens
lives=3
enemies.cheat=true
enemies.cheat.level=noGoodRotten
secret.code.passphrase=UUDDLRLRBABAS
secret.code.allowed=true
secret.code.lives=30
You can pass in the --from-file argument multiple times to create a ConfigMap from multiple data sources.
kubectl create configmap game-config-2 --from-file=configure-pod-container/configmap/game.properties --from-file=configure-pod-container/configmap/ui.properties
You can display details of the game-config-2 ConfigMap using the following command:
kubectl describe configmaps game-config-2
The output is similar to this:
Name: game-config-2
Namespace: default
Labels: <none>
Annotations: <none>
Data
====
game.properties:
----
enemies=aliens
lives=3
enemies.cheat=true
enemies.cheat.level=noGoodRotten
secret.code.passphrase=UUDDLRLRBABAS
secret.code.allowed=true
secret.code.lives=30
ui.properties:
----
color.good=purple
color.bad=yellow
allow.textmode=true
how.nice.to.look=fairlyNice
When kubectl creates a ConfigMap from inputs that are not ASCII or UTF-8, the tool puts these into the binaryData field of the ConfigMap, and not in data. Both text and binary data sources can be combined in one ConfigMap.
If you want to view the binaryData keys (and their values) in a ConfigMap, you can run kubectl get configmap -o jsonpath='{.binaryData}' <name>.
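For illustration, a ConfigMap mixing both kinds of data might look like the following sketch; the name and the truncated base64 payload are hypothetical:
apiVersion: v1
kind: ConfigMap
metadata:
  name: mixed-config            # hypothetical name
data:
  app.name: demo                # plain UTF-8 value
binaryData:
  logo.png: iVBORw0KGgo...      # base64-encoded binary value (truncated for illustration)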
Use the option --from-env-file to create a ConfigMap from an env-file, for example:
# Env-files contain a list of environment variables.
# These syntax rules apply:
# Each line in an env file has to be in VAR=VAL format.
# Lines beginning with # (i.e. comments) are ignored.
# Blank lines are ignored.
# There is no special handling of quotation marks (i.e. they will be part of the ConfigMap value).
# Download the sample files into `configure-pod-container/configmap/` directory
wget https://kubernetes.io/examples/configmap/game-env-file.properties -O configure-pod-container/configmap/game-env-file.properties
wget https://kubernetes.io/examples/configmap/ui-env-file.properties -O configure-pod-container/configmap/ui-env-file.properties
# The env-file `game-env-file.properties` looks like below
cat configure-pod-container/configmap/game-env-file.properties
enemies=aliens
lives=3
allowed="true"
# This comment and the empty line above it are ignored
kubectl create configmap game-config-env-file \
--from-env-file=configure-pod-container/configmap/game-env-file.properties
would produce the following ConfigMap:
kubectl get configmap game-config-env-file -o yaml
where the output is similar to this:
apiVersion: v1
kind: ConfigMap
metadata:
creationTimestamp: 2017-12-27T18:36:28Z
name: game-config-env-file
namespace: default
resourceVersion: "809965"
uid: d9d1ca5b-eb34-11e7-887b-42010a8002b8
data:
allowed: '"true"'
enemies: aliens
lives: "3"
Starting with Kubernetes v1.23, kubectl supports specifying the --from-env-file argument multiple times to create a ConfigMap from multiple data sources.
kubectl create configmap config-multi-env-files \
--from-env-file=configure-pod-container/configmap/game-env-file.properties \
--from-env-file=configure-pod-container/configmap/ui-env-file.properties
would produce the following ConfigMap:
kubectl get configmap config-multi-env-files -o yaml
where the output is similar to this:
apiVersion: v1
kind: ConfigMap
metadata:
creationTimestamp: 2017-12-27T18:38:34Z
name: config-multi-env-files
namespace: default
resourceVersion: "810136"
uid: 252c4572-eb35-11e7-887b-42010a8002b8
data:
allowed: '"true"'
color: purple
enemies: aliens
how: fairlyNice
lives: "3"
textmode: "true"
Define the key to use when creating a ConfigMap from a file
You can define a key other than the file name to use in the data section of your ConfigMap when using the --from-file argument:
kubectl create configmap game-config-3 --from-file=<my-key-name>=<path-to-file>
where <my-key-name> is the key you want to use in the ConfigMap and <path-to-file> is the location of the data source file you want the key to represent.
For example:
kubectl create configmap game-config-3 --from-file=game-special-key=configure-pod-container/configmap/game.properties
would produce the following ConfigMap:
kubectl get configmaps game-config-3 -o yaml
where the output is similar to this:
apiVersion: v1
kind: ConfigMap
metadata:
creationTimestamp: 2016-02-18T18:54:22Z
name: game-config-3
namespace: default
resourceVersion: "530"
uid: 05f8da22-d671-11e5-8cd0-68f728db1985
data:
game-special-key: |
enemies=aliens
lives=3
enemies.cheat=true
enemies.cheat.level=noGoodRotten
secret.code.passphrase=UUDDLRLRBABAS
secret.code.allowed=true
secret.code.lives=30
Create ConfigMaps from literal values
You can use kubectl create configmap with the --from-literal argument to define a literal value from the command line:
kubectl create configmap special-config --from-literal=special.how=very --from-literal=special.type=charm
You can pass in multiple key-value pairs. Each pair provided on the command line is represented as a separate entry in the data section of the ConfigMap.
kubectl get configmaps special-config -o yaml
The output is similar to this:
apiVersion: v1
kind: ConfigMap
metadata:
creationTimestamp: 2016-02-18T19:14:38Z
name: special-config
namespace: default
resourceVersion: "651"
uid: dadce046-d673-11e5-8cd0-68f728db1985
data:
special.how: very
special.type: charm
Create a ConfigMap from a generator
kubectl has supported kustomization.yaml since 1.14.
You can also create a ConfigMap from generators and then apply it to create the object on the API server. The generators should be specified in a kustomization.yaml inside a directory.
Generate ConfigMaps from files
For example, to generate a ConfigMap from the file configure-pod-container/configmap/game.properties:
# Create a kustomization.yaml file with ConfigMapGenerator
cat <<EOF >./kustomization.yaml
configMapGenerator:
- name: game-config-4
files:
- configure-pod-container/configmap/game.properties
EOF
Apply the kustomization directory to create the ConfigMap object.
kubectl apply -k .
configmap/game-config-4-m9dm2f92bt created
You can check that the ConfigMap was created like this:
kubectl get configmap
NAME DATA AGE
game-config-4-m9dm2f92bt 1 37s
kubectl describe configmaps/game-config-4-m9dm2f92bt
Name: game-config-4-m9dm2f92bt
Namespace: default
Labels: <none>
Annotations: kubectl.kubernetes.io/last-applied-configuration:
{"apiVersion":"v1","data":{"game.properties":"enemies=aliens\nlives=3\nenemies.cheat=true\nenemies.cheat.level=noGoodRotten\nsecret.code.p...
Data
====
game.properties:
----
enemies=aliens
lives=3
enemies.cheat=true
enemies.cheat.level=noGoodRotten
secret.code.passphrase=UUDDLRLRBABAS
secret.code.allowed=true
secret.code.lives=30
Events: <none>
Note that the generated ConfigMap name has a suffix appended by hashing the contents. This ensures that a new ConfigMap is generated each time the content is modified.
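When other resources in the same kustomization refer to the ConfigMap by its original name, kustomize rewrites those references to the hashed name. A sketch of such a layout, where deployment.yaml is a hypothetical manifest that references game-config-4, might look like:
# kustomization.yaml (sketch)
resources:
- deployment.yaml        # hypothetical workload that references "game-config-4"
configMapGenerator:
- name: game-config-4
  files:
  - configure-pod-container/configmap/game.properties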
Define the key to use when generating a ConfigMap from a file
You can define a key other than the file name to use in the ConfigMap generator.
For example, to generate a ConfigMap from the file configure-pod-container/configmap/game.properties with the key game-special-key:
# Create a kustomization.yaml file with ConfigMapGenerator
cat <<EOF >./kustomization.yaml
configMapGenerator:
- name: game-config-5
files:
- game-special-key=configure-pod-container/configmap/game.properties
EOF
Apply the kustomization directory to create the ConfigMap object.
kubectl apply -k .
configmap/game-config-5-m67dt67794 created
Generate ConfigMaps from Literals
To generate a ConfigMap from the literals special.type=charm and special.how=very, you can specify the ConfigMap generator in kustomization.yaml as:
# Create a kustomization.yaml file with ConfigMapGenerator
cat <<EOF >./kustomization.yaml
configMapGenerator:
- name: special-config-2
literals:
- special.how=very
- special.type=charm
EOF
Apply the kustomization directory to create the ConfigMap object.
kubectl apply -k .
configmap/special-config-2-c92b5mmcf2 created
Define container environment variables using ConfigMap data
Define a container environment variable with data from a single ConfigMap
- Define an environment variable as a key-value pair in a ConfigMap:
kubectl create configmap special-config --from-literal=special.how=very
- Assign the special.how value defined in the ConfigMap to the SPECIAL_LEVEL_KEY environment variable in the Pod specification.
apiVersion: v1
kind: Pod
metadata:
name: dapi-test-pod
spec:
containers:
- name: test-container
image: k8s.gcr.io/busybox
command: [ "/bin/sh", "-c", "env" ]
env:
# Define the environment variable
- name: SPECIAL_LEVEL_KEY
valueFrom:
configMapKeyRef:
# The ConfigMap containing the value you want to assign to SPECIAL_LEVEL_KEY
name: special-config
# Specify the key associated with the value
key: special.how
restartPolicy: Never
Create the Pod:
kubectl create -f https://kubernetes.io/examples/pods/pod-single-configmap-env-variable.yaml
Now, the Pod's output includes the environment variable SPECIAL_LEVEL_KEY=very.
Define container environment variables with data from multiple ConfigMaps
- As with the previous example, create the ConfigMaps first.
apiVersion: v1
kind: ConfigMap
metadata:
  name: special-config
  namespace: default
data:
  special.how: very
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: env-config
  namespace: default
data:
  log_level: INFO
Create the ConfigMap:
kubectl create -f https://kubernetes.io/examples/configmap/configmaps.yaml
- Define the environment variables in the Pod specification.
apiVersion: v1
kind: Pod
metadata:
  name: dapi-test-pod
spec:
  containers:
  - name: test-container
    image: k8s.gcr.io/busybox
    command: [ "/bin/sh", "-c", "env" ]
    env:
    - name: SPECIAL_LEVEL_KEY
      valueFrom:
        configMapKeyRef:
          name: special-config
          key: special.how
    - name: LOG_LEVEL
      valueFrom:
        configMapKeyRef:
          name: env-config
          key: log_level
  restartPolicy: Never
Create the Pod:
kubectl create -f https://kubernetes.io/examples/pods/pod-multiple-configmap-env-variable.yaml
Now, the Pod's output includes the environment variables SPECIAL_LEVEL_KEY=very and LOG_LEVEL=INFO.
Configure all key-value pairs in a ConfigMap as container environment variables
- Create a ConfigMap containing multiple key-value pairs.
apiVersion: v1
kind: ConfigMap
metadata:
  name: special-config
  namespace: default
data:
  SPECIAL_LEVEL: very
  SPECIAL_TYPE: charm
Create the ConfigMap:
kubectl create -f https://kubernetes.io/examples/configmap/configmap-multikeys.yaml
- Use envFrom to define all of the ConfigMap's data as container environment variables. The key from the ConfigMap becomes the environment variable name in the Pod.
apiVersion: v1
kind: Pod
metadata:
name: dapi-test-pod
spec:
containers:
- name: test-container
image: k8s.gcr.io/busybox
command: [ "/bin/sh", "-c", "env" ]
envFrom:
- configMapRef:
name: special-config
restartPolicy: Never
Create the Pod:
kubectl create -f https://kubernetes.io/examples/pods/pod-configmap-envFrom.yaml
Now, the Pod's output includes the environment variables SPECIAL_LEVEL=very and SPECIAL_TYPE=charm.
Use ConfigMap-defined environment variables in Pod commands
You can use ConfigMap-defined environment variables in the command and args of a container using the $(VAR_NAME) Kubernetes substitution syntax.
For example, the following Pod specification
apiVersion: v1
kind: Pod
metadata:
name: dapi-test-pod
spec:
containers:
- name: test-container
image: k8s.gcr.io/busybox
command: [ "/bin/echo", "$(SPECIAL_LEVEL_KEY) $(SPECIAL_TYPE_KEY)" ]
env:
- name: SPECIAL_LEVEL_KEY
valueFrom:
configMapKeyRef:
name: special-config
key: SPECIAL_LEVEL
- name: SPECIAL_TYPE_KEY
valueFrom:
configMapKeyRef:
name: special-config
key: SPECIAL_TYPE
restartPolicy: Never
created by running
kubectl create -f https://kubernetes.io/examples/pods/pod-configmap-env-var-valueFrom.yaml
produces the following output in the test-container container:
very charm
Add ConfigMap data to a Volume
As explained in Create ConfigMaps from files, when you create a ConfigMap using --from-file, the filename becomes a key stored in the data section of the ConfigMap. The file contents become the key's value.
The examples in this section refer to a ConfigMap named special-config, shown below.
apiVersion: v1
kind: ConfigMap
metadata:
name: special-config
namespace: default
data:
SPECIAL_LEVEL: very
SPECIAL_TYPE: charm
Create the ConfigMap:
kubectl create -f https://kubernetes.io/examples/configmap/configmap-multikeys.yaml
Populate a Volume with data stored in a ConfigMap
Add the ConfigMap name under the volumes section of the Pod specification.
This adds the ConfigMap data to the directory specified as volumeMounts.mountPath (in this case, /etc/config).
The command section lists directory files with names that match the keys in the ConfigMap.
apiVersion: v1
kind: Pod
metadata:
name: dapi-test-pod
spec:
containers:
- name: test-container
image: k8s.gcr.io/busybox
command: [ "/bin/sh", "-c", "ls /etc/config/" ]
volumeMounts:
- name: config-volume
mountPath: /etc/config
volumes:
- name: config-volume
configMap:
# Provide the name of the ConfigMap containing the files you want
# to add to the container
name: special-config
restartPolicy: Never
Create the Pod:
kubectl create -f https://kubernetes.io/examples/pods/pod-configmap-volume.yaml
When the pod runs, the command ls /etc/config/ produces the output below:
SPECIAL_LEVEL
SPECIAL_TYPE
If there are some files in the /etc/config/ directory, they will be deleted.
Add ConfigMap data to a specific path in the Volume
Use the path field to specify the desired file path for specific ConfigMap items.
In this case, the SPECIAL_LEVEL item will be mounted in the config-volume volume at /etc/config/keys.
apiVersion: v1
kind: Pod
metadata:
name: dapi-test-pod
spec:
containers:
- name: test-container
image: k8s.gcr.io/busybox
command: [ "/bin/sh","-c","cat /etc/config/keys" ]
volumeMounts:
- name: config-volume
mountPath: /etc/config
volumes:
- name: config-volume
configMap:
name: special-config
items:
- key: SPECIAL_LEVEL
path: keys
restartPolicy: Never
Create the Pod:
kubectl create -f https://kubernetes.io/examples/pods/pod-configmap-volume-specific-key.yaml
When the pod runs, the command cat /etc/config/keys produces the output below:
very
As before, all previous files in the /etc/config/ directory will be deleted.
Project keys to specific paths and file permissions
You can project keys to specific paths and specific permissions on a per-file basis. The Secrets user guide explains the syntax.
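As a sketch of that syntax applied to the special-config ConfigMap above, the mode field sets per-file permissions; the value here is illustrative:
volumes:
- name: config-volume
  configMap:
    name: special-config
    items:
    - key: SPECIAL_LEVEL
      path: keys
      mode: 0400   # illustrative: readable only by the owner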
Optional References
A ConfigMap reference may be marked "optional". If the ConfigMap is non-existent, the mounted volume will be empty. If the ConfigMap exists, but the referenced key is non-existent the path will be absent beneath the mount point.
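A minimal sketch of an optional volume reference, assuming a ConfigMap name that may not exist yet:
volumes:
- name: config-volume
  configMap:
    name: maybe-missing-config   # hypothetical name
    optional: true               # mount an empty volume if this ConfigMap is absent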
Mounted ConfigMaps are updated automatically
When a mounted ConfigMap is updated, the projected content is eventually updated too. This applies in the case where an optionally referenced ConfigMap comes into existence after a pod has started.
Kubelet checks whether the mounted ConfigMap is fresh on every periodic sync. However, it uses its local TTL-based cache for getting the current value of the ConfigMap. As a result, the total delay from the moment when the ConfigMap is updated to the moment when new keys are projected to the pod can be as long as kubelet sync period (1 minute by default) + TTL of ConfigMaps cache (1 minute by default) in kubelet. You can trigger an immediate refresh by updating one of the pod's annotations.
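One hedged way to trigger such a refresh is to change any annotation on the Pod, which prompts the kubelet to re-sync it; the annotation key and value below are hypothetical:
metadata:
  annotations:
    config-reload: "v2"   # hypothetical marker; changing its value prompts a kubelet sync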
Understanding ConfigMaps and Pods
The ConfigMap API resource stores configuration data as key-value pairs. The data can be consumed in pods or provide the configurations for system components such as controllers. ConfigMap is similar to Secrets, but provides a means of working with strings that don't contain sensitive information. Users and system components alike can store configuration data in ConfigMap.
Think of a ConfigMap as being similar to the Linux /etc directory and its contents. For example, if you create a Kubernetes Volume from a ConfigMap, each data item in the ConfigMap is represented by an individual file in the volume.
The ConfigMap's data field contains the configuration data. As shown in the example below, this can be simple (like individual properties defined using --from-literal) or complex (like configuration files or JSON blobs defined using --from-file).
apiVersion: v1
kind: ConfigMap
metadata:
creationTimestamp: 2016-02-18T19:14:38Z
name: example-config
namespace: default
data:
# example of a simple property defined using --from-literal
example.property.1: hello
example.property.2: world
# example of a complex property defined using --from-file
example.property.file: |-
property.1=value-1
property.2=value-2
property.3=value-3
Restrictions
- You must create a ConfigMap before referencing it in a Pod specification (unless you mark the ConfigMap as "optional"; see the sketch after this list). If you reference a ConfigMap that doesn't exist, the Pod won't start. Likewise, references to keys that don't exist in the ConfigMap will prevent the pod from starting.
- If you use envFrom to define environment variables from ConfigMaps, keys that are considered invalid will be skipped. The pod will be allowed to start, but the invalid names will be recorded in the event log (InvalidVariableNames). The log message lists each skipped key. For example:
kubectl get events
The output is similar to this:
LASTSEEN FIRSTSEEN COUNT NAME          KIND SUBOBJECT TYPE    REASON                          SOURCE               MESSAGE
0s       0s        1     dapi-test-pod Pod            Warning InvalidEnvironmentVariableNames {kubelet, 127.0.0.1} Keys [1badkey, 2alsobad] from the EnvFrom configMap default/myconfig were skipped since they are considered invalid environment variable names.
- ConfigMaps reside in a specific Namespace. A ConfigMap can only be referenced by pods residing in the same namespace.
- You can't use ConfigMaps for static pods, because the kubelet does not support this.
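As a sketch of the "optional" marker mentioned in the first restriction, an environment variable reference can tolerate a missing ConfigMap or key like this:
env:
- name: SPECIAL_LEVEL_KEY
  valueFrom:
    configMapKeyRef:
      name: special-config
      key: special.how
      optional: true   # the container starts even if the ConfigMap or key is missing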
What's next
- Follow a real world example of Configuring Redis using a ConfigMap.
4.3.20 - Share Process Namespace between Containers in a Pod
Kubernetes v1.17 [stable]
This page shows how to configure process namespace sharing for a pod. When process namespace sharing is enabled, processes in a container are visible to all other containers in that pod.
You can use this feature to configure cooperating containers, such as a log handler sidecar container, or to troubleshoot container images that don't include debugging utilities like a shell.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
Your Kubernetes server must be at or later than version v1.10. To check the version, enter kubectl version.
Configure a Pod
Process namespace sharing is enabled using the shareProcessNamespace field of v1.PodSpec. For example:
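The linked manifest is not reproduced on this page; a sketch of its relevant parts, assuming the nginx and shell containers used in the steps below, looks like this:
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  shareProcessNamespace: true   # enables the shared process namespace
  containers:
  - name: nginx
    image: nginx
  - name: shell
    image: busybox
    securityContext:
      capabilities:
        add:
        - SYS_PTRACE          # needed to signal processes in other containers
    stdin: true
    tty: true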
- Create the pod nginx on your cluster:
kubectl apply -f https://k8s.io/examples/pods/share-process-namespace.yaml
- Attach to the shell container and run ps:
kubectl attach -it nginx -c shell
If you don't see a command prompt, try pressing enter.
/ # ps ax
PID   USER     TIME  COMMAND
    1 root      0:00 /pause
    8 root      0:00 nginx: master process nginx -g daemon off;
   14 101       0:00 nginx: worker process
   15 root      0:00 sh
   21 root      0:00 ps ax
You can signal processes in other containers. For example, send SIGHUP to
nginx to restart the worker process. This requires the SYS_PTRACE capability.
/ # kill -HUP 8
/ # ps ax
PID USER TIME COMMAND
1 root 0:00 /pause
8 root 0:00 nginx: master process nginx -g daemon off;
15 root 0:00 sh
22 101 0:00 nginx: worker process
23 root 0:00 ps ax
It's even possible to access the file system of another container using the /proc/$pid/root link.
/ # head /proc/8/root/etc/nginx/nginx.conf
user nginx;
worker_processes 1;
error_log /var/log/nginx/error.log warn;
pid /var/run/nginx.pid;
events {
worker_connections 1024;
Understanding Process Namespace Sharing
Pods share many resources so it makes sense they would also share a process namespace. Some container images may expect to be isolated from other containers, though, so it's important to understand these differences:
- The container process no longer has PID 1. Some container images refuse to start without PID 1 (for example, containers using systemd) or run commands like kill -HUP 1 to signal the container process. In pods with a shared process namespace, kill -HUP 1 will signal the pod sandbox (/pause in the above example).
- Processes are visible to other containers in the pod. This includes all information visible in /proc, such as passwords that were passed as arguments or environment variables. These are protected only by regular Unix permissions.
- Container filesystems are visible to other containers in the pod through the /proc/$pid/root link. This makes debugging easier, but it also means that filesystem secrets are protected only by filesystem permissions.
4.3.21 - Create static Pods
Static Pods are managed directly by the kubelet daemon on a specific node, without the API server observing them. Unlike Pods that are managed by the control plane (for example, a Deployment), static Pods are watched by the kubelet itself, which restarts them if they fail.
Static Pods are always bound to one Kubelet on a specific node.
The kubelet automatically tries to create a mirror Pod on the Kubernetes API server for each static Pod. This means that the Pods running on a node are visible on the API server, but cannot be controlled from there. The Pod names will be suffixed with the node hostname with a leading hyphen.
The spec of a static Pod cannot refer to other API objects (e.g., ServiceAccount, ConfigMap, Secret, etc).
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
To check the version, enter kubectl version.
This page assumes you're using CRI-O to run Pods, and that your nodes are running the Fedora operating system. Instructions for other distributions or Kubernetes installations may vary.
Create a static pod
You can configure a static Pod with either a file system hosted configuration file or a web hosted configuration file.
Filesystem-hosted static Pod manifest
Manifests are standard Pod definitions in JSON or YAML format in a specific directory. Use the staticPodPath: <the directory> field in the kubelet configuration file; the kubelet periodically scans that directory and creates/deletes static Pods as YAML/JSON files appear/disappear there.
Note that the kubelet will ignore files starting with dots when scanning the specified directory.
For example, this is how to start a simple web server as a static Pod:
- Choose a node where you want to run the static Pod. In this example, it's my-node1.
ssh my-node1
- Choose a directory, say /etc/kubelet.d, and place a web server Pod definition there, for example /etc/kubelet.d/static-web.yaml:
# Run this command on the node where kubelet is running
mkdir /etc/kubelet.d/
cat <<EOF >/etc/kubelet.d/static-web.yaml
apiVersion: v1
kind: Pod
metadata:
  name: static-web
  labels:
    role: myrole
spec:
  containers:
  - name: web
    image: nginx
    ports:
    - name: web
      containerPort: 80
      protocol: TCP
EOF
- Configure the kubelet on the node to use this directory by running it with the --pod-manifest-path=/etc/kubelet.d/ argument. On Fedora, edit /etc/kubernetes/kubelet to include this line:
KUBELET_ARGS="--cluster-dns=10.254.0.10 --cluster-domain=kube.local --pod-manifest-path=/etc/kubelet.d/"
Alternatively, add the staticPodPath: <the directory> field in the kubelet configuration file (see the sketch after these steps).
# Run this command on the node where the kubelet is running systemctl restart kubelet
Web-hosted static pod manifest
The kubelet periodically downloads a file specified by the --manifest-url=<URL> argument and interprets it as a JSON/YAML file that contains Pod definitions.
Similar to how filesystem-hosted manifests work, the kubelet
refetches the manifest on a schedule. If there are changes to the list of static
Pods, the kubelet applies them.
To use this approach:
- Create a YAML file and store it on a web server so that you can pass the URL of that file to the kubelet.
apiVersion: v1
kind: Pod
metadata:
  name: static-web
  labels:
    role: myrole
spec:
  containers:
  - name: web
    image: nginx
    ports:
    - name: web
      containerPort: 80
      protocol: TCP
- Configure the kubelet on your selected node to use this web manifest by running it with --manifest-url=<manifest-url>. On Fedora, edit /etc/kubernetes/kubelet to include this line:
KUBELET_ARGS="--cluster-dns=10.254.0.10 --cluster-domain=kube.local --manifest-url=<manifest-url>"
- Restart the kubelet. On Fedora, you would run:
# Run this command on the node where the kubelet is running
systemctl restart kubelet
Observe static pod behavior
When the kubelet starts, it automatically starts all defined static Pods. As you have defined a static Pod and restarted the kubelet, the new static Pod should already be running.
You can view running containers (including static Pods) by running (on the node):
# Run this command on the node where the kubelet is running
crictl ps
The output might be something like:
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID
129fd7d382018 docker.io/library/nginx@sha256:... 11 minutes ago Running web 0 34533c6729106
crictl outputs the image URI and SHA-256 checksum. NAME will look more like:
docker.io/library/nginx@sha256:0d17b565c37bcbd895e9d92315a05c1c3c9a29f762b011a10c54a66cd53c9b31.
You can see the mirror Pod on the API server:
kubectl get pods
NAME READY STATUS RESTARTS AGE
static-web 1/1 Running 0 2m
Labels from the static Pod are propagated into the mirror Pod. You can use those labels as normal via selectors, etc.
If you try to use kubectl to delete the mirror Pod from the API server, the kubelet doesn't remove the static Pod:
kubectl delete pod static-web
pod "static-web" deleted
You can see that the Pod is still running:
kubectl get pods
NAME READY STATUS RESTARTS AGE
static-web 1/1 Running 0 4s
Back on your node where the kubelet is running, you can try to stop the container manually. You'll see that, after a time, the kubelet will notice and will restart the Pod automatically:
# Run these commands on the node where the kubelet is running
crictl stop 129fd7d382018 # replace with the ID of your container
sleep 20
crictl ps
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID
89db4553e1eeb docker.io/library/nginx@sha256:... 19 seconds ago Running web 1 34533c6729106
Dynamic addition and removal of static pods
The running kubelet periodically scans the configured directory (/etc/kubelet.d in our example) for changes and adds/removes Pods as files appear/disappear in this directory.
# This assumes you are using filesystem-hosted static Pod configuration
# Run these commands on the node where the kubelet is running
#
mv /etc/kubelet.d/static-web.yaml /tmp
sleep 20
crictl ps
# You see that no nginx container is running
mv /tmp/static-web.yaml /etc/kubelet.d/
sleep 20
crictl ps
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID
f427638871c35 docker.io/library/nginx@sha256:... 19 seconds ago Running web 1 34533c6729106
4.3.22 - Translate a Docker Compose File to Kubernetes Resources
What's Kompose? It's a conversion tool for all things compose (namely Docker Compose) to container orchestrators (Kubernetes or OpenShift).
More information can be found on the Kompose website at http://kompose.io.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
To check the version, enter kubectl version.
Install Kompose
We have multiple ways to install Kompose. Our preferred method is downloading the binary from the latest GitHub release.
Kompose is released via GitHub on a three-week cycle; you can see all current releases on the GitHub release page.
# Linux
curl -L https://github.com/kubernetes/kompose/releases/download/v1.26.0/kompose-linux-amd64 -o kompose
# macOS
curl -L https://github.com/kubernetes/kompose/releases/download/v1.26.0/kompose-darwin-amd64 -o kompose
# Windows
curl -L https://github.com/kubernetes/kompose/releases/download/v1.26.0/kompose-windows-amd64.exe -o kompose.exe
chmod +x kompose
sudo mv ./kompose /usr/local/bin/kompose
Alternatively, you can download the tarball.
Installing using go get pulls from the master branch with the latest development changes.
go get -u github.com/kubernetes/kompose
Kompose is in the EPEL CentOS repository.
If you don't have the EPEL repository installed and enabled, you can do so by running sudo yum install epel-release.
If you have EPEL enabled in your system, you can install Kompose like any other package.
sudo yum -y install kompose
Kompose is in the Fedora 24, 25 and 26 repositories. You can install it like any other package.
sudo dnf -y install kompose
On macOS you can install the latest release via Homebrew:
brew install kompose
Use Kompose
In a few steps, we'll take you from Docker Compose to Kubernetes. All
you need is an existing docker-compose.yml
file.
- Go to the directory containing your docker-compose.yml file. If you don't have one, test using this one.
version: "2"

services:

  redis-master:
    image: k8s.gcr.io/redis:e2e
    ports:
    - "6379"

  redis-slave:
    image: gcr.io/google_samples/gb-redisslave:v3
    ports:
    - "6379"
    environment:
    - GET_HOSTS_FROM=dns

  frontend:
    image: gcr.io/google-samples/gb-frontend:v4
    ports:
    - "80:80"
    environment:
    - GET_HOSTS_FROM=dns
    labels:
      kompose.service.type: LoadBalancer
- To convert the docker-compose.yml file to files that you can use with kubectl, run kompose convert and then kubectl apply -f <output file>.
kompose convert
The output is similar to:
INFO Kubernetes file "frontend-service.yaml" created
INFO Kubernetes file "redis-master-service.yaml" created
INFO Kubernetes file "redis-slave-service.yaml" created
INFO Kubernetes file "frontend-deployment.yaml" created
INFO Kubernetes file "redis-master-deployment.yaml" created
INFO Kubernetes file "redis-slave-deployment.yaml" created
kubectl apply -f frontend-service.yaml,redis-master-service.yaml,redis-slave-service.yaml,frontend-deployment.yaml,redis-master-deployment.yaml,redis-slave-deployment.yaml
The output is similar to:
service/frontend created
service/redis-master created
service/redis-slave created
deployment.apps/frontend created
deployment.apps/redis-master created
deployment.apps/redis-slave created
Your deployments are running in Kubernetes.
- Access your application.
If you're already using minikube for your development process:
minikube service frontend
Otherwise, let's look up what IP your service is using!
kubectl describe svc frontend
Name: frontend
Namespace: default
Labels: service=frontend
Selector: service=frontend
Type: LoadBalancer
IP: 10.0.0.183
LoadBalancer Ingress: 192.0.2.89
Port: 80 80/TCP
NodePort: 80 31144/TCP
Endpoints: 172.17.0.4:80
Session Affinity: None
No events.
If you're using a cloud provider, your IP will be listed next to LoadBalancer Ingress.
curl http://192.0.2.89
User Guide
- CLI
- Documentation
Kompose has support for two providers: OpenShift and Kubernetes. You can choose a targeted provider using the global option --provider. If no provider is specified, Kubernetes is set by default.
kompose convert
Kompose supports conversion of V1, V2, and V3 Docker Compose files into Kubernetes and OpenShift objects.
Kubernetes kompose convert example
kompose --file docker-voting.yml convert
WARN Unsupported key networks - ignoring
WARN Unsupported key build - ignoring
INFO Kubernetes file "worker-svc.yaml" created
INFO Kubernetes file "db-svc.yaml" created
INFO Kubernetes file "redis-svc.yaml" created
INFO Kubernetes file "result-svc.yaml" created
INFO Kubernetes file "vote-svc.yaml" created
INFO Kubernetes file "redis-deployment.yaml" created
INFO Kubernetes file "result-deployment.yaml" created
INFO Kubernetes file "vote-deployment.yaml" created
INFO Kubernetes file "worker-deployment.yaml" created
INFO Kubernetes file "db-deployment.yaml" created
ls
db-deployment.yaml docker-compose.yml docker-gitlab.yml redis-deployment.yaml result-deployment.yaml vote-deployment.yaml worker-deployment.yaml
db-svc.yaml docker-voting.yml redis-svc.yaml result-svc.yaml vote-svc.yaml worker-svc.yaml
You can also provide multiple docker-compose files at the same time:
kompose -f docker-compose.yml -f docker-guestbook.yml convert
INFO Kubernetes file "frontend-service.yaml" created
INFO Kubernetes file "mlbparks-service.yaml" created
INFO Kubernetes file "mongodb-service.yaml" created
INFO Kubernetes file "redis-master-service.yaml" created
INFO Kubernetes file "redis-slave-service.yaml" created
INFO Kubernetes file "frontend-deployment.yaml" created
INFO Kubernetes file "mlbparks-deployment.yaml" created
INFO Kubernetes file "mongodb-deployment.yaml" created
INFO Kubernetes file "mongodb-claim0-persistentvolumeclaim.yaml" created
INFO Kubernetes file "redis-master-deployment.yaml" created
INFO Kubernetes file "redis-slave-deployment.yaml" created
ls
frontend-deployment.yaml mlbparks-deployment.yaml mongodb-deployment.yaml redis-master-deployment.yaml redis-slave-deployment.yaml
frontend-service.yaml mlbparks-service.yaml mongodb-service.yaml redis-master-service.yaml redis-slave-service.yaml
mongodb-claim0-persistentvolumeclaim.yaml
When multiple docker-compose files are provided, the configuration is merged. Any configuration that is shared between files is overridden by the subsequent file.
OpenShift kompose convert example
kompose --provider openshift --file docker-voting.yml convert
WARN [worker] Service cannot be created because of missing port.
INFO OpenShift file "vote-service.yaml" created
INFO OpenShift file "db-service.yaml" created
INFO OpenShift file "redis-service.yaml" created
INFO OpenShift file "result-service.yaml" created
INFO OpenShift file "vote-deploymentconfig.yaml" created
INFO OpenShift file "vote-imagestream.yaml" created
INFO OpenShift file "worker-deploymentconfig.yaml" created
INFO OpenShift file "worker-imagestream.yaml" created
INFO OpenShift file "db-deploymentconfig.yaml" created
INFO OpenShift file "db-imagestream.yaml" created
INFO OpenShift file "redis-deploymentconfig.yaml" created
INFO OpenShift file "redis-imagestream.yaml" created
INFO OpenShift file "result-deploymentconfig.yaml" created
INFO OpenShift file "result-imagestream.yaml" created
It also supports creating a buildconfig for the build directive in a service. By default, it uses the remote repo for the current git branch as the source repo, and the current branch as the source branch for the build. You can specify a different source repo and branch using the --build-repo and --build-branch options respectively.
kompose --provider openshift --file buildconfig/docker-compose.yml convert
WARN [foo] Service cannot be created because of missing port.
INFO OpenShift Buildconfig using git@github.com:rtnpro/kompose.git::master as source.
INFO OpenShift file "foo-deploymentconfig.yaml" created
INFO OpenShift file "foo-imagestream.yaml" created
INFO OpenShift file "foo-buildconfig.yaml" created
If you are manually pushing the OpenShift artifacts using oc create -f, you need to ensure that you push the imagestream artifact before the buildconfig artifact, to work around this OpenShift issue: https://github.com/openshift/origin/issues/4518 .
Alternative Conversions
The default kompose transformation generates Kubernetes Deployments and Services, in yaml format. You have the alternative option of generating json with -j. You can also generate Replication Controller objects, Daemon Sets, or Helm charts.
kompose convert -j
INFO Kubernetes file "redis-svc.json" created
INFO Kubernetes file "web-svc.json" created
INFO Kubernetes file "redis-deployment.json" created
INFO Kubernetes file "web-deployment.json" created
The *-deployment.json files contain the Deployment objects.
kompose convert --replication-controller
INFO Kubernetes file "redis-svc.yaml" created
INFO Kubernetes file "web-svc.yaml" created
INFO Kubernetes file "redis-replicationcontroller.yaml" created
INFO Kubernetes file "web-replicationcontroller.yaml" created
The *-replicationcontroller.yaml files contain the Replication Controller objects. If you want to specify replicas (default is 1), use the --replicas flag:
kompose convert --replication-controller --replicas 3
kompose convert --daemon-set
INFO Kubernetes file "redis-svc.yaml" created
INFO Kubernetes file "web-svc.yaml" created
INFO Kubernetes file "redis-daemonset.yaml" created
INFO Kubernetes file "web-daemonset.yaml" created
The *-daemonset.yaml files contain the DaemonSet objects.
If you want to generate a Chart to be used with Helm run:
kompose convert -c
INFO Kubernetes file "web-svc.yaml" created
INFO Kubernetes file "redis-svc.yaml" created
INFO Kubernetes file "web-deployment.yaml" created
INFO Kubernetes file "redis-deployment.yaml" created
chart created in "./docker-compose/"
tree docker-compose/
docker-compose
├── Chart.yaml
├── README.md
└── templates
├── redis-deployment.yaml
├── redis-svc.yaml
├── web-deployment.yaml
└── web-svc.yaml
The chart structure is aimed at providing a skeleton for building your Helm charts.
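As a quick usage sketch (assuming Helm 3 is installed; the release name my-app is an arbitrary placeholder), you could install the generated chart directly from that directory:
helm install my-app ./docker-compose/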
Labels
kompose supports Kompose-specific labels within the docker-compose.yml file in order to explicitly define a service's behavior upon conversion.
kompose.service.type defines the type of service to be created.
For example:
version: "2"
services:
nginx:
image: nginx
dockerfile: foobar
build: ./foobar
cap_add:
- ALL
container_name: foobar
labels:
kompose.service.type: nodeport
kompose.service.expose defines if the service needs to be made accessible from outside the cluster or not. If the value is set to "true", the provider sets the endpoint automatically; for any other value, the value is set as the hostname. If multiple ports are defined in a service, the first one is chosen to be the exposed port.
- For the Kubernetes provider, an ingress resource is created, and it is assumed that an ingress controller has already been configured.
- For the OpenShift provider, a route is created.
For example:
version: "2"
services:
web:
image: tuna/docker-counter23
ports:
- "5000:5000"
links:
- redis
labels:
kompose.service.expose: "counter.example.com"
redis:
image: redis:3.0
ports:
- "6379"
The currently supported options are:
Key | Value |
---|---|
kompose.service.type | nodeport / clusterip / loadbalancer |
kompose.service.expose | true / hostname |
The kompose.service.type label should be defined with ports only; otherwise kompose will fail.
Restart
If you want to create normal pods without controllers, you can use the restart construct of docker-compose to define that. Follow the table below to see what happens for each restart value.
docker-compose restart | object created | Pod restartPolicy |
---|---|---|
"" | controller object | Always |
always | controller object | Always |
on-failure | Pod | OnFailure |
no | Pod | Never |
The controller object could be deployment or replicationcontroller.
For example, the pival service below will become a pod. This container calculates the value of pi.
version: '2'
services:
pival:
image: perl
command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
restart: "on-failure"
Warning about Deployment Configurations
If the Docker Compose file has a volume specified for a service, the Deployment (Kubernetes) or DeploymentConfig (OpenShift) strategy is changed to "Recreate" instead of "RollingUpdate" (default). This is done to avoid multiple instances of a service from accessing a volume at the same time.
If the Docker Compose file has a service name with _ in it (e.g. web_service), then it will be replaced by - and the service name will be renamed accordingly (e.g. web-service). Kompose does this because Kubernetes doesn't allow _ in object names.
Please note that changing the service name might break some docker-compose files.
Docker Compose Versions
Kompose supports Docker Compose versions: 1, 2 and 3. We have limited support on versions 2.1 and 3.2 due to their experimental nature.
A full list of compatibility between all three versions is available in our conversion document, including a list of all incompatible Docker Compose keys.
4.3.23 - Enforce Pod Security Standards by Configuring the Built-in Admission Controller
As of v1.22, Kubernetes provides a built-in admission controller to enforce the Pod Security Standards. You can configure this admission controller to set cluster-wide defaults and exemptions.
Before you begin
Your Kubernetes server must be at or later than version v1.22.
To check the version, enter kubectl version.
- Ensure the PodSecurity feature gate is enabled.
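As a sketch of what enabling the gate looks like on the API server (in v1.22 the PodSecurity feature gate is alpha and off by default; merge the value with any feature gates you already set):
kube-apiserver --feature-gates="PodSecurity=true" <other flags>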
Configure the Admission Controller
For Kubernetes v1.23 and later, use the v1beta1 configuration API:
apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
- name: PodSecurity
configuration:
apiVersion: pod-security.admission.config.k8s.io/v1beta1
kind: PodSecurityConfiguration
# Defaults applied when a mode label is not set.
#
# Level label values must be one of:
# - "privileged" (default)
# - "baseline"
# - "restricted"
#
# Version label values must be one of:
# - "latest" (default)
# - specific version like "v1.23"
defaults:
enforce: "privileged"
enforce-version: "latest"
audit: "privileged"
audit-version: "latest"
warn: "privileged"
warn-version: "latest"
exemptions:
# Array of authenticated usernames to exempt.
usernames: []
# Array of runtime class names to exempt.
runtimeClasses: []
# Array of namespaces to exempt.
namespaces: []
For Kubernetes v1.22, use the v1alpha1 configuration API:
apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
- name: PodSecurity
configuration:
apiVersion: pod-security.admission.config.k8s.io/v1alpha1
kind: PodSecurityConfiguration
# Defaults applied when a mode label is not set.
#
# Level label values must be one of:
# - "privileged" (default)
# - "baseline"
# - "restricted"
#
# Version label values must be one of:
# - "latest" (default)
# - specific version like "v1.23"
defaults:
enforce: "privileged"
enforce-version: "latest"
audit: "privileged"
audit-version: "latest"
warn: "privileged"
warn-version: "latest"
exemptions:
# Array of authenticated usernames to exempt.
usernames: []
# Array of runtime class names to exempt.
runtimeClasses: []
# Array of namespaces to exempt.
namespaces: []
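To put this configuration into effect, pass the file to the API server with the --admission-control-config-file flag. A minimal sketch, assuming you saved the file at the hypothetical path /etc/kubernetes/pod-security-admission.yaml on the control plane host:
kube-apiserver --admission-control-config-file=/etc/kubernetes/pod-security-admission.yaml <other flags>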
4.3.24 - Enforce Pod Security Standards with Namespace Labels
Namespaces can be labeled to enforce the Pod Security Standards.
Before you begin
Your Kubernetes server must be at or later than version v1.22.
To check the version, enter kubectl version.
- Ensure the PodSecurity feature gate is enabled.
Requiring the baseline Pod Security Standard with namespace labels
This manifest defines a Namespace my-baseline-namespace that:
- Blocks any pods that don't satisfy the baseline policy requirements.
- Generates a user-facing warning and adds an audit annotation to any created pod that does not meet the restricted policy requirements.
- Pins the versions of the baseline and restricted policies to v1.23.
apiVersion: v1
kind: Namespace
metadata:
name: my-baseline-namespace
labels:
pod-security.kubernetes.io/enforce: baseline
pod-security.kubernetes.io/enforce-version: v1.23
# We are setting these to our _desired_ `enforce` level.
pod-security.kubernetes.io/audit: restricted
pod-security.kubernetes.io/audit-version: v1.23
pod-security.kubernetes.io/warn: restricted
pod-security.kubernetes.io/warn-version: v1.23
Add labels to existing namespaces with kubectl label
When an enforce policy (or version) label is added or changed, the admission plugin will test each pod in the namespace against the new policy. Violations are returned to the user as warnings.
It is helpful to apply the --dry-run flag when initially evaluating security profile changes for namespaces. The Pod Security Standard checks will still be run in dry run mode, giving you information about how the new policy would treat existing pods, without actually updating the policy.
kubectl label --dry-run=server --overwrite ns --all \
pod-security.kubernetes.io/enforce=baseline
Applying to all namespaces
If you're just getting started with the Pod Security Standards, a suitable first step would be to configure all namespaces with audit annotations for a stricter level such as baseline:
kubectl label --overwrite ns --all \
pod-security.kubernetes.io/audit=baseline \
pod-security.kubernetes.io/warn=baseline
Note that this is not setting an enforce level, so that namespaces that haven't been explicitly evaluated can be distinguished. You can list namespaces without an explicitly set enforce level using this command:
kubectl get namespaces --selector='!pod-security.kubernetes.io/enforce'
Applying to a single namespace
You can update a specific namespace as well. This command adds the enforce=restricted policy to my-existing-namespace, pinning the restricted policy version to v1.23.
kubectl label --overwrite ns my-existing-namespace \
pod-security.kubernetes.io/enforce=restricted \
pod-security.kubernetes.io/enforce-version=v1.23
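To confirm the labels were applied as expected, you can inspect the namespace; --show-labels prints all labels on the object:
kubectl get ns my-existing-namespace --show-labels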
4.3.25 - Migrate from PodSecurityPolicy to the Built-In PodSecurity Admission Controller
This page describes the process of migrating from PodSecurityPolicies to the built-in PodSecurity admission controller. This can be done effectively using a combination of dry-run and audit and warn modes, although this becomes harder if mutating PSPs are used.
Before you begin
Your Kubernetes server must be at or later than version v1.22.
To check the version, enter kubectl version.
- Ensure the PodSecurity feature gate is enabled.
This page assumes you are already familiar with the basic Pod Security Admission concepts.
Overall approach
There are multiple strategies you can take for migrating from PodSecurityPolicy to Pod Security Admission. The following steps are one possible migration path, with a goal of minimizing both the risks of a production outage and of a security gap.
- Decide whether Pod Security Admission is the right fit for your use case.
- Review namespace permissions
- Simplify & standardize PodSecurityPolicies
- Update namespaces
- Identify an appropriate Pod Security level
- Verify the Pod Security level
- Enforce the Pod Security level
- Bypass PodSecurityPolicy
- Review namespace creation processes
- Disable PodSecurityPolicy
0. Decide whether Pod Security Admission is right for you
Pod Security Admission was designed to meet the most common security needs out of the box, and to provide a standard set of security levels across clusters. However, it is less flexible than PodSecurityPolicy. Notably, the following features are supported by PodSecurityPolicy but not Pod Security Admission:
- Setting default security constraints - Pod Security Admission is a non-mutating admission controller, meaning it won't modify pods before validating them. If you were relying on this aspect of PSP, you will need to either modify your workloads to meet the Pod Security constraints, or use a Mutating Admission Webhook to make those changes. See Simplify & Standardize PodSecurityPolicies below for more detail.
- Fine-grained control over policy definition - Pod Security Admission only supports 3 standard levels. If you require more control over specific constraints, then you will need to use a Validating Admission Webhook to enforce those policies.
- Sub-namespace policy granularity - PodSecurityPolicy lets you bind different policies to different Service Accounts or users, even within a single namespace. This approach has many pitfalls and is not recommended, but if you require this feature anyway you will need to use a 3rd party webhook instead. The exception to this is if you only need to completely exempt specific users or RuntimeClasses, in which case Pod Security Admission does expose some static configuration for exemptions.
Even if Pod Security Admission does not meet all of your needs it was designed to be complementary to other policy enforcement mechanisms, and can provide a useful fallback running alongside other admission webhooks.
1. Review namespace permissions
Pod Security Admission is controlled by labels on namespaces. This means that anyone who can update (or patch or create) a namespace can also modify the Pod Security level for that namespace, which could be used to bypass a more restrictive policy. Before proceeding, ensure that only trusted, privileged users have these namespace permissions. It is not recommended to grant these powerful permissions to users that shouldn't have elevated permissions, but if you must you will need to use an admission webhook to place additional restrictions on setting Pod Security labels on Namespace objects.
2. Simplify & standardize PodSecurityPolicies
In this section, you will reduce mutating PodSecurityPolicies and remove options that are outside
the scope of the Pod Security Standards. You should make the changes recommended here to an offline
copy of the original PodSecurityPolicy being modified. The cloned PSP should have a different
name that is alphabetically before the original (for example, prepend a 0
to it). Do not create the
new policies in Kubernetes yet - that will be covered in the Rollout the updated
policies section below.
2.a. Eliminate purely mutating fields
If a PodSecurityPolicy is mutating pods, then you could end up with pods that don't meet the Pod Security level requirements when you finally turn PodSecurityPolicy off. In order to avoid this, you should eliminate all PSP mutation prior to switching over. Unfortunately PSP does not cleanly separate mutating & validating fields, so this is not a straightforward migration.
You can start by eliminating the fields that are purely mutating, and don't have any bearing on the validating policy. These fields (also listed in the Mapping PodSecurityPolicies to Pod Security Standards reference) are:
.spec.defaultAllowPrivilegeEscalation
.spec.runtimeClass.defaultRuntimeClassName
.metadata.annotations['seccomp.security.alpha.kubernetes.io/defaultProfileName']
.metadata.annotations['apparmor.security.beta.kubernetes.io/defaultProfileName']
.spec.defaultAddCapabilities - Although technically a mutating & validating field, these should be merged into .spec.allowedCapabilities, which performs the same validation without mutation.
2.b. Eliminate options not covered by the Pod Security Standards
There are several fields in PodSecurityPolicy that are not covered by the Pod Security Standards. If you must enforce these options, you will need to supplement Pod Security Admission with an admission webhook, which is outside the scope of this guide.
First, you can remove the purely validating fields that the Pod Security Standards do not cover. These fields (also listed in the Mapping PodSecurityPolicies to Pod Security Standards reference with "no opinion") are:
.spec.allowedHostPaths
.spec.allowedFlexVolumes
.spec.allowedCSIDrivers
.spec.forbiddenSysctls
.spec.runtimeClass
You can also remove the following fields, which are related to POSIX / UNIX group controls.
Warning: if these fields use the MustRunAs strategy they may be mutating! Removing these could result in workloads not setting the required groups, and cause problems. See Rollout the updated policies below for advice on how to roll these changes out safely.
.spec.runAsGroup
.spec.supplementalGroups
.spec.fsGroup
The remaining mutating fields are required to properly support the Pod Security Standards, and will need to be handled on a case-by-case basis later:
.spec.requiredDropCapabilities - Required to drop ALL for the Restricted profile.
.spec.seLinux - (Only mutating with the MustRunAs rule) required to enforce the SELinux requirements of the Baseline & Restricted profiles.
.spec.runAsUser - (Non-mutating with the RunAsAny rule) required to enforce RunAsNonRoot for the Restricted profile.
.spec.allowPrivilegeEscalation - (Only mutating if set to false) required for the Restricted profile.
2.c. Rollout the updated PSPs
Next, you can rollout the updated policies to your cluster. You should proceed with caution, as removing the mutating options may result in workloads missing required configuration.
For each updated PodSecurityPolicy:
- Identify pods running under the original PSP. This can be done using the kubernetes.io/psp annotation. For example, using kubectl:
PSP_NAME="original" # Set the name of the PSP you're checking for
kubectl get pods --all-namespaces -o jsonpath="{range .items[?(@.metadata.annotations.kubernetes\.io\/psp=='$PSP_NAME')]}{.metadata.namespace} {.metadata.name}{'\n'}{end}"
- Compare these running pods against the original pod spec to determine whether PodSecurityPolicy
has modified the pod. For pods created by a workload resource
you can compare the pod with the PodTemplate in the controller resource. If any changes are
identified, the original Pod or PodTemplate should be updated with the desired configuration.
The fields to review are:
.metadata.annotations['container.apparmor.security.beta.kubernetes.io/*'] (replace * with each container name)
.spec.runtimeClassName
.spec.securityContext.fsGroup
.spec.securityContext.seccompProfile
.spec.securityContext.seLinuxOptions
.spec.securityContext.supplementalGroups
- On containers, under .spec.containers[*] and .spec.initContainers[*]:
.securityContext.allowPrivilegeEscalation
.securityContext.capabilities.add
.securityContext.capabilities.drop
.securityContext.readOnlyRootFilesystem
.securityContext.runAsGroup
.securityContext.runAsNonRoot
.securityContext.runAsUser
.securityContext.seccompProfile
.securityContext.seLinuxOptions
- Create the new PodSecurityPolicies. If any Roles or ClusterRoles are granting use on all PSPs, this could cause the new PSPs to be used instead of their mutating counterparts.
- Update your authorization to grant access to the new PSPs. In RBAC this means updating any Roles or ClusterRoles that grant the use permission on the original PSP to also grant it to the updated PSP (see the RBAC sketch after this list).
- (optional) Once you have verified that the original PSPs are no longer in use, you can delete them.
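As a minimal RBAC sketch of that authorization update (the PSP names original and 0original are hypothetical placeholders for your original and cloned policies), a ClusterRole granting use on both might look like:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: psp-original
rules:
- apiGroups: ['policy']
  resources: ['podsecuritypolicies']
  resourceNames: ['original', '0original'] # hypothetical original and cloned PSP names
  verbs: ['use']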
3. Update Namespaces
The following steps will need to be performed on every namespace in the cluster. Commands referenced
in these steps use the $NAMESPACE
variable to refer to the namespace being updated.
3.a. Identify an appropriate Pod Security level
Start reviewing the Pod Security Standards and familiarizing yourself with the 3 different levels.
There are several ways to choose a Pod Security level for your namespace:
- By security requirements for the namespace - If you are familiar with the expected access level for the namespace, you can choose an appropriate level based on those requirements, similar to how one might approach this on a new cluster.
- By existing PodSecurityPolicies - Using the
Mapping PodSecurityPolicies to Pod Security Standards
reference you can map each
PSP to a Pod Security Standard level. If your PSPs aren't based on the Pod Security Standards, you
may need to decide between choosing a level that is at least as permissive as the PSP, and a
level that is at least as restrictive. You can see which PSPs are in use for pods in a given
namespace with this command:
kubectl get pods -n $NAMESPACE -o jsonpath="{.items[*].metadata.annotations.kubernetes\.io\/psp}" | tr " " "\n" | sort -u
- By existing pods - Using the strategies under Verify the Pod Security level, you can test out both the Baseline and Restricted levels to see whether they are sufficiently permissive for existing workloads, and choose the least-privileged valid level.
3.b. Verify the Pod Security level
Once you have selected a Pod Security level for the namespace (or if you're trying several), it's a good idea to test it out first (you can skip this step if using the Privileged level). Pod Security includes several tools to help test and safely roll out profiles.
First, you can dry-run the policy, which will evaluate pods currently running in the namespace against the applied policy, without making the new policy take effect:
# $LEVEL is the level to dry-run, either "baseline" or "restricted".
kubectl label --dry-run=server --overwrite ns $NAMESPACE pod-security.kubernetes.io/enforce=$LEVEL
This command will return a warning for any existing pods that are not valid under the proposed level.
The second option is better for catching workloads that are not currently running: audit mode. When running under audit-mode (as opposed to enforcing), pods that violate the policy level are recorded in the audit logs, which can be reviewed later after some soak time, but are not forbidden. Warning mode works similarly, but returns the warning to the user immediately. You can set the audit level on a namespace with this command:
kubectl label --overwrite ns $NAMESPACE pod-security.kubernetes.io/audit=$LEVEL
If either of these approaches yield unexpected violations, you will need to either update the violating workloads to meet the policy requirements, or relax the namespace Pod Security level.
3.c. Enforce the Pod Security level
When you are satisfied that the chosen level can safely be enforced on the namespace, you can update the namespace to enforce the desired level:
kubectl label --overwrite ns $NAMESPACE pod-security.kubernetes.io/enforce=$LEVEL
3.d. Bypass PodSecurityPolicy
Finally, you can effectively bypass PodSecurityPolicy at the namespace level by binding the fully privileged PSP to all service accounts in the namespace.
# The following cluster-scoped commands are only needed once.
kubectl apply -f privileged-psp.yaml
kubectl create clusterrole privileged-psp --verb use --resource podsecuritypolicies.policy --resource-name privileged
# Per-namespace disable
kubectl create -n $NAMESPACE rolebinding disable-psp --clusterrole privileged-psp --group system:serviceaccounts:$NAMESPACE
Since the privileged PSP is non-mutating, and the PSP admission controller always prefers non-mutating PSPs, this will ensure that pods in this namespace are no longer being modified or restricted by PodSecurityPolicy.
The advantage to disabling PodSecurityPolicy on a per-namespace basis like this is if a problem arises you can easily roll the change back by deleting the RoleBinding. Just make sure the pre-existing PodSecurityPolicies are still in place!
# Undo PodSecurityPolicy disablement.
kubectl delete -n $NAMESPACE rolebinding disable-psp
4. Review namespace creation processes
Now that existing namespaces have been updated to enforce Pod Security Admission, you should ensure that your processes and/or policies for creating new namespaces are updated to ensure that an appropriate Pod Security profile is applied to new namespaces.
You can also statically configure the Pod Security admission controller to set a default enforce, audit, and/or warn level for unlabeled namespaces. See Configure the Admission Controller for more information.
5. Disable PodSecurityPolicy
Finally, you're ready to disable PodSecurityPolicy. To do so, you will need to modify the admission configuration of the API server: How do I turn off an admission controller?.
To verify that the PodSecurityPolicy admission controller is no longer enabled, you can manually run a test by impersonating a user without access to any PodSecurityPolicies (see the PodSecurityPolicy example), or by verifying in the API server logs. At startup, the API server outputs log lines listing the loaded admission controller plugins:
I0218 00:59:44.903329 13 plugins.go:158] Loaded 16 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,ExtendedResourceToleration,PersistentVolumeLabel,DefaultStorageClass,StorageObjectInUseProtection,RuntimeClass,DefaultIngressClass,MutatingAdmissionWebhook.
I0218 00:59:44.903350 13 plugins.go:161] Loaded 14 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,PodSecurity,Priority,PersistentVolumeClaimResize,RuntimeClass,CertificateApproval,CertificateSigning,CertificateSubjectRestriction,DenyServiceExternalIPs,ValidatingAdmissionWebhook,ResourceQuota.
You should see PodSecurity (in the validating admission controllers), and neither list should contain PodSecurityPolicy.
Once you are certain the PSP admission controller is disabled (and after sufficient soak time to be confident you won't need to roll back), you are free to delete your PodSecurityPolicies and any associated Roles, ClusterRoles, RoleBindings and ClusterRoleBindings (just make sure they don't grant any other unrelated permissions).
4.4 - Manage Kubernetes Objects
4.4.1 - Declarative Management of Kubernetes Objects Using Configuration Files
Kubernetes objects can be created, updated, and deleted by storing multiple object configuration files in a directory and using kubectl apply to recursively create and update those objects as needed. This method retains writes made to live objects without merging the changes back into the object configuration files. kubectl diff also gives you a preview of what changes apply will make.
Before you begin
Install kubectl.
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
To check the version, enter kubectl version.
Trade-offs
The kubectl tool supports three kinds of object management:
- Imperative commands
- Imperative object configuration
- Declarative object configuration
See Kubernetes Object Management for a discussion of the advantages and disadvantages of each kind of object management.
Overview
Declarative object configuration requires a firm understanding of the Kubernetes object definitions and configuration. Read and complete the following documents if you have not already:
- Managing Kubernetes Objects Using Imperative Commands
- Imperative Management of Kubernetes Objects Using Configuration Files
Following are definitions for terms used in this document:
- object configuration file / configuration file: A file that defines the configuration for a Kubernetes object. This topic shows how to pass configuration files to kubectl apply. Configuration files are typically stored in source control, such as Git.
- declarative configuration writer / declarative writer: A person or software component that makes updates to a live object. The live writers referred to in this topic make changes to object configuration files and run kubectl apply to write the changes.
How to create objects
Use kubectl apply to create all objects, except those that already exist, defined by configuration files in a specified directory:
kubectl apply -f <directory>/
This sets the kubectl.kubernetes.io/last-applied-configuration: '{...}' annotation on each object. The annotation contains the contents of the object configuration file that was used to create the object.
Add the -R flag to recursively process directories.
Here's an example of an object configuration file:
apiVersion: apps/v1
kind: Deployment
metadata:
name: nginx-deployment
spec:
selector:
matchLabels:
app: nginx
minReadySeconds: 5
template:
metadata:
labels:
app: nginx
spec:
containers:
- name: nginx
image: nginx:1.14.2
ports:
- containerPort: 80
Run kubectl diff to print the object that will be created:
kubectl diff -f https://k8s.io/examples/application/simple_deployment.yaml
diff uses server-side dry-run, which needs to be enabled on kube-apiserver.
Since diff performs a server-side apply request in dry-run mode, it requires granting PATCH, CREATE, and UPDATE permissions. See Dry-Run Authorization for details.
Create the object using kubectl apply:
kubectl apply -f https://k8s.io/examples/application/simple_deployment.yaml
Print the live configuration using kubectl get:
kubectl get -f https://k8s.io/examples/application/simple_deployment.yaml -o yaml
The output shows that the kubectl.kubernetes.io/last-applied-configuration annotation was written to the live configuration, and it matches the configuration file:
kind: Deployment
metadata:
annotations:
# ...
# This is the json representation of simple_deployment.yaml
# It was written by kubectl apply when the object was created
kubectl.kubernetes.io/last-applied-configuration: |
{"apiVersion":"apps/v1","kind":"Deployment",
"metadata":{"annotations":{},"name":"nginx-deployment","namespace":"default"},
"spec":{"minReadySeconds":5,"selector":{"matchLabels":{"app":nginx}},"template":{"metadata":{"labels":{"app":"nginx"}},
"spec":{"containers":[{"image":"nginx:1.14.2","name":"nginx",
"ports":[{"containerPort":80}]}]}}}}
# ...
spec:
# ...
minReadySeconds: 5
selector:
matchLabels:
# ...
app: nginx
template:
metadata:
# ...
labels:
app: nginx
spec:
containers:
- image: nginx:1.14.2
# ...
name: nginx
ports:
- containerPort: 80
# ...
# ...
# ...
# ...
How to update objects
You can also use kubectl apply to update all objects defined in a directory, even if those objects already exist. This approach accomplishes the following:
- Sets fields that appear in the configuration file in the live configuration.
- Clears fields removed from the configuration file in the live configuration.
kubectl diff -f <directory>/
kubectl apply -f <directory>/
Add the -R flag to recursively process directories.
Here's an example configuration file:
apiVersion: apps/v1
kind: Deployment
metadata:
name: nginx-deployment
spec:
selector:
matchLabels:
app: nginx
minReadySeconds: 5
template:
metadata:
labels:
app: nginx
spec:
containers:
- name: nginx
image: nginx:1.14.2
ports:
- containerPort: 80
Create the object using kubectl apply:
kubectl apply -f https://k8s.io/examples/application/simple_deployment.yaml
Print the live configuration using kubectl get:
kubectl get -f https://k8s.io/examples/application/simple_deployment.yaml -o yaml
The output shows that the kubectl.kubernetes.io/last-applied-configuration annotation was written to the live configuration, and it matches the configuration file:
kind: Deployment
metadata:
annotations:
# ...
# This is the json representation of simple_deployment.yaml
# It was written by kubectl apply when the object was created
kubectl.kubernetes.io/last-applied-configuration: |
{"apiVersion":"apps/v1","kind":"Deployment",
"metadata":{"annotations":{},"name":"nginx-deployment","namespace":"default"},
"spec":{"minReadySeconds":5,"selector":{"matchLabels":{"app":nginx}},"template":{"metadata":{"labels":{"app":"nginx"}},
"spec":{"containers":[{"image":"nginx:1.14.2","name":"nginx",
"ports":[{"containerPort":80}]}]}}}}
# ...
spec:
# ...
minReadySeconds: 5
selector:
matchLabels:
# ...
app: nginx
template:
metadata:
# ...
labels:
app: nginx
spec:
containers:
- image: nginx:1.14.2
# ...
name: nginx
ports:
- containerPort: 80
# ...
# ...
# ...
# ...
Directly update the replicas field in the live configuration by using kubectl scale. This does not use kubectl apply:
kubectl scale deployment/nginx-deployment --replicas=2
Print the live configuration using kubectl get:
kubectl get deployment nginx-deployment -o yaml
The output shows that the replicas field has been set to 2, and the last-applied-configuration annotation does not contain a replicas field:
apiVersion: apps/v1
kind: Deployment
metadata:
annotations:
# ...
# note that the annotation does not contain replicas
# because it was not updated through apply
kubectl.kubernetes.io/last-applied-configuration: |
{"apiVersion":"apps/v1","kind":"Deployment",
"metadata":{"annotations":{},"name":"nginx-deployment","namespace":"default"},
"spec":{"minReadySeconds":5,"selector":{"matchLabels":{"app":nginx}},"template":{"metadata":{"labels":{"app":"nginx"}},
"spec":{"containers":[{"image":"nginx:1.14.2","name":"nginx",
"ports":[{"containerPort":80}]}]}}}}
# ...
spec:
replicas: 2 # written by scale
# ...
minReadySeconds: 5
selector:
matchLabels:
# ...
app: nginx
template:
metadata:
# ...
labels:
app: nginx
spec:
containers:
- image: nginx:1.14.2
# ...
name: nginx
ports:
- containerPort: 80
# ...
Update the simple_deployment.yaml configuration file to change the image from nginx:1.14.2 to nginx:1.16.1, and delete the minReadySeconds field:
apiVersion: apps/v1
kind: Deployment
metadata:
name: nginx-deployment
spec:
selector:
matchLabels:
app: nginx
template:
metadata:
labels:
app: nginx
spec:
containers:
- name: nginx
image: nginx:1.16.1 # update the image
ports:
- containerPort: 80
Apply the changes made to the configuration file:
kubectl diff -f https://k8s.io/examples/application/update_deployment.yaml
kubectl apply -f https://k8s.io/examples/application/update_deployment.yaml
Print the live configuration using kubectl get:
kubectl get -f https://k8s.io/examples/application/update_deployment.yaml -o yaml
The output shows the following changes to the live configuration:
- The replicas field retains the value of 2 set by kubectl scale. This is possible because it is omitted from the configuration file.
- The image field has been updated to nginx:1.16.1 from nginx:1.14.2.
- The last-applied-configuration annotation has been updated with the new image.
- The minReadySeconds field has been cleared.
- The last-applied-configuration annotation no longer contains the minReadySeconds field.
apiVersion: apps/v1
kind: Deployment
metadata:
annotations:
# ...
# The annotation contains the updated image to nginx 1.16.1,
# but does not contain the updated replicas to 2
kubectl.kubernetes.io/last-applied-configuration: |
{"apiVersion":"apps/v1","kind":"Deployment",
"metadata":{"annotations":{},"name":"nginx-deployment","namespace":"default"},
"spec":{"selector":{"matchLabels":{"app":nginx}},"template":{"metadata":{"labels":{"app":"nginx"}},
"spec":{"containers":[{"image":"nginx:1.16.1","name":"nginx",
"ports":[{"containerPort":80}]}]}}}}
# ...
spec:
replicas: 2 # Set by `kubectl scale`. Ignored by `kubectl apply`.
# minReadySeconds cleared by `kubectl apply`
# ...
selector:
matchLabels:
# ...
app: nginx
template:
metadata:
# ...
labels:
app: nginx
spec:
containers:
- image: nginx:1.16.1 # Set by `kubectl apply`
# ...
name: nginx
ports:
- containerPort: 80
# ...
# ...
# ...
# ...
Mixing kubectl apply with the imperative object configuration commands create and replace is not supported. This is because create and replace do not retain the kubectl.kubernetes.io/last-applied-configuration annotation that kubectl apply uses to compute updates.
How to delete objects
There are two approaches to delete objects managed by kubectl apply.
Recommended: kubectl delete -f <filename>
Manually deleting objects using the imperative command is the recommended approach, as it is more explicit about what is being deleted, and less likely to result in the user deleting something unintentionally:
kubectl delete -f <filename>
Alternative: kubectl apply -f <directory/> --prune -l your=label
Only use this if you know what you are doing.
kubectl apply --prune is in alpha, and backwards incompatible changes might be introduced in subsequent releases.
As an alternative to kubectl delete, you can use kubectl apply to identify objects to be deleted after their configuration files have been removed from the directory. Apply with --prune queries the API server for all objects matching a set of labels, and attempts to match the returned live object configurations against the object configuration files. If an object matches the query, and it does not have a configuration file in the directory, and it has a last-applied-configuration annotation, it is deleted.
kubectl apply -f <directory/> --prune -l <labels>
Apply with --prune should only be run against the root directory containing the object configuration files. Running against sub-directories can cause objects to be unintentionally deleted if they are returned by the label selector query specified with -l <labels> and do not appear in the subdirectory.
How to view an object
You can use kubectl get with -o yaml to view the configuration of a live object:
kubectl get -f <filename|url> -o yaml
How apply calculates differences and merges changes
When kubectl apply updates the live configuration for an object, it does so by sending a patch request to the API server. The patch defines updates scoped to specific fields of the live object configuration. The kubectl apply command calculates this patch request using the configuration file, the live configuration, and the last-applied-configuration annotation stored in the live configuration.
Merge patch calculation
The kubectl apply command writes the contents of the configuration file to the kubectl.kubernetes.io/last-applied-configuration annotation. This is used to identify fields that have been removed from the configuration file and need to be cleared from the live configuration. Here are the steps used to calculate which fields should be deleted or set:
- Calculate the fields to delete. These are the fields present in last-applied-configuration and missing from the configuration file.
and missing from the configuration file. - Calculate the fields to add or set. These are the fields present in the configuration file whose values don't match the live configuration.
Here's an example. Suppose this is the configuration file for a Deployment object:
apiVersion: apps/v1
kind: Deployment
metadata:
name: nginx-deployment
spec:
selector:
matchLabels:
app: nginx
template:
metadata:
labels:
app: nginx
spec:
containers:
- name: nginx
image: nginx:1.16.1 # update the image
ports:
- containerPort: 80
Also, suppose this is the live configuration for the same Deployment object:
apiVersion: apps/v1
kind: Deployment
metadata:
annotations:
# ...
# note that the annotation does not contain replicas
# because it was not updated through apply
kubectl.kubernetes.io/last-applied-configuration: |
{"apiVersion":"apps/v1","kind":"Deployment",
"metadata":{"annotations":{},"name":"nginx-deployment","namespace":"default"},
"spec":{"minReadySeconds":5,"selector":{"matchLabels":{"app":nginx}},"template":{"metadata":{"labels":{"app":"nginx"}},
"spec":{"containers":[{"image":"nginx:1.14.2","name":"nginx",
"ports":[{"containerPort":80}]}]}}}}
# ...
spec:
replicas: 2 # written by scale
# ...
minReadySeconds: 5
selector:
matchLabels:
# ...
app: nginx
template:
metadata:
# ...
labels:
app: nginx
spec:
containers:
- image: nginx:1.14.2
# ...
name: nginx
ports:
- containerPort: 80
# ...
Here are the merge calculations that would be performed by kubectl apply:
- Calculate the fields to delete by reading values from last-applied-configuration and comparing them to values in the configuration file. Clear fields explicitly set to null in the local object configuration file regardless of whether they appear in the last-applied-configuration. In this example, minReadySeconds appears in the last-applied-configuration annotation, but does not appear in the configuration file. Action: Clear minReadySeconds from the live configuration.
- Calculate the fields to set by reading values from the configuration file and comparing them to values in the live configuration. In this example, the value of image in the configuration file does not match the value in the live configuration. Action: Set the value of image in the live configuration.
- Set the last-applied-configuration annotation to match the value of the configuration file.
- Merge the results from 1, 2, 3 into a single patch request to the API server.
Here is the live configuration that is the result of the merge:
apiVersion: apps/v1
kind: Deployment
metadata:
annotations:
# ...
# The annotation contains the updated image to nginx 1.16.1,
# but does not contain the updated replicas to 2
kubectl.kubernetes.io/last-applied-configuration: |
{"apiVersion":"apps/v1","kind":"Deployment",
"metadata":{"annotations":{},"name":"nginx-deployment","namespace":"default"},
"spec":{"selector":{"matchLabels":{"app":nginx}},"template":{"metadata":{"labels":{"app":"nginx"}},
"spec":{"containers":[{"image":"nginx:1.16.1","name":"nginx",
"ports":[{"containerPort":80}]}]}}}}
# ...
spec:
selector:
matchLabels:
# ...
app: nginx
replicas: 2 # Set by `kubectl scale`. Ignored by `kubectl apply`.
# minReadySeconds cleared by `kubectl apply`
# ...
template:
metadata:
# ...
labels:
app: nginx
spec:
containers:
- image: nginx:1.16.1 # Set by `kubectl apply`
# ...
name: nginx
ports:
- containerPort: 80
# ...
# ...
# ...
# ...
How different types of fields are merged
How a particular field in a configuration file is merged with the live configuration depends on the type of the field. There are several types of fields:
- primitive: A field of type string, integer, or boolean. For example, image and replicas are primitive fields. Action: Replace.
- map, also called object: A field of type map or a complex type that contains subfields. For example, labels, annotations, spec and metadata are all maps. Action: Merge elements or subfields.
- list: A field containing a list of items that can be either primitive types or maps. For example, containers, ports, and args are lists. Action: Varies.
When kubectl apply updates a map or list field, it typically does not replace the entire field, but instead updates the individual subelements. For instance, when merging the spec on a Deployment, the entire spec is not replaced. Instead the subfields of spec, such as replicas, are compared and merged.
Merging changes to primitive fields
Primitive fields are replaced or cleared.
- is used for "not applicable" because the value is not used.
Field in object configuration file | Field in live object configuration | Field in last-applied-configuration | Action |
---|---|---|---|
Yes | Yes | - | Set live to configuration file value. |
Yes | No | - | Set live to local configuration. |
No | - | Yes | Clear from live configuration. |
No | - | No | Do nothing. Keep live value. |
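As a short hypothetical illustration of these rules, in the same style as the list-merge examples further below, here is how the primitive image field would be merged:
# last-applied-configuration value
image: nginx:1.14.2
# configuration file value
image: nginx:1.16.1
# live configuration
image: nginx:1.14.2
# result after merge
image: nginx:1.16.1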
Merging changes to map fields
Fields that represent maps are merged by comparing each of the subfields or elements of the map:
- is used for "not applicable" because the value is not used.
Key in object configuration file | Key in live object configuration | Field in last-applied-configuration | Action |
---|---|---|---|
Yes | Yes | - | Compare sub fields values. |
Yes | No | - | Set live to local configuration. |
No | - | Yes | Delete from live configuration. |
No | - | No | Do nothing. Keep live value. |
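As a hypothetical illustration for a map field such as labels, following the same row logic:
# last-applied-configuration value
labels:
  app: nginx
  tier: frontend
# configuration file value
labels:
  app: nginx
  track: stable
# live configuration
labels:
  app: nginx
  tier: frontend
# result after merge
labels:
  app: nginx # unchanged: matches the configuration file value
  track: stable # set: present in the configuration file
  # tier was cleared: present in last-applied-configuration, missing from the configuration file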
Merging changes for fields of type list
Merging changes to a list uses one of three strategies:
- Replace the list if all its elements are primitives.
- Merge individual elements in a list of complex elements.
- Merge a list of primitive elements.
The choice of strategy is made on a per-field basis.
Replace the list if all its elements are primitives
Treat the list the same as a primitive field. Replace or delete the entire list. This preserves ordering.
Example: Use kubectl apply to update the args field of a Container in a Pod. This sets the value of args in the live configuration to the value in the configuration file. Any args elements that had previously been added to the live configuration are lost. The order of the args elements defined in the configuration file is retained in the live configuration.
# last-applied-configuration value
args: ["a", "b"]
# configuration file value
args: ["a", "c"]
# live configuration
args: ["a", "b", "d"]
# result after merge
args: ["a", "c"]
Explanation: The merge used the configuration file value as the new list value.
Merge individual elements of a list of complex elements
Treat the list as a map, and treat a specific field of each element as a key. Add, delete, or update individual elements. This does not preserve ordering.
This merge strategy uses a special tag on each field called a patchMergeKey. The patchMergeKey is defined for each field in the Kubernetes source code: types.go. When merging a list of maps, the field specified as the patchMergeKey for a given element is used like a map key for that element.
Example: Use kubectl apply to update the containers field of a PodSpec. This merges the list as though it was a map where each element is keyed by name.
# last-applied-configuration value
containers:
- name: nginx
image: nginx:1.16
- name: nginx-helper-a # key: nginx-helper-a; will be deleted in result
image: helper:1.3
- name: nginx-helper-b # key: nginx-helper-b; will be retained
image: helper:1.3
# configuration file value
containers:
- name: nginx
image: nginx:1.16
- name: nginx-helper-b
image: helper:1.3
- name: nginx-helper-c # key: nginx-helper-c; will be added in result
image: helper:1.3
# live configuration
containers:
- name: nginx
image: nginx:1.16
- name: nginx-helper-a
image: helper:1.3
- name: nginx-helper-b
image: helper:1.3
args: ["run"] # Field will be retained
- name: nginx-helper-d # key: nginx-helper-d; will be retained
image: helper:1.3
# result after merge
containers:
- name: nginx
image: nginx:1.16
# Element nginx-helper-a was deleted
- name: nginx-helper-b
image: helper:1.3
args: ["run"] # Field was retained
- name: nginx-helper-c # Element was added
image: helper:1.3
- name: nginx-helper-d # Element was ignored
image: helper:1.3
Explanation:
- The container named "nginx-helper-a" was deleted because no container named "nginx-helper-a" appeared in the configuration file.
- The container named "nginx-helper-b" retained the changes to
args
in the live configuration.kubectl apply
was able to identify that "nginx-helper-b" in the live configuration was the same "nginx-helper-b" as in the configuration file, even though their fields had different values (noargs
in the configuration file). This is because thepatchMergeKey
field value (name) was identical in both. - The container named "nginx-helper-c" was added because no container with that name appeared in the live configuration, but one with that name appeared in the configuration file.
- The container named "nginx-helper-d" was retained because no element with that name appeared in the last-applied-configuration.
Merge a list of primitive elements
As of Kubernetes 1.5, merging lists of primitive elements is not supported.
Which merge strategy applies to a given field is controlled by the patchStrategy tag in types.go. If no patchStrategy is specified for a field of type list, then the list is replaced.
Default field values
The API server sets certain fields to default values in the live configuration if they are not specified when the object is created.
Here's a configuration file for a Deployment. The file does not specify strategy:
apiVersion: apps/v1
kind: Deployment
metadata:
name: nginx-deployment
spec:
selector:
matchLabels:
app: nginx
minReadySeconds: 5
template:
metadata:
labels:
app: nginx
spec:
containers:
- name: nginx
image: nginx:1.14.2
ports:
- containerPort: 80
Create the object using kubectl apply:
kubectl apply -f https://k8s.io/examples/application/simple_deployment.yaml
Print the live configuration using kubectl get:
kubectl get -f https://k8s.io/examples/application/simple_deployment.yaml -o yaml
The output shows that the API server set several fields to default values in the live configuration. These fields were not specified in the configuration file.
apiVersion: apps/v1
kind: Deployment
# ...
spec:
selector:
matchLabels:
app: nginx
minReadySeconds: 5
replicas: 1 # defaulted by apiserver
strategy:
rollingUpdate: # defaulted by apiserver - derived from strategy.type
maxSurge: 1
maxUnavailable: 1
type: RollingUpdate # defaulted by apiserver
template:
metadata:
creationTimestamp: null
labels:
app: nginx
spec:
containers:
- image: nginx:1.14.2
imagePullPolicy: IfNotPresent # defaulted by apiserver
name: nginx
ports:
- containerPort: 80
protocol: TCP # defaulted by apiserver
resources: {} # defaulted by apiserver
terminationMessagePath: /dev/termination-log # defaulted by apiserver
dnsPolicy: ClusterFirst # defaulted by apiserver
restartPolicy: Always # defaulted by apiserver
securityContext: {} # defaulted by apiserver
terminationGracePeriodSeconds: 30 # defaulted by apiserver
# ...
In a patch request, defaulted fields are not re-defaulted unless they are explicitly cleared as part of a patch request. This can cause unexpected behavior for fields that are defaulted based on the values of other fields. When the other fields are later changed, the values defaulted from them will not be updated unless they are explicitly cleared.
For this reason, it is recommended that certain fields defaulted by the server are explicitly defined in the configuration file, even if the desired values match the server defaults. This makes it easier to recognize conflicting values that will not be re-defaulted by the server.
Example:
# last-applied-configuration
spec:
template:
metadata:
labels:
app: nginx
spec:
containers:
- name: nginx
image: nginx:1.14.2
ports:
- containerPort: 80
# configuration file
spec:
strategy:
type: Recreate # updated value
template:
metadata:
labels:
app: nginx
spec:
containers:
- name: nginx
image: nginx:1.14.2
ports:
- containerPort: 80
# live configuration
spec:
strategy:
type: RollingUpdate # defaulted value
rollingUpdate: # defaulted value derived from type
maxSurge: 1
maxUnavailable: 1
template:
metadata:
labels:
app: nginx
spec:
containers:
- name: nginx
image: nginx:1.14.2
ports:
- containerPort: 80
# result after merge - ERROR!
spec:
strategy:
type: Recreate # updated value: incompatible with rollingUpdate
rollingUpdate: # defaulted value: incompatible with "type: Recreate"
maxSurge: 1
maxUnavailable: 1
template:
metadata:
labels:
app: nginx
spec:
containers:
- name: nginx
image: nginx:1.14.2
ports:
- containerPort: 80
Explanation:
- The user creates a Deployment without defining strategy.type.
- The server defaults strategy.type to RollingUpdate and defaults the strategy.rollingUpdate values.
- The user changes strategy.type to Recreate. The strategy.rollingUpdate values remain at their defaulted values, though the server expects them to be cleared. If the strategy.rollingUpdate values had been defined initially in the configuration file, it would have been more clear that they needed to be deleted.
- Apply fails because strategy.rollingUpdate is not cleared. The strategy.rollingUpdate field cannot be defined with a strategy.type of Recreate.
Recommendation: These fields should be explicitly defined in the object configuration file:
- Selectors and PodTemplate labels on workloads, such as Deployment, StatefulSet, Job, DaemonSet, ReplicaSet, and ReplicationController
- Deployment rollout strategy
How to clear server-defaulted fields or fields set by other writers
Fields that do not appear in the configuration file can be cleared by
setting their values to null
and then applying the configuration file.
For fields defaulted by the server, this triggers re-defaulting
the values.
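For example, to hand minReadySeconds back to its server default, you could set it to null in the configuration file and apply it again. A minimal sketch against the simple_deployment.yaml used above (the choice of field is illustrative):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  minReadySeconds: null # cleared on apply; the server re-defaults it
  selector:
    matchLabels:
      app: nginx
# ...
kubectl apply -f simple_deployment.yaml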
How to change ownership of a field between the configuration file and direct imperative writers
These are the only methods you should use to change an individual object field:
- Use kubectl apply.
- Write directly to the live configuration without modifying the configuration file: for example, use kubectl scale.
Changing the owner from a direct imperative writer to a configuration file
Add the field to the configuration file. For the field, discontinue direct updates to the live configuration that do not go through kubectl apply.
Changing the owner from a configuration file to a direct imperative writer
As of Kubernetes 1.5, changing ownership of a field from a configuration file to an imperative writer requires manual steps:
- Remove the field from the configuration file.
- Remove the field from the kubectl.kubernetes.io/last-applied-configuration annotation on the live object.
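A sketch of these two steps, assuming ownership of spec.replicas is being handed over to kubectl scale (the Deployment name is illustrative):
# Step 1: delete replicas from the local configuration file.
# Step 2: remove it from the annotation on the live object:
kubectl apply view-last-applied deployment/nginx-deployment # inspect the annotation
kubectl apply edit-last-applied deployment/nginx-deployment # delete the replicas entry in the editor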
Changing management methods
Kubernetes objects should be managed using only one method at a time. Switching from one method to another is possible, but is a manual process.
Migrating from imperative command management to declarative object configuration
Migrating from imperative command management to declarative object configuration involves several manual steps:
- Export the live object to a local configuration file:
  kubectl get <kind>/<name> -o yaml > <kind>_<name>.yaml
- Manually remove the status field from the configuration file. Note: This step is optional, as kubectl apply does not update the status field even if it is present in the configuration file.
- Set the kubectl.kubernetes.io/last-applied-configuration annotation on the object:
  kubectl replace --save-config -f <kind>_<name>.yaml
- Change processes to use kubectl apply for managing the object exclusively.
Migrating from imperative object configuration to declarative object configuration
- Set the kubectl.kubernetes.io/last-applied-configuration annotation on the object:
  kubectl replace --save-config -f <kind>_<name>.yaml
- Change processes to use kubectl apply for managing the object exclusively.
Defining controller selectors and PodTemplate labels
The recommended approach is to define a single, immutable PodTemplate label used only by the controller selector with no other semantic meaning.
Example:
selector:
matchLabels:
controller-selector: "apps/v1/deployment/nginx"
template:
metadata:
labels:
controller-selector: "apps/v1/deployment/nginx"
What's next
4.4.2 - Declarative Management of Kubernetes Objects Using Kustomize
Kustomize is a standalone tool to customize Kubernetes objects through a kustomization file.
Since 1.14, Kubectl also supports the management of Kubernetes objects using a kustomization file. To view Resources found in a directory containing a kustomization file, run the following command:
kubectl kustomize <kustomization_directory>
To apply those Resources, run kubectl apply with the --kustomize or -k flag:
kubectl apply -k <kustomization_directory>
Before you begin
Install kubectl.
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
To check the version, enter kubectl version.
Overview of Kustomize
Kustomize is a tool for customizing Kubernetes configurations. It has the following features to manage application configuration files:
- generating resources from other sources
- setting cross-cutting fields for resources
- composing and customizing collections of resources
Generating Resources
ConfigMaps and Secrets hold configuration or sensitive data that are used by other Kubernetes objects, such as Pods. The source of truth of ConfigMaps or Secrets is usually external to a cluster, such as a .properties file or an SSH key file.
Kustomize has secretGenerator
and configMapGenerator
, which generate Secret and ConfigMap from files or literals.
configMapGenerator
To generate a ConfigMap from a file, add an entry to the files
list in configMapGenerator
. Here is an example of generating a ConfigMap with a data item from a .properties
file:
# Create an application.properties file
cat <<EOF >application.properties
FOO=Bar
EOF
cat <<EOF >./kustomization.yaml
configMapGenerator:
- name: example-configmap-1
files:
- application.properties
EOF
The generated ConfigMap can be examined with the following command:
kubectl kustomize ./
The generated ConfigMap is:
apiVersion: v1
data:
application.properties: |
FOO=Bar
kind: ConfigMap
metadata:
name: example-configmap-1-8mbdf7882g
To generate a ConfigMap from an env file, add an entry to the envs
list in configMapGenerator
. This can also be used to set values from local environment variables by omitting the =
and the value.
Here is an example of generating a ConfigMap with a data item from a .env
file:
# Create a .env file
# BAZ will be populated from the local environment variable $BAZ
cat <<EOF >.env
FOO=Bar
BAZ
EOF
cat <<EOF >./kustomization.yaml
configMapGenerator:
- name: example-configmap-1
envs:
- .env
EOF
The generated ConfigMap can be examined with the following command:
BAZ=Qux kubectl kustomize ./
The generated ConfigMap is:
apiVersion: v1
data:
BAZ: Qux
FOO: Bar
kind: ConfigMap
metadata:
name: example-configmap-1-892ghb99c8
Note: Each variable in the .env file becomes a separate key in the ConfigMap that you generate. This is different from the previous example which embeds a file named application.properties (and all its entries) as the value for a single key.
ConfigMaps can also be generated from literal key-value pairs. To generate a ConfigMap from a literal key-value pair, add an entry to the literals
list in configMapGenerator. Here is an example of generating a ConfigMap with a data item from a key-value pair:
cat <<EOF >./kustomization.yaml
configMapGenerator:
- name: example-configmap-2
literals:
- FOO=Bar
EOF
The generated ConfigMap can be checked by the following command:
kubectl kustomize ./
The generated ConfigMap is:
apiVersion: v1
data:
FOO: Bar
kind: ConfigMap
metadata:
name: example-configmap-2-g2hdhfc6tk
To use a generated ConfigMap in a Deployment, reference it by the name of the configMapGenerator. Kustomize will automatically replace this name with the generated name.
This is an example deployment that uses a generated ConfigMap:
# Create an application.properties file
cat <<EOF >application.properties
FOO=Bar
EOF
cat <<EOF >deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-app
labels:
app: my-app
spec:
selector:
matchLabels:
app: my-app
template:
metadata:
labels:
app: my-app
spec:
containers:
- name: app
image: my-app
volumeMounts:
- name: config
mountPath: /config
volumes:
- name: config
configMap:
name: example-configmap-1
EOF
cat <<EOF >./kustomization.yaml
resources:
- deployment.yaml
configMapGenerator:
- name: example-configmap-1
files:
- application.properties
EOF
Generate the ConfigMap and Deployment:
kubectl kustomize ./
The generated Deployment will refer to the generated ConfigMap by name:
apiVersion: v1
data:
application.properties: |
FOO=Bar
kind: ConfigMap
metadata:
name: example-configmap-1-g4hk9g2ff8
---
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
app: my-app
name: my-app
spec:
selector:
matchLabels:
app: my-app
template:
metadata:
labels:
app: my-app
spec:
containers:
- image: my-app
name: app
volumeMounts:
- mountPath: /config
name: config
volumes:
- configMap:
name: example-configmap-1-g4hk9g2ff8
name: config
secretGenerator
You can generate Secrets from files or literal key-value pairs. To generate a Secret from a file, add an entry to the files
list in secretGenerator
. Here is an example of generating a Secret with a data item from a file:
# Create a password.txt file
cat <<EOF >./password.txt
username=admin
password=secret
EOF
cat <<EOF >./kustomization.yaml
secretGenerator:
- name: example-secret-1
files:
- password.txt
EOF
The generated Secret is as follows:
apiVersion: v1
data:
password.txt: dXNlcm5hbWU9YWRtaW4KcGFzc3dvcmQ9c2VjcmV0Cg==
kind: Secret
metadata:
name: example-secret-1-t2kt65hgtb
type: Opaque
To generate a Secret from a literal key-value pair, add an entry to the literals list in secretGenerator. Here is an example of generating a Secret with a data item from a key-value pair:
cat <<EOF >./kustomization.yaml
secretGenerator:
- name: example-secret-2
literals:
- username=admin
- password=secret
EOF
The generated Secret is as follows:
apiVersion: v1
data:
password: c2VjcmV0
username: YWRtaW4=
kind: Secret
metadata:
name: example-secret-2-t52t6g96d8
type: Opaque
Like ConfigMaps, generated Secrets can be used in Deployments by referring to the name of the secretGenerator:
# Create a password.txt file
cat <<EOF >./password.txt
username=admin
password=secret
EOF
cat <<EOF >deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-app
labels:
app: my-app
spec:
selector:
matchLabels:
app: my-app
template:
metadata:
labels:
app: my-app
spec:
containers:
- name: app
image: my-app
volumeMounts:
- name: password
mountPath: /secrets
volumes:
- name: password
secret:
secretName: example-secret-1
EOF
cat <<EOF >./kustomization.yaml
resources:
- deployment.yaml
secretGenerator:
- name: example-secret-1
files:
- password.txt
EOF
generatorOptions
The generated ConfigMaps and Secrets have a content hash suffix appended. This ensures that a new ConfigMap or Secret is generated when the contents are changed. To disable the behavior of appending a suffix, one can use generatorOptions
. Besides that, it is also possible to specify cross-cutting options for generated ConfigMaps and Secrets.
cat <<EOF >./kustomization.yaml
configMapGenerator:
- name: example-configmap-3
literals:
- FOO=Bar
generatorOptions:
disableNameSuffixHash: true
labels:
type: generated
annotations:
note: generated
EOF
Run kubectl kustomize ./ to view the generated ConfigMap:
apiVersion: v1
data:
FOO: Bar
kind: ConfigMap
metadata:
annotations:
note: generated
labels:
type: generated
name: example-configmap-3
Setting cross-cutting fields
It is quite common to set cross-cutting fields for all Kubernetes resources in a project. Some use cases for setting cross-cutting fields:
- setting the same namespace for all Resources
- adding the same name prefix or suffix
- adding the same set of labels
- adding the same set of annotations
Here is an example:
# Create a deployment.yaml
cat <<EOF >./deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: nginx-deployment
labels:
app: nginx
spec:
selector:
matchLabels:
app: nginx
template:
metadata:
labels:
app: nginx
spec:
containers:
- name: nginx
image: nginx
EOF
cat <<EOF >./kustomization.yaml
namespace: my-namespace
namePrefix: dev-
nameSuffix: "-001"
commonLabels:
app: bingo
commonAnnotations:
oncallPager: 800-555-1212
resources:
- deployment.yaml
EOF
Run kubectl kustomize ./ to see that those fields are all set in the Deployment Resource:
apiVersion: apps/v1
kind: Deployment
metadata:
annotations:
oncallPager: 800-555-1212
labels:
app: bingo
name: dev-nginx-deployment-001
namespace: my-namespace
spec:
selector:
matchLabels:
app: bingo
template:
metadata:
annotations:
oncallPager: 800-555-1212
labels:
app: bingo
spec:
containers:
- image: nginx
name: nginx
Composing and Customizing Resources
It is common to compose a set of Resources in a project and manage them inside the same file or directory. Kustomize offers composing Resources from different files and applying patches or other customization to them.
Composing
Kustomize supports composition of different resources. The resources
field, in the kustomization.yaml
file, defines the list of resources to include in a configuration. Set the path to a resource's configuration file in the resources
list.
Here is an example of an NGINX application composed of a Deployment and a Service:
# Create a deployment.yaml file
cat <<EOF > deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-nginx
spec:
selector:
matchLabels:
run: my-nginx
replicas: 2
template:
metadata:
labels:
run: my-nginx
spec:
containers:
- name: my-nginx
image: nginx
ports:
- containerPort: 80
EOF
# Create a service.yaml file
cat <<EOF > service.yaml
apiVersion: v1
kind: Service
metadata:
name: my-nginx
labels:
run: my-nginx
spec:
ports:
- port: 80
protocol: TCP
selector:
run: my-nginx
EOF
# Create a kustomization.yaml composing them
cat <<EOF >./kustomization.yaml
resources:
- deployment.yaml
- service.yaml
EOF
The Resources from kubectl kustomize ./
contain both the Deployment and the Service objects.
Customizing
Patches can be used to apply different customizations to Resources. Kustomize supports different patching
mechanisms through patchesStrategicMerge
and patchesJson6902
. patchesStrategicMerge
is a list of file paths. Each file should be resolved to a strategic merge patch. The names inside the patches must match Resource names that are already loaded. Small patches that do one thing are recommended. For example, create one patch for increasing the deployment replica number and another patch for setting the memory limit.
# Create a deployment.yaml file
cat <<EOF > deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-nginx
spec:
selector:
matchLabels:
run: my-nginx
replicas: 2
template:
metadata:
labels:
run: my-nginx
spec:
containers:
- name: my-nginx
image: nginx
ports:
- containerPort: 80
EOF
# Create a patch increase_replicas.yaml
cat <<EOF > increase_replicas.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-nginx
spec:
replicas: 3
EOF
# Create another patch set_memory.yaml
cat <<EOF > set_memory.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-nginx
spec:
template:
spec:
containers:
- name: my-nginx
resources:
limits:
memory: 512Mi
EOF
cat <<EOF >./kustomization.yaml
resources:
- deployment.yaml
patchesStrategicMerge:
- increase_replicas.yaml
- set_memory.yaml
EOF
Run kubectl kustomize ./
to view the Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-nginx
spec:
replicas: 3
selector:
matchLabels:
run: my-nginx
template:
metadata:
labels:
run: my-nginx
spec:
containers:
- image: nginx
name: my-nginx
ports:
- containerPort: 80
resources:
limits:
memory: 512Mi
Not all Resources or fields support strategic merge patches. To support modifying arbitrary fields in arbitrary Resources,
Kustomize offers applying JSON patch through patchesJson6902
.
To find the correct Resource for a JSON patch, the group, version, kind, and name of that Resource need to be
specified in kustomization.yaml
. For example, increasing the replica number of a Deployment object can also be done
through patchesJson6902
.
# Create a deployment.yaml file
cat <<EOF > deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-nginx
spec:
selector:
matchLabels:
run: my-nginx
replicas: 2
template:
metadata:
labels:
run: my-nginx
spec:
containers:
- name: my-nginx
image: nginx
ports:
- containerPort: 80
EOF
# Create a json patch
cat <<EOF > patch.yaml
- op: replace
path: /spec/replicas
value: 3
EOF
# Create a kustomization.yaml
cat <<EOF >./kustomization.yaml
resources:
- deployment.yaml
patchesJson6902:
- target:
group: apps
version: v1
kind: Deployment
name: my-nginx
path: patch.yaml
EOF
Run kubectl kustomize ./ to see that the replicas field is updated:
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-nginx
spec:
replicas: 3
selector:
matchLabels:
run: my-nginx
template:
metadata:
labels:
run: my-nginx
spec:
containers:
- image: nginx
name: my-nginx
ports:
- containerPort: 80
In addition to patches, Kustomize also offers customizing container images or injecting field values from other objects into containers without creating patches. For example, you can change the image used inside containers by specifying the new image in the images field in kustomization.yaml.
cat <<EOF > deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-nginx
spec:
selector:
matchLabels:
run: my-nginx
replicas: 2
template:
metadata:
labels:
run: my-nginx
spec:
containers:
- name: my-nginx
image: nginx
ports:
- containerPort: 80
EOF
cat <<EOF >./kustomization.yaml
resources:
- deployment.yaml
images:
- name: nginx
newName: my.image.registry/nginx
newTag: 1.4.0
EOF
Run kubectl kustomize ./
to see that the image being used is updated:
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-nginx
spec:
replicas: 2
selector:
matchLabels:
run: my-nginx
template:
metadata:
labels:
run: my-nginx
spec:
containers:
- image: my.image.registry/nginx:1.4.0
name: my-nginx
ports:
- containerPort: 80
Sometimes, the application running in a Pod may need to use configuration values from other objects. For example, a Pod from a Deployment object may need to read the corresponding Service name from an environment variable or a command argument. Since the Service name may change as namePrefix or nameSuffix is added in the kustomization.yaml file, it is not recommended to hard-code the Service name in the command argument. For this usage, Kustomize can inject the Service name into containers through vars.
# Create a deployment.yaml file (quoting the here doc delimiter)
cat <<'EOF' > deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-nginx
spec:
selector:
matchLabels:
run: my-nginx
replicas: 2
template:
metadata:
labels:
run: my-nginx
spec:
containers:
- name: my-nginx
image: nginx
command: ["start", "--host", "$(MY_SERVICE_NAME)"]
EOF
# Create a service.yaml file
cat <<EOF > service.yaml
apiVersion: v1
kind: Service
metadata:
name: my-nginx
labels:
run: my-nginx
spec:
ports:
- port: 80
protocol: TCP
selector:
run: my-nginx
EOF
cat <<EOF >./kustomization.yaml
namePrefix: dev-
nameSuffix: "-001"
resources:
- deployment.yaml
- service.yaml
vars:
- name: MY_SERVICE_NAME
objref:
kind: Service
name: my-nginx
apiVersion: v1
EOF
Run kubectl kustomize ./ to see that the Service name injected into containers is dev-my-nginx-001:
apiVersion: apps/v1
kind: Deployment
metadata:
name: dev-my-nginx-001
spec:
replicas: 2
selector:
matchLabels:
run: my-nginx
template:
metadata:
labels:
run: my-nginx
spec:
containers:
- command:
- start
- --host
- dev-my-nginx-001
image: nginx
name: my-nginx
Bases and Overlays
Kustomize has the concepts of bases and overlays. A base is a directory with a kustomization.yaml
, which contains a
set of resources and associated customization. A base could be either a local directory or a directory from a remote repo,
as long as a kustomization.yaml
is present inside. An overlay is a directory with a kustomization.yaml
that refers to other
kustomization directories as its bases
. A base has no knowledge of an overlay and can be used in multiple overlays.
An overlay may have multiple bases; it composes all resources from its bases and may also have customization on top of them.
Here is an example of a base:
# Create a directory to hold the base
mkdir base
# Create a base/deployment.yaml
cat <<EOF > base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-nginx
spec:
selector:
matchLabels:
run: my-nginx
replicas: 2
template:
metadata:
labels:
run: my-nginx
spec:
containers:
- name: my-nginx
image: nginx
EOF
# Create a base/service.yaml file
cat <<EOF > base/service.yaml
apiVersion: v1
kind: Service
metadata:
name: my-nginx
labels:
run: my-nginx
spec:
ports:
- port: 80
protocol: TCP
selector:
run: my-nginx
EOF
# Create a base/kustomization.yaml
cat <<EOF > base/kustomization.yaml
resources:
- deployment.yaml
- service.yaml
EOF
This base can be used in multiple overlays. You can add different namePrefix
or other cross-cutting fields
in different overlays. Here are two overlays using the same base.
mkdir dev
cat <<EOF > dev/kustomization.yaml
bases:
- ../base
namePrefix: dev-
EOF
mkdir prod
cat <<EOF > prod/kustomization.yaml
bases:
- ../base
namePrefix: prod-
EOF
How to apply/view/delete objects using Kustomize
Use --kustomize or -k in kubectl commands to recognize Resources managed by kustomization.yaml.
Note that -k
should point to a kustomization directory, such as
kubectl apply -k <kustomization directory>/
Given the following kustomization.yaml,
# Create a deployment.yaml file
cat <<EOF > deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-nginx
spec:
selector:
matchLabels:
run: my-nginx
replicas: 2
template:
metadata:
labels:
run: my-nginx
spec:
containers:
- name: my-nginx
image: nginx
ports:
- containerPort: 80
EOF
# Create a kustomization.yaml
cat <<EOF >./kustomization.yaml
namePrefix: dev-
commonLabels:
app: my-nginx
resources:
- deployment.yaml
EOF
Run the following command to apply the Deployment object dev-my-nginx:
> kubectl apply -k ./
deployment.apps/dev-my-nginx created
Run one of the following commands to view the Deployment object dev-my-nginx:
kubectl get -k ./
kubectl describe -k ./
Run the following command to compare the Deployment object dev-my-nginx
against the state that the cluster would be in if the manifest was applied:
kubectl diff -k ./
Run the following command to delete the Deployment object dev-my-nginx:
> kubectl delete -k ./
deployment.apps "dev-my-nginx" deleted
Kustomize Feature List
| Field | Type | Explanation |
|---|---|---|
| namespace | string | add namespace to all resources |
| namePrefix | string | value of this field is prepended to the names of all resources |
| nameSuffix | string | value of this field is appended to the names of all resources |
| commonLabels | map[string]string | labels to add to all resources and selectors |
| commonAnnotations | map[string]string | annotations to add to all resources |
| resources | []string | each entry in this list must resolve to an existing resource configuration file |
| configMapGenerator | []ConfigMapArgs | Each entry in this list generates a ConfigMap |
| secretGenerator | []SecretArgs | Each entry in this list generates a Secret |
| generatorOptions | GeneratorOptions | Modify behaviors of all ConfigMap and Secret generators |
| bases | []string | Each entry in this list should resolve to a directory containing a kustomization.yaml file |
| patchesStrategicMerge | []string | Each entry in this list should resolve to a strategic merge patch of a Kubernetes object |
| patchesJson6902 | []Patch | Each entry in this list should resolve to a Kubernetes object and a JSON Patch |
| vars | []Var | Each entry is to capture text from one resource's field |
| images | []Image | Each entry is to modify the name, tags and/or digest for one image without creating patches |
| configurations | []string | Each entry in this list should resolve to a file containing Kustomize transformer configurations |
| crds | []string | Each entry in this list should resolve to an OpenAPI definition file for Kubernetes types |
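As a quick reference, here is a sketch of a kustomization.yaml that exercises several of these fields together (all names and values are illustrative):
namePrefix: staging-
commonLabels:
  env: staging
resources:
- deployment.yaml
- service.yaml
configMapGenerator:
- name: app-config
  literals:
  - LOG_LEVEL=debug
images:
- name: nginx
  newTag: "1.21"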
What's next
4.4.3 - Managing Kubernetes Objects Using Imperative Commands
Kubernetes objects can quickly be created, updated, and deleted directly using
imperative commands built into the kubectl
command-line tool. This document
explains how those commands are organized and how to use them to manage live objects.
Before you begin
Install kubectl.
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
To check the version, enter kubectl version.
Trade-offs
The kubectl
tool supports three kinds of object management:
- Imperative commands
- Imperative object configuration
- Declarative object configuration
See Kubernetes Object Management for a discussion of the advantages and disadvantages of each kind of object management.
How to create objects
The kubectl
tool supports verb-driven commands for creating some of the most common
object types. The commands are named to be recognizable to users unfamiliar with
the Kubernetes object types.
- run: Create a new Pod to run a Container.
- expose: Create a new Service object to load balance traffic across Pods.
- autoscale: Create a new Autoscaler object to automatically horizontally scale a controller, such as a Deployment.
The kubectl
tool also supports creation commands driven by object type.
These commands support more object types and are more explicit about
their intent, but require users to know the type of objects they intend
to create.
create <objecttype> [<subtype>] <instancename>
Some object types have subtypes that you can specify in the create
command.
For example, the Service object has several subtypes including ClusterIP,
LoadBalancer, and NodePort. Here's an example that creates a Service with
subtype NodePort:
kubectl create service nodeport <myservicename>
In the preceding example, the create service nodeport
command is called
a subcommand of the create service
command.
You can use the -h
flag to find the arguments and flags supported by
a subcommand:
kubectl create service nodeport -h
How to update objects
The kubectl
command supports verb-driven commands for some common update operations.
These commands are named to enable users unfamiliar with Kubernetes
objects to perform updates without knowing the specific fields
that must be set:
- scale: Horizontally scale a controller to add or remove Pods by updating the replica count of the controller.
- annotate: Add or remove an annotation from an object.
- label: Add or remove a label from an object.
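For example, against a hypothetical Deployment named nginx-deployment:
kubectl scale deployment/nginx-deployment --replicas=3
kubectl annotate deployment/nginx-deployment description='frontend proxy'
kubectl label deployment/nginx-deployment tier=frontend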
The kubectl
command also supports update commands driven by an aspect of the object.
Setting this aspect may set different fields for different object types:
- set <field>: Set an aspect of an object.
The kubectl tool supports these additional ways to update a live object directly; however, they require a better understanding of the Kubernetes object schema.
- edit: Directly edit the raw configuration of a live object by opening its configuration in an editor.
- patch: Directly modify specific fields of a live object by using a patch string. For more details on patch strings, see the patch section in API Conventions.
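For instance, a sketch against the same hypothetical Deployment (the patch shown is a strategic merge patch in JSON form):
kubectl edit deployment/nginx-deployment
kubectl patch deployment/nginx-deployment --patch '{"spec": {"replicas": 3}}'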
How to delete objects
You can use the delete
command to delete an object from a cluster:
delete <type>/<name>
Note: You can use kubectl delete for both imperative commands and imperative object configuration. The difference is in the arguments passed to the command. To use kubectl delete as an imperative command, pass the object to be deleted as an argument. Here's an example that passes a Deployment object named nginx:
kubectl delete deployment/nginx
How to view an object
There are several commands for printing information about an object:
- get: Prints basic information about matching objects. Use get -h to see a list of options.
- describe: Prints aggregated detailed information about matching objects.
- logs: Prints the stdout and stderr for a container running in a Pod.
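For example (object and Pod names are illustrative):
kubectl get deployment/nginx-deployment -o wide
kubectl describe deployment/nginx-deployment
kubectl logs nginx-deployment-75675f5897-7ci7o -c nginx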
Using set
commands to modify objects before creation
There are some object fields that don't have a flag you can use
in a create
command. In some of those cases, you can use a combination of
set
and create
to specify a value for the field before object
creation. This is done by piping the output of the create
command to the
set
command, and then back to the create
command. Here's an example:
kubectl create service clusterip my-svc --clusterip="None" -o yaml --dry-run=client | kubectl set selector --local -f - 'environment=qa' -o yaml | kubectl create -f -
- The kubectl create service -o yaml --dry-run=client command creates the configuration for the Service, but prints it to stdout as YAML instead of sending it to the Kubernetes API server.
- The kubectl set selector --local -f - -o yaml command reads the configuration from stdin, and writes the updated configuration to stdout as YAML.
- The kubectl create -f - command creates the object using the configuration provided via stdin.
Using --edit
to modify objects before creation
You can use kubectl create --edit
to make arbitrary changes to an object
before it is created. Here's an example:
kubectl create service clusterip my-svc --clusterip="None" -o yaml --dry-run=client > /tmp/srv.yaml
kubectl create --edit -f /tmp/srv.yaml
- The kubectl create service command creates the configuration for the Service and saves it to /tmp/srv.yaml.
- The kubectl create --edit command opens the configuration file for editing before it creates the object.
What's next
4.4.4 - Imperative Management of Kubernetes Objects Using Configuration Files
Kubernetes objects can be created, updated, and deleted by using the kubectl
command-line tool along with an object configuration file written in YAML or JSON.
This document explains how to define and manage objects using configuration files.
Before you begin
Install kubectl.
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
To check the version, enter kubectl version.
Trade-offs
The kubectl
tool supports three kinds of object management:
- Imperative commands
- Imperative object configuration
- Declarative object configuration
See Kubernetes Object Management for a discussion of the advantages and disadvantages of each kind of object management.
How to create objects
You can use kubectl create -f
to create an object from a configuration file.
Refer to the Kubernetes API reference for details.
kubectl create -f <filename|url>
How to update objects
Warning: Updating objects with the replace command drops all parts of the spec not specified in the configuration file. This should not be used with objects whose specs are partially managed by the cluster, such as Services of type LoadBalancer, where the externalIPs field is managed independently from the configuration file. Independently managed fields must be copied to the configuration file to prevent replace from dropping them.
You can use kubectl replace -f
to update a live object according to a
configuration file.
kubectl replace -f <filename|url>
How to delete objects
You can use kubectl delete -f
to delete an object that is described in a
configuration file.
kubectl delete -f <filename|url>
If the configuration file has specified the generateName field in the metadata section instead of the name field, you cannot delete the object using kubectl delete -f <filename|url>.
You will have to use other flags for deleting the object. For example:
kubectl delete <type> <name>
kubectl delete <type> -l <label>
How to view an object
You can use kubectl get -f
to view information about an object that is
described in a configuration file.
kubectl get -f <filename|url> -o yaml
The -o yaml
flag specifies that the full object configuration is printed.
Use kubectl get -h
to see a list of options.
Limitations
The create
, replace
, and delete
commands work well when each object's
configuration is fully defined and recorded in its configuration
file. However, when a live object is updated and the updates are not merged
into its configuration file, the updates will be lost the next time a replace
is executed. This can happen if a controller, such as
a HorizontalPodAutoscaler, makes updates directly to a live object. Here's
an example:
- You create an object from a configuration file.
- Another source updates the object by changing some field.
- You replace the object from the configuration file. Changes made by the other source in step 2 are lost.
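As a sketch of the sequence above, assuming a Deployment defined in deployment.yaml with replicas: 2 (names are illustrative):
kubectl create -f deployment.yaml
# Another writer updates the live object imperatively:
kubectl scale deployment/my-app --replicas=5
# Replacing from the unchanged file reverts replicas to 2,
# silently discarding the imperative update:
kubectl replace -f deployment.yaml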
If you need to support multiple writers to the same object, you can use
kubectl apply
to manage the object.
Creating and editing an object from a URL without saving the configuration
Suppose you have the URL of an object configuration file. You can use
kubectl create --edit
to make changes to the configuration before the
object is created. This is particularly useful for tutorials and tasks
that point to a configuration file that could be modified by the reader.
kubectl create -f <url> --edit
Migrating from imperative commands to imperative object configuration
Migrating from imperative commands to imperative object configuration involves several manual steps.
- Export the live object to a local object configuration file:
  kubectl get <kind>/<name> -o yaml > <kind>_<name>.yaml
- Manually remove the status field from the object configuration file.
- For subsequent object management, use replace exclusively:
  kubectl replace -f <kind>_<name>.yaml
Defining controller selectors and PodTemplate labels
The recommended approach is to define a single, immutable PodTemplate label used only by the controller selector with no other semantic meaning.
Example label:
selector:
matchLabels:
controller-selector: "apps/v1/deployment/nginx"
template:
metadata:
labels:
controller-selector: "apps/v1/deployment/nginx"
What's next
4.4.5 - Update API Objects in Place Using kubectl patch
This task shows how to use kubectl patch
to update an API object in place. The exercises
in this task demonstrate a strategic merge patch and a JSON merge patch.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
To check the version, enter kubectl version.
Use a strategic merge patch to update a Deployment
Here's the configuration file for a Deployment that has two replicas. Each replica is a Pod that has one container:
apiVersion: apps/v1
kind: Deployment
metadata:
name: patch-demo
spec:
replicas: 2
selector:
matchLabels:
app: nginx
template:
metadata:
labels:
app: nginx
spec:
containers:
- name: patch-demo-ctr
image: nginx
tolerations:
- effect: NoSchedule
key: dedicated
value: test-team
Create the Deployment:
kubectl apply -f https://k8s.io/examples/application/deployment-patch.yaml
View the Pods associated with your Deployment:
kubectl get pods
The output shows that the Deployment has two Pods. The 1/1
indicates that
each Pod has one container:
NAME READY STATUS RESTARTS AGE
patch-demo-28633765-670qr 1/1 Running 0 23s
patch-demo-28633765-j5qs3 1/1 Running 0 23s
Make a note of the names of the running Pods. Later, you will see that these Pods get terminated and replaced by new ones.
At this point, each Pod has one Container that runs the nginx image. Now suppose you want each Pod to have two containers: one that runs nginx and one that runs redis.
Create a file named patch-file.yaml
that has this content:
spec:
template:
spec:
containers:
- name: patch-demo-ctr-2
image: redis
Patch your Deployment:
kubectl patch deployment patch-demo --patch-file patch-file.yaml
View the patched Deployment:
kubectl get deployment patch-demo --output yaml
The output shows that the PodSpec in the Deployment has two Containers:
containers:
- image: redis
imagePullPolicy: Always
name: patch-demo-ctr-2
...
- image: nginx
imagePullPolicy: Always
name: patch-demo-ctr
...
View the Pods associated with your patched Deployment:
kubectl get pods
The output shows that the running Pods have different names from the Pods that
were running previously. The Deployment terminated the old Pods and created two
new Pods that comply with the updated Deployment spec. The 2/2
indicates that
each Pod has two Containers:
NAME READY STATUS RESTARTS AGE
patch-demo-1081991389-2wrn5 2/2 Running 0 1m
patch-demo-1081991389-jmg7b 2/2 Running 0 1m
Take a closer look at one of the patch-demo Pods:
kubectl get pod <your-pod-name> --output yaml
The output shows that the Pod has two Containers: one running nginx and one running redis:
containers:
- image: redis
...
- image: nginx
...
Notes on the strategic merge patch
The patch you did in the preceding exercise is called a strategic merge patch.
Notice that the patch did not replace the containers
list. Instead it added a new
Container to the list. In other words, the list in the patch was merged with the
existing list. This is not always what happens when you use a strategic merge patch on a list.
In some cases, the list is replaced, not merged.
With a strategic merge patch, a list is either replaced or merged depending on its
patch strategy. The patch strategy is specified by the value of the patchStrategy
key
in a field tag in the Kubernetes source code. For example, the Containers
field of PodSpec
struct has a patchStrategy
of merge
:
type PodSpec struct {
...
Containers []Container `json:"containers" patchStrategy:"merge" patchMergeKey:"name" ...`
You can also see the patch strategy in the OpenAPI spec:
"io.k8s.api.core.v1.PodSpec": {
...
"containers": {
"description": "List of containers belonging to the pod. ...
},
"x-kubernetes-patch-merge-key": "name",
"x-kubernetes-patch-strategy": "merge"
},
And you can see the patch strategy in the Kubernetes API documentation.
Create a file named patch-file-tolerations.yaml
that has this content:
spec:
template:
spec:
tolerations:
- effect: NoSchedule
key: disktype
value: ssd
Patch your Deployment:
kubectl patch deployment patch-demo --patch-file patch-file-tolerations.yaml
View the patched Deployment:
kubectl get deployment patch-demo --output yaml
The output shows that the PodSpec in the Deployment has only one Toleration:
tolerations:
- effect: NoSchedule
key: disktype
value: ssd
Notice that the tolerations
list in the PodSpec was replaced, not merged. This is because
the Tolerations field of PodSpec does not have a patchStrategy
key in its field tag. So the
strategic merge patch uses the default patch strategy, which is replace
.
type PodSpec struct {
...
Tolerations []Toleration `json:"tolerations,omitempty" protobuf:"bytes,22,opt,name=tolerations"`
Use a JSON merge patch to update a Deployment
A strategic merge patch is different from a JSON merge patch. With a JSON merge patch, if you want to update a list, you have to specify the entire new list. And the new list completely replaces the existing list.
The kubectl patch
command has a type
parameter that you can set to one of these values:
Parameter value | Merge type |
---|---|
json | JSON Patch, RFC 6902 |
merge | JSON Merge Patch, RFC 7386 |
strategic | Strategic merge patch |
For a comparison of JSON patch and JSON merge patch, see JSON Patch and JSON Merge Patch.
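For example, the json type takes a list of RFC 6902 operations. A sketch against the patch-demo Deployment from this task:
kubectl patch deployment patch-demo --type json --patch '[{"op": "replace", "path": "/spec/replicas", "value": 2}]'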
The default value for the type
parameter is strategic
. So in the preceding exercise, you
did a strategic merge patch.
Next, do a JSON merge patch on your same Deployment. Create a file named patch-file-2.yaml
that has this content:
spec:
template:
spec:
containers:
- name: patch-demo-ctr-3
image: gcr.io/google-samples/node-hello:1.0
In your patch command, set type
to merge
:
kubectl patch deployment patch-demo --type merge --patch-file patch-file-2.yaml
View the patched Deployment:
kubectl get deployment patch-demo --output yaml
The containers
list that you specified in the patch has only one Container.
The output shows that your list of one Container replaced the existing containers
list.
spec:
containers:
- image: gcr.io/google-samples/node-hello:1.0
...
name: patch-demo-ctr-3
List the running Pods:
kubectl get pods
In the output, you can see that the existing Pods were terminated, and new Pods
were created. The 1/1
indicates that each new Pod is running only one Container.
NAME READY STATUS RESTARTS AGE
patch-demo-1307768864-69308 1/1 Running 0 1m
patch-demo-1307768864-c86dc 1/1 Running 0 1m
Use strategic merge patch to update a Deployment using the retainKeys strategy
Here's the configuration file for a Deployment that uses the RollingUpdate
strategy:
apiVersion: apps/v1
kind: Deployment
metadata:
name: retainkeys-demo
spec:
selector:
matchLabels:
app: nginx
strategy:
rollingUpdate:
maxSurge: 30%
template:
metadata:
labels:
app: nginx
spec:
containers:
- name: retainkeys-demo-ctr
image: nginx
Create the deployment:
kubectl apply -f https://k8s.io/examples/application/deployment-retainkeys.yaml
At this point, the deployment is created and is using the RollingUpdate
strategy.
Create a file named patch-file-no-retainkeys.yaml
that has this content:
spec:
strategy:
type: Recreate
Patch your Deployment:
kubectl patch deployment retainkeys-demo --type merge --patch-file patch-file-no-retainkeys.yaml
In the output, you can see that it is not possible to set type
as Recreate
when a value is defined for spec.strategy.rollingUpdate
:
The Deployment "retainkeys-demo" is invalid: spec.strategy.rollingUpdate: Forbidden: may not be specified when strategy `type` is 'Recreate'
The way to remove the value for spec.strategy.rollingUpdate
when updating the value for type
is to use the retainKeys
strategy for the strategic merge.
Create another file named patch-file-retainkeys.yaml
that has this content:
spec:
strategy:
$retainKeys:
- type
type: Recreate
With this patch, we indicate that we want to retain only the type
key of the strategy
object. Thus, the rollingUpdate
will be removed during the patch operation.
Patch your Deployment again with this new patch:
kubectl patch deployment retainkeys-demo --type merge --patch-file patch-file-retainkeys.yaml
Examine the content of the Deployment:
kubectl get deployment retainkeys-demo --output yaml
The output shows that the strategy object in the Deployment does not contain the rollingUpdate
key anymore:
spec:
strategy:
type: Recreate
template:
Notes on the strategic merge patch using the retainKeys strategy
The patch you did in the preceding exercise is called a strategic merge patch with retainKeys strategy. This method introduces a new directive $retainKeys that has the following strategies:
- It contains a list of strings.
- All fields needing to be preserved must be present in the $retainKeys list.
- The fields that are present will be merged with the live object.
- All of the missing fields will be cleared when patching.
- All fields in the $retainKeys list must be a superset of, or the same as, the fields present in the patch.
The retainKeys
strategy does not work for all objects. It only works when the value of the patchStrategy
key in a field tag in the Kubernetes source code contains retainKeys
. For example, the Strategy
field of the DeploymentSpec
struct has a patchStrategy
of retainKeys
:
type DeploymentSpec struct {
...
// +patchStrategy=retainKeys
Strategy DeploymentStrategy `json:"strategy,omitempty" patchStrategy:"retainKeys" ...`
You can also see the retainKeys strategy in the OpenAPI spec:
"io.k8s.api.apps.v1.DeploymentSpec": {
...
"strategy": {
"$ref": "#/definitions/io.k8s.api.apps.v1.DeploymentStrategy",
"description": "The deployment strategy to use to replace existing pods with new ones.",
"x-kubernetes-patch-strategy": "retainKeys"
},
And you can see the retainKeys
strategy in the
Kubernetes API documentation.
Alternate forms of the kubectl patch command
The kubectl patch
command takes YAML or JSON. It can take the patch as a file or
directly on the command line.
Create a file named patch-file.json
that has this content:
{
"spec": {
"template": {
"spec": {
"containers": [
{
"name": "patch-demo-ctr-2",
"image": "redis"
}
]
}
}
}
}
The following commands are equivalent:
kubectl patch deployment patch-demo --patch-file patch-file.yaml
kubectl patch deployment patch-demo --patch $'spec:\n template:\n  spec:\n   containers:\n   - name: patch-demo-ctr-2\n     image: redis'
kubectl patch deployment patch-demo --patch-file patch-file.json
kubectl patch deployment patch-demo --patch '{"spec": {"template": {"spec": {"containers": [{"name": "patch-demo-ctr-2","image": "redis"}]}}}}'
Summary
In this exercise, you used kubectl patch
to change the live configuration
of a Deployment object. You did not change the configuration file that you originally used to
create the Deployment object. Other commands for updating API objects include
kubectl annotate,
kubectl edit,
kubectl replace,
kubectl scale,
and
kubectl apply.
What's next
4.5 - Managing Secrets
4.5.1 - Managing Secrets using kubectl
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
Create a Secret
A Secret
can contain user credentials required by pods to access a database.
For example, a database connection string consists of a username and password.
You can store the username in a file ./username.txt
and the password in a
file ./password.txt
on your local machine.
echo -n 'admin' > ./username.txt
echo -n '1f2d1e2e67df' > ./password.txt
In these commands, the -n
flag ensures that the generated files do not have
an extra newline character at the end of the text. This is important because
when kubectl
reads a file and encodes the content into a base64 string, the
extra newline character gets encoded too.
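You can see the difference for yourself (the trailing K in the first output encodes the newline character):
echo 'admin' | base64      # YWRtaW4K
echo -n 'admin' | base64   # YWRtaW4=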
The kubectl create secret
command packages these files into a Secret and creates
the object on the API server.
kubectl create secret generic db-user-pass \
--from-file=./username.txt \
--from-file=./password.txt
The output is similar to:
secret/db-user-pass created
The default key name is the filename. You can optionally set the key name using
--from-file=[key=]source
. For example:
kubectl create secret generic db-user-pass \
--from-file=username=./username.txt \
--from-file=password=./password.txt
You do not need to escape special characters in password strings that you include in a file.
You can also provide Secret data using the --from-literal=<key>=<value> flag. This flag can be specified more than once to provide multiple key-value pairs.
Note that special characters such as $
, \
, *
, =
, and !
will be
interpreted by your shell
and require escaping.
In most shells, the easiest way to escape the password is to surround it with
single quotes ('
). For example, if your password is S!B\*d$zDsb=
,
run the following command:
kubectl create secret generic db-user-pass \
--from-literal=username=devuser \
--from-literal=password='S!B\*d$zDsb='
Verify the Secret
Check that the Secret was created:
kubectl get secrets
The output is similar to:
NAME TYPE DATA AGE
db-user-pass Opaque 2 51s
You can view a description of the Secret:
kubectl describe secrets/db-user-pass
The output is similar to:
Name: db-user-pass
Namespace: default
Labels: <none>
Annotations: <none>
Type: Opaque
Data
====
password: 12 bytes
username: 5 bytes
The commands kubectl get
and kubectl describe
avoid showing the contents
of a Secret
by default. This is to protect the Secret
from being exposed
accidentally, or from being stored in a terminal log.
Decoding the Secret
To view the contents of the Secret you created, run the following command:
kubectl get secret db-user-pass -o jsonpath='{.data}'
The output is similar to:
{"password":"MWYyZDFlMmU2N2Rm","username":"YWRtaW4="}
Now you can decode the password
data:
# This is an example for documentation purposes.
# If you did things this way, the data 'MWYyZDFlMmU2N2Rm' could be stored in
# your shell history.
# Someone with access to your computer could find that remembered command
# and base-64 decode the secret, perhaps without your knowledge.
# It's usually better to combine the steps, as shown later in the page.
echo 'MWYyZDFlMmU2N2Rm' | base64 --decode
The output is similar to:
1f2d1e2e67df
To avoid storing a secret's encoded value in your shell history, you can run the following command:
kubectl get secret db-user-pass -o jsonpath='{.data.password}' | base64 --decode
The output is similar to the above.
Clean Up
Delete the Secret you created:
kubectl delete secret db-user-pass
What's next
- Read more about the Secret concept
- Learn how to manage Secrets using config files
- Learn how to manage Secrets using kustomize
4.5.2 - Managing Secrets using Configuration File
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
Create the Config file
You can create a Secret in a file first, in JSON or YAML format, and then
create that object. The
Secret
resource contains two maps: data
and stringData
.
The data
field is used to store arbitrary data, encoded using base64. The
stringData
field is provided for convenience, and it allows you to provide
Secret data as unencoded strings.
The keys of data and stringData must consist of alphanumeric characters, dashes (-), underscores (_), or dots (.).
For example, to store two strings in a Secret using the data
field, convert
the strings to base64 as follows:
echo -n 'admin' | base64
The output is similar to:
YWRtaW4=
echo -n '1f2d1e2e67df' | base64
The output is similar to:
MWYyZDFlMmU2N2Rm
Write a Secret config file that looks like this:
apiVersion: v1
kind: Secret
metadata:
name: mysecret
type: Opaque
data:
username: YWRtaW4=
password: MWYyZDFlMmU2N2Rm
Note that the name of a Secret object must be a valid DNS subdomain name.
Note: When using the base64 utility on Darwin/macOS, users should avoid using the -b option to split long lines. Conversely, Linux users should add the option -w 0 to base64 commands, or the pipeline base64 | tr -d '\n' if the -w option is not available.
For certain scenarios, you may wish to use the stringData
field instead. This
field allows you to put a non-base64 encoded string directly into the Secret,
and the string will be encoded for you when the Secret is created or updated.
A practical example of this might be where you are deploying an application that uses a Secret to store a configuration file, and you want to populate parts of that configuration file during your deployment process.
For example, if your application uses the following configuration file:
apiUrl: "https://my.api.com/api/v1"
username: "<user>"
password: "<password>"
You could store this in a Secret using the following definition:
apiVersion: v1
kind: Secret
metadata:
name: mysecret
type: Opaque
stringData:
config.yaml: |
apiUrl: "https://my.api.com/api/v1"
username: <user>
password: <password>
Create the Secret object
Now create the Secret using kubectl apply
:
kubectl apply -f ./secret.yaml
The output is similar to:
secret/mysecret created
Check the Secret
The stringData
field is a write-only convenience field. It is never output when
retrieving Secrets. For example, if you run the following command:
kubectl get secret mysecret -o yaml
The output is similar to:
apiVersion: v1
data:
config.yaml: YXBpVXJsOiAiaHR0cHM6Ly9teS5hcGkuY29tL2FwaS92MSIKdXNlcm5hbWU6IHt7dXNlcm5hbWV9fQpwYXNzd29yZDoge3twYXNzd29yZH19
kind: Secret
metadata:
creationTimestamp: 2018-11-15T20:40:59Z
name: mysecret
namespace: default
resourceVersion: "7225"
uid: c280ad2e-e916-11e8-98f2-025000000001
type: Opaque
The commands kubectl get
and kubectl describe
avoid showing the contents of a Secret
by
default. This is to protect the Secret
from being exposed accidentally to an onlooker,
or from being stored in a terminal log.
To check the actual content of the encoded data, please refer to
decoding secret.
If a field, such as username
, is specified in both data
and stringData
,
the value from stringData
is used. For example, the following Secret definition:
apiVersion: v1
kind: Secret
metadata:
name: mysecret
type: Opaque
data:
username: YWRtaW4=
stringData:
username: administrator
Results in the following Secret:
apiVersion: v1
data:
username: YWRtaW5pc3RyYXRvcg==
kind: Secret
metadata:
creationTimestamp: 2018-11-15T20:46:46Z
name: mysecret
namespace: default
resourceVersion: "7579"
uid: 91460ecb-e917-11e8-98f2-025000000001
type: Opaque
Where YWRtaW5pc3RyYXRvcg==
decodes to administrator
.
Clean Up
To delete the Secret you have created:
kubectl delete secret mysecret
What's next
- Read more about the Secret concept
- Learn how to manage Secrets with the
kubectl
command - Learn how to manage Secrets using kustomize
4.5.3 - Managing Secrets using Kustomize
Since Kubernetes v1.14, kubectl
supports
managing objects using Kustomize.
Kustomize provides resource Generators to create Secrets and ConfigMaps. The
Kustomize generators should be specified in a kustomization.yaml
file inside
a directory. After generating the Secret, you can create the Secret on the API
server with kubectl apply
.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
Create the Kustomization file
You can generate a Secret by defining a secretGenerator
in a
kustomization.yaml
file that references other existing files.
For example, the following kustomization file references the
./username.txt
and the ./password.txt
files:
secretGenerator:
- name: db-user-pass
files:
- username.txt
- password.txt
You can also define the secretGenerator
in the kustomization.yaml
file by providing some literals.
For example, the following kustomization.yaml
file contains two literals
for username
and password
respectively:
secretGenerator:
- name: db-user-pass
literals:
- username=admin
- password=1f2d1e2e67df
You can also define the secretGenerator
in the kustomization.yaml
file by providing .env
files.
For example, the following kustomization.yaml file pulls in data from the .env.secret file:
secretGenerator:
- name: db-user-pass
envs:
- .env.secret
Note that in all cases, you don't need to base64 encode the values.
Create the Secret
Apply the directory containing the kustomization.yaml
to create the Secret.
kubectl apply -k .
The output is similar to:
secret/db-user-pass-96mffmfh4k created
Note that when a Secret is generated, the Secret name is created by hashing the Secret data and appending the hash value to the name. This ensures that a new Secret is generated each time the data is modified.
Check the Secret created
You can check that the secret was created:
kubectl get secrets
The output is similar to:
NAME TYPE DATA AGE
db-user-pass-96mffmfh4k Opaque 2 51s
You can view a description of the secret:
kubectl describe secrets/db-user-pass-96mffmfh4k
The output is similar to:
Name: db-user-pass-96mffmfh4k
Namespace: default
Labels: <none>
Annotations: <none>
Type: Opaque
Data
====
password.txt: 12 bytes
username.txt: 5 bytes
The commands kubectl get
and kubectl describe
avoid showing the contents of a Secret
by
default. This is to protect the Secret
from being exposed accidentally to an onlooker,
or from being stored in a terminal log.
To check the actual content of the encoded data, please refer to
decoding secret.
Clean Up
To delete the Secret you have created:
kubectl delete secret db-user-pass-96mffmfh4k
What's next
- Read more about the Secret concept
- Learn how to manage Secrets with the
kubectl
command - Learn how to manage Secrets using config file
4.6 - Inject Data Into Applications
4.6.1 - Define a Command and Arguments for a Container
This page shows how to define commands and arguments when you run a container in a Pod.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
To check the version, enter kubectl version.
Define a command and arguments when you create a Pod
When you create a Pod, you can define a command and arguments for the containers that run in the Pod. To define a command, include the command field in the configuration file. To define arguments for the command, include the args field in the configuration file. The command and arguments that you define cannot be changed after the Pod is created.
The command and arguments that you define in the configuration file override the default command and arguments provided by the container image. If you define args, but do not define a command, the default command is used with your new arguments.
Note: The command field corresponds to entrypoint in some container runtimes.
In this exercise, you create a Pod that runs one container. The configuration file for the Pod defines a command and two arguments:
apiVersion: v1
kind: Pod
metadata:
name: command-demo
labels:
purpose: demonstrate-command
spec:
containers:
- name: command-demo-container
image: debian
command: ["printenv"]
args: ["HOSTNAME", "KUBERNETES_PORT"]
restartPolicy: OnFailure
- Create a Pod based on the YAML configuration file:
kubectl apply -f https://k8s.io/examples/pods/commands.yaml
- List the running Pods:
kubectl get pods
The output shows that the container that ran in the command-demo Pod has completed.
- To see the output of the command that ran in the container, view the logs from the Pod:
kubectl logs command-demo
The output shows the values of the HOSTNAME and KUBERNETES_PORT environment variables:
command-demo
tcp://10.3.240.1:443
Use environment variables to define arguments
In the preceding example, you defined the arguments directly by providing strings. As an alternative to providing strings directly, you can define arguments by using environment variables:
env:
- name: MESSAGE
value: "hello world"
command: ["/bin/echo"]
args: ["$(MESSAGE)"]
This means you can define an argument for a Pod using any of the techniques available for defining environment variables, including ConfigMaps and Secrets.
Note: The environment variable appears in parentheses, "$(VAR)". This is required for the variable to be expanded in the command or args field.
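As a sketch of that idea, the argument below comes from a ConfigMap instead of a literal value; the message-config ConfigMap and its message key are hypothetical:
env:
- name: MESSAGE
  valueFrom:
    configMapKeyRef:
      name: message-config # hypothetical ConfigMap
      key: message # hypothetical key holding the message text
command: ["/bin/echo"]
args: ["$(MESSAGE)"]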
Run a command in a shell
In some cases, you need your command to run in a shell. For example, your command might consist of several commands piped together, or it might be a shell script. To run your command in a shell, wrap it like this:
command: ["/bin/sh"]
args: ["-c", "while true; do echo hello; sleep 10;done"]
What's next
- Learn more about configuring pods and containers.
- Learn more about running commands in a container.
- See Container.
4.6.2 - Define Dependent Environment Variables
This page shows how to define dependent environment variables for a container in a Kubernetes Pod.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
Define an environment dependent variable for a container
When you create a Pod, you can set dependent environment variables for the containers that run in the Pod. To set dependent environment variables, you can use $(VAR_NAME) in the value of env in the configuration file.
In this exercise, you create a Pod that runs one container. The configuration file for the Pod defines a dependent environment variable in several common usage patterns. Here is the configuration manifest for the Pod:
apiVersion: v1
kind: Pod
metadata:
name: dependent-envars-demo
spec:
containers:
- name: dependent-envars-demo
args:
- while true; do echo -en '\n'; printf UNCHANGED_REFERENCE=$UNCHANGED_REFERENCE'\n'; printf SERVICE_ADDRESS=$SERVICE_ADDRESS'\n';printf ESCAPED_REFERENCE=$ESCAPED_REFERENCE'\n'; sleep 30; done;
command:
- sh
- -c
image: busybox:1.28
env:
- name: SERVICE_PORT
value: "80"
- name: SERVICE_IP
value: "172.17.0.1"
- name: UNCHANGED_REFERENCE
value: "$(PROTOCOL)://$(SERVICE_IP):$(SERVICE_PORT)"
- name: PROTOCOL
value: "https"
- name: SERVICE_ADDRESS
value: "$(PROTOCOL)://$(SERVICE_IP):$(SERVICE_PORT)"
- name: ESCAPED_REFERENCE
value: "$$(PROTOCOL)://$(SERVICE_IP):$(SERVICE_PORT)"
- Create a Pod based on that manifest:
kubectl apply -f https://k8s.io/examples/pods/inject/dependent-envars.yaml
pod/dependent-envars-demo created
- List the running Pods:
kubectl get pods dependent-envars-demo
NAME                    READY   STATUS    RESTARTS   AGE
dependent-envars-demo   1/1     Running   0          9s
- Check the logs for the container running in your Pod:
kubectl logs pod/dependent-envars-demo
UNCHANGED_REFERENCE=$(PROTOCOL)://172.17.0.1:80
SERVICE_ADDRESS=https://172.17.0.1:80
ESCAPED_REFERENCE=$(PROTOCOL)://172.17.0.1:80
As shown above, you have defined a correctly resolved dependency reference (SERVICE_ADDRESS), a bad dependency reference (UNCHANGED_REFERENCE), and an escaped reference that skips dependency expansion (ESCAPED_REFERENCE).
When an environment variable is already defined when being referenced, the reference can be correctly resolved, such as in the SERVICE_ADDRESS case. When the environment variable is undefined or only includes some variables, the undefined environment variable is treated as a normal string, such as UNCHANGED_REFERENCE. Note that incorrectly parsed environment variables, in general, will not block the container from starting.
The $(VAR_NAME) syntax can be escaped with a double $, i.e. $$(VAR_NAME). Escaped references are never expanded, regardless of whether the referenced variable is defined or not. This can be seen from the ESCAPED_REFERENCE case above.
What's next
- Learn more about environment variables.
- See EnvVarSource.
4.6.3 - Define Environment Variables for a Container
This page shows how to define environment variables for a container in a Kubernetes Pod.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
Define an environment variable for a container
When you create a Pod, you can set environment variables for the containers that run in the Pod. To set environment variables, include the env or envFrom field in the configuration file.
In this exercise, you create a Pod that runs one container. The configuration file for the Pod defines an environment variable with name DEMO_GREETING and value "Hello from the environment". Here is the configuration manifest for the Pod:
apiVersion: v1
kind: Pod
metadata:
name: envar-demo
labels:
purpose: demonstrate-envars
spec:
containers:
- name: envar-demo-container
image: gcr.io/google-samples/node-hello:1.0
env:
- name: DEMO_GREETING
value: "Hello from the environment"
- name: DEMO_FAREWELL
value: "Such a sweet sorrow"
- Create a Pod based on that manifest:
kubectl apply -f https://k8s.io/examples/pods/inject/envars.yaml
- List the running Pods:
kubectl get pods -l purpose=demonstrate-envars
The output is similar to:
NAME         READY   STATUS    RESTARTS   AGE
envar-demo   1/1     Running   0          9s
- List the Pod's container environment variables:
kubectl exec envar-demo -- printenv
The output is similar to this:
NODE_VERSION=4.4.2
EXAMPLE_SERVICE_PORT_8080_TCP_ADDR=10.3.245.237
HOSTNAME=envar-demo
...
DEMO_GREETING=Hello from the environment
DEMO_FAREWELL=Such a sweet sorrow
Note: The environment variables set using the env or envFrom field override any environment variables specified in the container image.
Using environment variables inside of your config
Environment variables that you define in a Pod's configuration can be used
elsewhere in the configuration, for example in commands and arguments that
you set for the Pod's containers.
In the example configuration below, the GREETING, HONORIFIC, and NAME environment variables are set to Warm greetings to, The Most Honorable, and Kubernetes, respectively. Those environment variables are then used in the CLI arguments passed to the env-print-demo container.
apiVersion: v1
kind: Pod
metadata:
name: print-greeting
spec:
containers:
- name: env-print-demo
image: bash
env:
- name: GREETING
value: "Warm greetings to"
- name: HONORIFIC
value: "The Most Honorable"
- name: NAME
value: "Kubernetes"
command: ["echo"]
args: ["$(GREETING) $(HONORIFIC) $(NAME)"]
Upon creation, the command echo Warm greetings to The Most Honorable Kubernetes is run in the container.
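To verify, you could view the Pod's logs once the container has completed (a sketch; the output is what the echo command above produces):
kubectl logs print-greeting
Warm greetings to The Most Honorable Kubernetes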
What's next
- Learn more about environment variables.
- Learn about using secrets as environment variables.
- See EnvVarSource.
4.6.4 - Expose Pod Information to Containers Through Environment Variables
This page shows how a Pod can use environment variables to expose information about itself to Containers running in the Pod. Environment variables can expose Pod fields and Container fields.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
To check the version, enter kubectl version.
The Downward API
There are two ways to expose Pod and Container fields to a running Container:
- Environment variables
- Volume Files
Together, these two ways of exposing Pod and Container fields are called the Downward API.
Use Pod fields as values for environment variables
In this exercise, you create a Pod that has one Container. Here is the configuration file for the Pod:
apiVersion: v1
kind: Pod
metadata:
name: dapi-envars-fieldref
spec:
containers:
- name: test-container
image: k8s.gcr.io/busybox
command: [ "sh", "-c"]
args:
- while true; do
echo -en '\n';
printenv MY_NODE_NAME MY_POD_NAME MY_POD_NAMESPACE;
printenv MY_POD_IP MY_POD_SERVICE_ACCOUNT;
sleep 10;
done;
env:
- name: MY_NODE_NAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
- name: MY_POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: MY_POD_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
- name: MY_POD_IP
valueFrom:
fieldRef:
fieldPath: status.podIP
- name: MY_POD_SERVICE_ACCOUNT
valueFrom:
fieldRef:
fieldPath: spec.serviceAccountName
restartPolicy: Never
In the configuration file, you can see five environment variables. The env field is an array of EnvVars. The first element in the array specifies that the MY_NODE_NAME environment variable gets its value from the Pod's spec.nodeName field. Similarly, the other environment variables get their values from Pod fields.
Create the Pod:
kubectl apply -f https://k8s.io/examples/pods/inject/dapi-envars-pod.yaml
Verify that the Container in the Pod is running:
kubectl get pods
View the Container's logs:
kubectl logs dapi-envars-fieldref
The output shows the values of selected environment variables:
minikube
dapi-envars-fieldref
default
172.17.0.4
default
To see why these values are in the log, look at the command and args fields in the configuration file. When the Container starts, it writes the values of five environment variables to stdout. It repeats this every ten seconds.
Next, get a shell into the Container that is running in your Pod:
kubectl exec -it dapi-envars-fieldref -- sh
In your shell, view the environment variables:
/# printenv
The output shows that certain environment variables have been assigned the values of Pod fields:
MY_POD_SERVICE_ACCOUNT=default
...
MY_POD_NAMESPACE=default
MY_POD_IP=172.17.0.4
...
MY_NODE_NAME=minikube
...
MY_POD_NAME=dapi-envars-fieldref
Use Container fields as values for environment variables
In the preceding exercise, you used Pod fields as the values for environment variables. In this next exercise, you use Container fields as the values for environment variables. Here is the configuration file for a Pod that has one container:
apiVersion: v1
kind: Pod
metadata:
name: dapi-envars-resourcefieldref
spec:
containers:
- name: test-container
image: k8s.gcr.io/busybox:1.24
command: [ "sh", "-c"]
args:
- while true; do
echo -en '\n';
printenv MY_CPU_REQUEST MY_CPU_LIMIT;
printenv MY_MEM_REQUEST MY_MEM_LIMIT;
sleep 10;
done;
resources:
requests:
memory: "32Mi"
cpu: "125m"
limits:
memory: "64Mi"
cpu: "250m"
env:
- name: MY_CPU_REQUEST
valueFrom:
resourceFieldRef:
containerName: test-container
resource: requests.cpu
- name: MY_CPU_LIMIT
valueFrom:
resourceFieldRef:
containerName: test-container
resource: limits.cpu
- name: MY_MEM_REQUEST
valueFrom:
resourceFieldRef:
containerName: test-container
resource: requests.memory
- name: MY_MEM_LIMIT
valueFrom:
resourceFieldRef:
containerName: test-container
resource: limits.memory
restartPolicy: Never
In the configuration file, you can see four environment variables. The env field is an array of EnvVars. The first element in the array specifies that the MY_CPU_REQUEST environment variable gets its value from the requests.cpu field of a Container named test-container. Similarly, the other environment variables get their values from Container fields.
Create the Pod:
kubectl apply -f https://k8s.io/examples/pods/inject/dapi-envars-container.yaml
Verify that the Container in the Pod is running:
kubectl get pods
View the Container's logs:
kubectl logs dapi-envars-resourcefieldref
The output shows the values of selected environment variables:
1
1
33554432
67108864
What's next
4.6.5 - Expose Pod Information to Containers Through Files
This page shows how a Pod can use a DownwardAPIVolumeFile to expose information about itself to Containers running in the Pod. A DownwardAPIVolumeFile can expose Pod fields and Container fields.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
To check the version, enter kubectl version.
The Downward API
There are two ways to expose Pod and Container fields to a running Container:
- Environment variables
- Volume files
Together, these two ways of exposing Pod and Container fields are called the "Downward API".
Store Pod fields
In this exercise, you create a Pod that has one Container. Here is the configuration file for the Pod:
apiVersion: v1
kind: Pod
metadata:
name: kubernetes-downwardapi-volume-example
labels:
zone: us-est-coast
cluster: test-cluster1
rack: rack-22
annotations:
build: two
builder: john-doe
spec:
containers:
- name: client-container
image: k8s.gcr.io/busybox
command: ["sh", "-c"]
args:
- while true; do
if [[ -e /etc/podinfo/labels ]]; then
echo -en '\n\n'; cat /etc/podinfo/labels; fi;
if [[ -e /etc/podinfo/annotations ]]; then
echo -en '\n\n'; cat /etc/podinfo/annotations; fi;
sleep 5;
done;
volumeMounts:
- name: podinfo
mountPath: /etc/podinfo
volumes:
- name: podinfo
downwardAPI:
items:
- path: "labels"
fieldRef:
fieldPath: metadata.labels
- path: "annotations"
fieldRef:
fieldPath: metadata.annotations
In the configuration file, you can see that the Pod has a downwardAPI Volume, and the Container mounts the Volume at /etc/podinfo.
Look at the items array under downwardAPI. Each element of the array is a DownwardAPIVolumeFile. The first element specifies that the value of the Pod's metadata.labels field should be stored in a file named labels. The second element specifies that the value of the Pod's annotations field should be stored in a file named annotations.
Create the Pod:
kubectl apply -f https://k8s.io/examples/pods/inject/dapi-volume.yaml
Verify that the container in the Pod is running:
kubectl get pods
View the container's logs:
kubectl logs kubernetes-downwardapi-volume-example
The output shows the contents of the labels file and the annotations file:
cluster="test-cluster1"
rack="rack-22"
zone="us-est-coast"
build="two"
builder="john-doe"
Get a shell into the container that is running in your Pod:
kubectl exec -it kubernetes-downwardapi-volume-example -- sh
In your shell, view the labels file:
/# cat /etc/podinfo/labels
The output shows that all of the Pod's labels have been written to the labels file:
cluster="test-cluster1"
rack="rack-22"
zone="us-est-coast"
Similarly, view the annotations file:
/# cat /etc/podinfo/annotations
View the files in the /etc/podinfo directory:
/# ls -laR /etc/podinfo
In the output, you can see that the labels and annotations files are in a temporary subdirectory: in this example, ..2982_06_02_21_47_53.299460680. In the /etc/podinfo directory, ..data is a symbolic link to the temporary subdirectory. Also in the /etc/podinfo directory, labels and annotations are symbolic links.
drwxr-xr-x ... Feb 6 21:47 ..2982_06_02_21_47_53.299460680
lrwxrwxrwx ... Feb 6 21:47 ..data -> ..2982_06_02_21_47_53.299460680
lrwxrwxrwx ... Feb 6 21:47 annotations -> ..data/annotations
lrwxrwxrwx ... Feb 6 21:47 labels -> ..data/labels
/etc/..2982_06_02_21_47_53.299460680:
total 8
-rw-r--r-- ... Feb 6 21:47 annotations
-rw-r--r-- ... Feb 6 21:47 labels
Using symbolic links enables dynamic atomic refresh of the metadata; updates are written to a new temporary directory, and the ..data symlink is updated atomically using rename(2).
Exit the shell:
/# exit
Store Container fields
In the preceding exercise, you stored Pod fields in a DownwardAPIVolumeFile. In this next exercise, you store Container fields. Here is the configuration file for a Pod that has one Container:
apiVersion: v1
kind: Pod
metadata:
name: kubernetes-downwardapi-volume-example-2
spec:
containers:
- name: client-container
image: k8s.gcr.io/busybox:1.24
command: ["sh", "-c"]
args:
- while true; do
echo -en '\n';
if [[ -e /etc/podinfo/cpu_limit ]]; then
echo -en '\n'; cat /etc/podinfo/cpu_limit; fi;
if [[ -e /etc/podinfo/cpu_request ]]; then
echo -en '\n'; cat /etc/podinfo/cpu_request; fi;
if [[ -e /etc/podinfo/mem_limit ]]; then
echo -en '\n'; cat /etc/podinfo/mem_limit; fi;
if [[ -e /etc/podinfo/mem_request ]]; then
echo -en '\n'; cat /etc/podinfo/mem_request; fi;
sleep 5;
done;
resources:
requests:
memory: "32Mi"
cpu: "125m"
limits:
memory: "64Mi"
cpu: "250m"
volumeMounts:
- name: podinfo
mountPath: /etc/podinfo
volumes:
- name: podinfo
downwardAPI:
items:
- path: "cpu_limit"
resourceFieldRef:
containerName: client-container
resource: limits.cpu
divisor: 1m
- path: "cpu_request"
resourceFieldRef:
containerName: client-container
resource: requests.cpu
divisor: 1m
- path: "mem_limit"
resourceFieldRef:
containerName: client-container
resource: limits.memory
divisor: 1Mi
- path: "mem_request"
resourceFieldRef:
containerName: client-container
resource: requests.memory
divisor: 1Mi
In the configuration file, you can see that the Pod has a downwardAPI volume, and the Container mounts the volume at /etc/podinfo.
Look at the items array under downwardAPI. Each element of the array is a DownwardAPIVolumeFile. The first element specifies that in the Container named client-container, the value of the limits.cpu field in the format specified by 1m should be stored in a file named cpu_limit. The divisor field is optional and has the default value of 1, which means cores for cpu and bytes for memory.
Create the Pod:
kubectl apply -f https://k8s.io/examples/pods/inject/dapi-volume-resources.yaml
Get a shell into the container that is running in your Pod:
kubectl exec -it kubernetes-downwardapi-volume-example-2 -- sh
In your shell, view the cpu_limit file:
/# cat /etc/podinfo/cpu_limit
You can use similar commands to view the cpu_request, mem_limit, and mem_request files.
Capabilities of the Downward API
The following information is available to containers through environment variables and downwardAPI volumes:
- Information available via fieldRef:
  - metadata.name - the pod's name
  - metadata.namespace - the pod's namespace
  - metadata.uid - the pod's UID
  - metadata.labels['<KEY>'] - the value of the pod's label <KEY> (for example, metadata.labels['mylabel'])
  - metadata.annotations['<KEY>'] - the value of the pod's annotation <KEY> (for example, metadata.annotations['myannotation'])
- Information available via resourceFieldRef:
  - A Container's CPU limit
  - A Container's CPU request
  - A Container's memory limit
  - A Container's memory request
  - A Container's hugepages limit (provided that the DownwardAPIHugePages feature gate is enabled)
  - A Container's hugepages request (provided that the DownwardAPIHugePages feature gate is enabled)
  - A Container's ephemeral-storage limit
  - A Container's ephemeral-storage request
In addition, the following information is available through a downwardAPI volume fieldRef:
- metadata.labels - all of the pod's labels, formatted as label-key="escaped-label-value" with one label per line
- metadata.annotations - all of the pod's annotations, formatted as annotation-key="escaped-annotation-value" with one annotation per line
The following information is available through environment variables:
- status.podIP - the pod's IP address
- spec.serviceAccountName - the pod's service account name
- spec.nodeName - the name of the node to which the scheduler always attempts to schedule the pod
- status.hostIP - the IP of the node to which the Pod is assigned
Project keys to specific paths and file permissions
You can project keys to specific paths and specific permissions on a per-file basis. For more information, see Secrets.
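As a brief sketch, a downwardAPI volume item accepts an optional mode field for per-file permissions (the 0400 value here is illustrative):
volumes:
- name: podinfo
  downwardAPI:
    items:
    - path: "labels"
      fieldRef:
        fieldPath: metadata.labels
      mode: 0400 # illustrative: read-only for the file owner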
Motivation for the Downward API
It is sometimes useful for a container to have information about itself, without being overly coupled to Kubernetes. The Downward API allows containers to consume information about themselves or the cluster without using the Kubernetes client or API server.
An example is an existing application that assumes a particular well-known environment variable holds a unique identifier. One possibility is to wrap the application, but that is tedious and error prone, and it violates the goal of low coupling. A better option would be to use the Pod's name as an identifier, and inject the Pod's name into the well-known environment variable.
What's next
- Check the PodSpec API definition, which defines the desired state of a Pod.
- Check the Volume API definition, which defines a generic volume in a Pod for containers to access.
- Check the DownwardAPIVolumeSource API definition, which defines a volume that contains Downward API information.
- Check the DownwardAPIVolumeFile API definition, which contains references to object or resource fields for populating a file in the Downward API volume.
- Check the ResourceFieldSelector API definition, which specifies the container resources and their output format.
4.6.6 - Distribute Credentials Securely Using Secrets
This page shows how to securely inject sensitive data, such as passwords and encryption keys, into Pods.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
Convert your secret data to a base-64 representation
Suppose you want to have two pieces of secret data: a username my-app and a password 39528$vdg7Jb. First, use a base64 encoding tool to convert your username and password to a base64 representation. Here's an example using the commonly available base64 program:
echo -n 'my-app' | base64
echo -n '39528$vdg7Jb' | base64
The output shows that the base-64 representation of your username is bXktYXBw, and the base-64 representation of your password is Mzk1MjgkdmRnN0pi.
Create a Secret
Here is a configuration file you can use to create a Secret that holds your username and password:
apiVersion: v1
kind: Secret
metadata:
name: test-secret
data:
username: bXktYXBw
password: Mzk1MjgkdmRnN0pi
- Create the Secret:
kubectl apply -f https://k8s.io/examples/pods/inject/secret.yaml
- View information about the Secret:
kubectl get secret test-secret
Output:
NAME          TYPE     DATA   AGE
test-secret   Opaque   2      1m
- View more detailed information about the Secret:
kubectl describe secret test-secret
Output:
Name:         test-secret
Namespace:    default
Labels:       <none>
Annotations:  <none>

Type: Opaque

Data
====
password: 13 bytes
username: 7 bytes
Create a Secret directly with kubectl
If you want to skip the Base64 encoding step, you can create the same Secret using the kubectl create secret command. For example:
kubectl create secret generic test-secret --from-literal='username=my-app' --from-literal='password=39528$vdg7Jb'
This is more convenient. The detailed approach shown earlier runs through each step explicitly to demonstrate what is happening.
Create a Pod that has access to the secret data through a Volume
Here is a configuration file you can use to create a Pod:
apiVersion: v1
kind: Pod
metadata:
name: secret-test-pod
spec:
containers:
- name: test-container
image: nginx
volumeMounts:
# name must match the volume name below
- name: secret-volume
mountPath: /etc/secret-volume
# The secret data is exposed to Containers in the Pod through a Volume.
volumes:
- name: secret-volume
secret:
secretName: test-secret
- Create the Pod:
kubectl apply -f https://k8s.io/examples/pods/inject/secret-pod.yaml
- Verify that your Pod is running:
kubectl get pod secret-test-pod
Output:
NAME              READY   STATUS    RESTARTS   AGE
secret-test-pod   1/1     Running   0          42m
- Get a shell into the Container that is running in your Pod:
kubectl exec -i -t secret-test-pod -- /bin/bash
- The secret data is exposed to the Container through a Volume mounted under /etc/secret-volume. In your shell, list the files in the /etc/secret-volume directory:
# Run this in the shell inside the container
ls /etc/secret-volume
The output shows two files, one for each piece of secret data:
password username
- In your shell, display the contents of the username and password files:
# Run this in the shell inside the container
echo "$( cat /etc/secret-volume/username )"
echo "$( cat /etc/secret-volume/password )"
The output is your username and password:
my-app
39528$vdg7Jb
Define container environment variables using Secret data
Define a container environment variable with data from a single Secret
- Define an environment variable as a key-value pair in a Secret:
kubectl create secret generic backend-user --from-literal=backend-username='backend-admin'
- Assign the backend-username value defined in the Secret to the SECRET_USERNAME environment variable in the Pod specification:
apiVersion: v1
kind: Pod
metadata:
  name: env-single-secret
spec:
  containers:
  - name: envars-test-container
    image: nginx
    env:
    - name: SECRET_USERNAME
      valueFrom:
        secretKeyRef:
          name: backend-user
          key: backend-username
- Create the Pod:
kubectl create -f https://k8s.io/examples/pods/inject/pod-single-secret-env-variable.yaml
- In your shell, display the content of the SECRET_USERNAME container environment variable:
kubectl exec -i -t env-single-secret -- /bin/sh -c 'echo $SECRET_USERNAME'
The output is:
backend-admin
Define container environment variables with data from multiple Secrets
- As with the previous example, create the Secrets first:
kubectl create secret generic backend-user --from-literal=backend-username='backend-admin'
kubectl create secret generic db-user --from-literal=db-username='db-admin'
- Define the environment variables in the Pod specification:
apiVersion: v1
kind: Pod
metadata:
  name: envvars-multiple-secrets
spec:
  containers:
  - name: envars-test-container
    image: nginx
    env:
    - name: BACKEND_USERNAME
      valueFrom:
        secretKeyRef:
          name: backend-user
          key: backend-username
    - name: DB_USERNAME
      valueFrom:
        secretKeyRef:
          name: db-user
          key: db-username
- Create the Pod:
kubectl create -f https://k8s.io/examples/pods/inject/pod-multiple-secret-env-variable.yaml
- In your shell, display the container environment variables:
kubectl exec -i -t envvars-multiple-secrets -- /bin/sh -c 'env | grep _USERNAME'
The output is:
DB_USERNAME=db-admin
BACKEND_USERNAME=backend-admin
Configure all key-value pairs in a Secret as container environment variables
- Create a Secret containing multiple key-value pairs:
kubectl create secret generic test-secret --from-literal=username='my-app' --from-literal=password='39528$vdg7Jb'
- Use envFrom to define all of the Secret's data as container environment variables. The key from the Secret becomes the environment variable name in the Pod.
apiVersion: v1
kind: Pod
metadata:
  name: envfrom-secret
spec:
  containers:
  - name: envars-test-container
    image: nginx
    envFrom:
    - secretRef:
        name: test-secret
- Create the Pod:
kubectl create -f https://k8s.io/examples/pods/inject/pod-secret-envFrom.yaml
- In your shell, display the username and password container environment variables:
kubectl exec -i -t envfrom-secret -- /bin/sh -c 'echo "username: $username\npassword: $password\n"'
The output is:
username: my-app
password: 39528$vdg7Jb
References
What's next
4.7 - Run Applications
4.7.1 - Run a Stateless Application Using a Deployment
This page shows how to run an application using a Kubernetes Deployment object.
Objectives
- Create an nginx deployment.
- Use kubectl to list information about the deployment.
- Update the deployment.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
Your Kubernetes server must be at or later than version v1.9. To check the version, enter kubectl version.
Creating and exploring an nginx deployment
You can run an application by creating a Kubernetes Deployment object, and you can describe a Deployment in a YAML file. For example, this YAML file describes a Deployment that runs the nginx:1.14.2 Docker image:
apiVersion: apps/v1
kind: Deployment
metadata:
name: nginx-deployment
spec:
selector:
matchLabels:
app: nginx
replicas: 2 # tells deployment to run 2 pods matching the template
template:
metadata:
labels:
app: nginx
spec:
containers:
- name: nginx
image: nginx:1.14.2
ports:
- containerPort: 80
- Create a Deployment based on the YAML file:
kubectl apply -f https://k8s.io/examples/application/deployment.yaml
- Display information about the Deployment:
kubectl describe deployment nginx-deployment
The output is similar to this:
Name:                   nginx-deployment
Namespace:              default
CreationTimestamp:      Tue, 30 Aug 2016 18:11:37 -0700
Labels:                 app=nginx
Annotations:            deployment.kubernetes.io/revision=1
Selector:               app=nginx
Replicas:               2 desired | 2 updated | 2 total | 2 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  1 max unavailable, 1 max surge
Pod Template:
  Labels:  app=nginx
  Containers:
   nginx:
    Image:        nginx:1.14.2
    Port:         80/TCP
    Environment:  <none>
    Mounts:       <none>
  Volumes:        <none>
Conditions:
  Type           Status   Reason
  ----           ------   ------
  Available      True     MinimumReplicasAvailable
  Progressing    True     NewReplicaSetAvailable
OldReplicaSets:  <none>
NewReplicaSet:   nginx-deployment-1771418926 (2/2 replicas created)
No events.
- List the Pods created by the deployment:
kubectl get pods -l app=nginx
The output is similar to this:
NAME                                READY   STATUS    RESTARTS   AGE
nginx-deployment-1771418926-7o5ns   1/1     Running   0          16h
nginx-deployment-1771418926-r18az   1/1     Running   0          16h
- Display information about a Pod:
kubectl describe pod <pod-name>
where <pod-name> is the name of one of your Pods.
Updating the deployment
You can update the deployment by applying a new YAML file. This YAML file specifies that the deployment should be updated to use nginx 1.16.1.
apiVersion: apps/v1
kind: Deployment
metadata:
name: nginx-deployment
spec:
selector:
matchLabels:
app: nginx
replicas: 2
template:
metadata:
labels:
app: nginx
spec:
containers:
- name: nginx
image: nginx:1.16.1 # Update the version of nginx from 1.14.2 to 1.16.1
ports:
- containerPort: 80
- Apply the new YAML file:
kubectl apply -f https://k8s.io/examples/application/deployment-update.yaml
- Watch the deployment create pods with new names and delete the old pods:
kubectl get pods -l app=nginx
Scaling the application by increasing the replica count
You can increase the number of Pods in your Deployment by applying a new YAML file. This YAML file sets replicas to 4, which specifies that the Deployment should have four Pods:
apiVersion: apps/v1
kind: Deployment
metadata:
name: nginx-deployment
spec:
selector:
matchLabels:
app: nginx
replicas: 4 # Update the replicas from 2 to 4
template:
metadata:
labels:
app: nginx
spec:
containers:
- name: nginx
image: nginx:1.14.2
ports:
- containerPort: 80
- Apply the new YAML file:
kubectl apply -f https://k8s.io/examples/application/deployment-scale.yaml
- Verify that the Deployment has four Pods:
kubectl get pods -l app=nginx
The output is similar to this:
NAME                               READY   STATUS    RESTARTS   AGE
nginx-deployment-148880595-4zdqq   1/1     Running   0          25s
nginx-deployment-148880595-6zgi1   1/1     Running   0          25s
nginx-deployment-148880595-fxcez   1/1     Running   0          2m
nginx-deployment-148880595-rwovn   1/1     Running   0          2m
Deleting a deployment
Delete the deployment by name:
kubectl delete deployment nginx-deployment
ReplicationControllers -- the Old Way
The preferred way to create a replicated application is to use a Deployment, which in turn uses a ReplicaSet. Before the Deployment and ReplicaSet were added to Kubernetes, replicated applications were configured using a ReplicationController.
What's next
- Learn more about Deployment objects.
4.7.2 - Run a Single-Instance Stateful Application
This page shows you how to run a single-instance stateful application in Kubernetes using a PersistentVolume and a Deployment. The application is MySQL.
Objectives
- Create a PersistentVolume referencing a disk in your environment.
- Create a MySQL Deployment.
- Expose MySQL to other pods in the cluster at a known DNS name.
Before you begin
- You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
To check the version, enter kubectl version.
- You need to either have a dynamic PersistentVolume provisioner with a default StorageClass, or statically provision PersistentVolumes yourself to satisfy the PersistentVolumeClaims used here.
Deploy MySQL
You can run a stateful application by creating a Kubernetes Deployment and connecting it to an existing PersistentVolume using a PersistentVolumeClaim. For example, this YAML file describes a Deployment that runs MySQL and references the PersistentVolumeClaim. The file defines a volume mount for /var/lib/mysql, and then creates a PersistentVolumeClaim that looks for a 20G volume. This claim is satisfied by any existing volume that meets the requirements, or by a dynamic provisioner.
Note: The password is defined in the config yaml, and this is insecure. See Kubernetes Secrets for a secure solution.
apiVersion: v1
kind: Service
metadata:
name: mysql
spec:
ports:
- port: 3306
selector:
app: mysql
clusterIP: None
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: mysql
spec:
selector:
matchLabels:
app: mysql
strategy:
type: Recreate
template:
metadata:
labels:
app: mysql
spec:
containers:
- image: mysql:5.6
name: mysql
env:
# Use secret in real usage
- name: MYSQL_ROOT_PASSWORD
value: password
ports:
- containerPort: 3306
name: mysql
volumeMounts:
- name: mysql-persistent-storage
mountPath: /var/lib/mysql
volumes:
- name: mysql-persistent-storage
persistentVolumeClaim:
claimName: mysql-pv-claim
The following YAML file describes the PersistentVolume and the PersistentVolumeClaim:
apiVersion: v1
kind: PersistentVolume
metadata:
name: mysql-pv-volume
labels:
type: local
spec:
storageClassName: manual
capacity:
storage: 20Gi
accessModes:
- ReadWriteOnce
hostPath:
path: "/mnt/data"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: mysql-pv-claim
spec:
storageClassName: manual
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 20Gi
- Deploy the PV and PVC of the YAML file:
kubectl apply -f https://k8s.io/examples/application/mysql/mysql-pv.yaml
- Deploy the contents of the YAML file:
kubectl apply -f https://k8s.io/examples/application/mysql/mysql-deployment.yaml
- Display information about the Deployment:
kubectl describe deployment mysql
The output is similar to this:
Name:               mysql
Namespace:          default
CreationTimestamp:  Tue, 01 Nov 2016 11:18:45 -0700
Labels:             app=mysql
Annotations:        deployment.kubernetes.io/revision=1
Selector:           app=mysql
Replicas:           1 desired | 1 updated | 1 total | 0 available | 1 unavailable
StrategyType:       Recreate
MinReadySeconds:    0
Pod Template:
  Labels:  app=mysql
  Containers:
   mysql:
    Image:  mysql:5.6
    Port:   3306/TCP
    Environment:
      MYSQL_ROOT_PASSWORD:  password
    Mounts:
      /var/lib/mysql from mysql-persistent-storage (rw)
  Volumes:
   mysql-persistent-storage:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  mysql-pv-claim
    ReadOnly:   false
Conditions:
  Type          Status  Reason
  ----          ------  ------
  Available     False   MinimumReplicasUnavailable
  Progressing   True    ReplicaSetUpdated
OldReplicaSets:  <none>
NewReplicaSet:   mysql-63082529 (1/1 replicas created)
Events:
  FirstSeen LastSeen Count From                   SubobjectPath Type    Reason            Message
  --------- -------- ----- ----                   ------------- ------  ------            -------
  33s       33s      1     {deployment-controller }             Normal  ScalingReplicaSet Scaled up replica set mysql-63082529 to 1
- List the pods created by the Deployment:
kubectl get pods -l app=mysql
The output is similar to this:
NAME                   READY   STATUS    RESTARTS   AGE
mysql-63082529-2z3ki   1/1     Running   0          3m
- Inspect the PersistentVolumeClaim:
kubectl describe pvc mysql-pv-claim
The output is similar to this:
Name:          mysql-pv-claim
Namespace:     default
StorageClass:
Status:        Bound
Volume:        mysql-pv-volume
Labels:        <none>
Annotations:   pv.kubernetes.io/bind-completed=yes
               pv.kubernetes.io/bound-by-controller=yes
Capacity:      20Gi
Access Modes:  RWO
Events:        <none>
Accessing the MySQL instance
The preceding YAML file creates a service that allows other Pods in the cluster to access the database. The Service option clusterIP: None lets the Service DNS name resolve directly to the Pod's IP address. This is optimal when you have only one Pod behind a Service and you don't intend to increase the number of Pods.
Run a MySQL client to connect to the server:
kubectl run -it --rm --image=mysql:5.6 --restart=Never mysql-client -- mysql -h mysql -ppassword
This command creates a new Pod in the cluster running a MySQL client and connects it to the server through the Service. If it connects, you know your stateful MySQL database is up and running.
Waiting for pod default/mysql-client-274442439-zyp6i to be running, status is Pending, pod ready: false
If you don't see a command prompt, try pressing enter.
mysql>
Updating
The image or any other part of the Deployment can be updated as usual with the kubectl apply command. Here are some precautions that are specific to stateful apps:
- Don't scale the app. This setup is for single-instance apps only. The underlying PersistentVolume can only be mounted to one Pod. For clustered stateful apps, see the StatefulSet documentation.
- Use strategy: type: Recreate in the Deployment configuration YAML file. This instructs Kubernetes to not use rolling updates. Rolling updates will not work, as you cannot have more than one Pod running at a time. The Recreate strategy will stop the first pod before creating a new one with the updated configuration.
Deleting a deployment
Delete the deployed objects by name:
kubectl delete deployment,svc mysql
kubectl delete pvc mysql-pv-claim
kubectl delete pv mysql-pv-volume
If you manually provisioned a PersistentVolume, you also need to manually delete it, as well as release the underlying resource. If you used a dynamic provisioner, it automatically deletes the PersistentVolume when it sees that you deleted the PersistentVolumeClaim. Some dynamic provisioners (such as those for EBS and PD) also release the underlying resource upon deleting the PersistentVolume.
What's next
- Learn more about Deployment objects.
- Learn more about Deploying applications.
4.7.3 - Run a Replicated Stateful Application
This page shows how to run a replicated stateful application using a StatefulSet controller. This application is a replicated MySQL database. The example topology has a single primary server and multiple replicas, using asynchronous row-based replication.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
To check the version, enter kubectl version.
You need to either have a dynamic PersistentVolume provisioner with a default StorageClass, or statically provision PersistentVolumes yourself to satisfy the PersistentVolumeClaims used here.
- This tutorial assumes you are familiar with PersistentVolumes and StatefulSets, as well as other core concepts like Pods, Services, and ConfigMaps.
- Some familiarity with MySQL helps, but this tutorial aims to present general patterns that should be useful for other systems.
- You are using the default namespace or another namespace that does not contain any conflicting objects.
Objectives
- Deploy a replicated MySQL topology with a StatefulSet controller.
- Send MySQL client traffic.
- Observe resistance to downtime.
- Scale the StatefulSet up and down.
Deploy MySQL
The example MySQL deployment consists of a ConfigMap, two Services, and a StatefulSet.
ConfigMap
Create the ConfigMap from the following YAML configuration file:
apiVersion: v1
kind: ConfigMap
metadata:
name: mysql
labels:
app: mysql
data:
primary.cnf: |
# Apply this config only on the primary.
[mysqld]
log-bin
replica.cnf: |
# Apply this config only on replicas.
[mysqld]
super-read-only
kubectl apply -f https://k8s.io/examples/application/mysql/mysql-configmap.yaml
This ConfigMap provides my.cnf overrides that let you independently control configuration on the primary MySQL server and replicas.
In this case, you want the primary server to be able to serve replication logs to replicas
and you want replicas to reject any writes that don't come via replication.
There's nothing special about the ConfigMap itself that causes different portions to apply to different Pods. Each Pod decides which portion to look at as it's initializing, based on information provided by the StatefulSet controller.
Services
Create the Services from the following YAML configuration file:
# Headless service for stable DNS entries of StatefulSet members.
apiVersion: v1
kind: Service
metadata:
name: mysql
labels:
app: mysql
spec:
ports:
- name: mysql
port: 3306
clusterIP: None
selector:
app: mysql
---
# Client service for connecting to any MySQL instance for reads.
# For writes, you must instead connect to the primary: mysql-0.mysql.
apiVersion: v1
kind: Service
metadata:
name: mysql-read
labels:
app: mysql
spec:
ports:
- name: mysql
port: 3306
selector:
app: mysql
kubectl apply -f https://k8s.io/examples/application/mysql/mysql-services.yaml
The Headless Service provides a home for the DNS entries that the StatefulSet controller creates for each Pod that's part of the set. Because the Headless Service is named mysql, the Pods are accessible by resolving <pod-name>.mysql from within any other Pod in the same Kubernetes cluster and namespace.
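As an aside, once the StatefulSet below is running, you can check these DNS entries from a throwaway Pod (a sketch; nslookup output varies by busybox version):
kubectl run dns-test -it --rm --image=busybox:1.28 --restart=Never -- nslookup mysql-0.mysql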
The Client Service, called mysql-read, is a normal Service with its own cluster IP that distributes connections across all MySQL Pods that report being Ready. The set of potential endpoints includes the primary MySQL server and all replicas.
Note that only read queries can use the load-balanced Client Service. Because there is only one primary MySQL server, clients should connect directly to the primary MySQL Pod (through its DNS entry within the Headless Service) to execute writes.
StatefulSet
Finally, create the StatefulSet from the following YAML configuration file:
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: mysql
spec:
selector:
matchLabels:
app: mysql
serviceName: mysql
replicas: 3
template:
metadata:
labels:
app: mysql
spec:
initContainers:
- name: init-mysql
image: mysql:5.7
command:
- bash
- "-c"
- |
set -ex
# Generate mysql server-id from pod ordinal index.
[[ `hostname` =~ -([0-9]+)$ ]] || exit 1
ordinal=${BASH_REMATCH[1]}
echo [mysqld] > /mnt/conf.d/server-id.cnf
# Add an offset to avoid reserved server-id=0 value.
echo server-id=$((100 + $ordinal)) >> /mnt/conf.d/server-id.cnf
# Copy appropriate conf.d files from config-map to emptyDir.
if [[ $ordinal -eq 0 ]]; then
cp /mnt/config-map/primary.cnf /mnt/conf.d/
else
cp /mnt/config-map/replica.cnf /mnt/conf.d/
fi
volumeMounts:
- name: conf
mountPath: /mnt/conf.d
- name: config-map
mountPath: /mnt/config-map
- name: clone-mysql
image: gcr.io/google-samples/xtrabackup:1.0
command:
- bash
- "-c"
- |
set -ex
# Skip the clone if data already exists.
[[ -d /var/lib/mysql/mysql ]] && exit 0
# Skip the clone on primary (ordinal index 0).
[[ `hostname` =~ -([0-9]+)$ ]] || exit 1
ordinal=${BASH_REMATCH[1]}
[[ $ordinal -eq 0 ]] && exit 0
# Clone data from previous peer.
ncat --recv-only mysql-$(($ordinal-1)).mysql 3307 | xbstream -x -C /var/lib/mysql
# Prepare the backup.
xtrabackup --prepare --target-dir=/var/lib/mysql
volumeMounts:
- name: data
mountPath: /var/lib/mysql
subPath: mysql
- name: conf
mountPath: /etc/mysql/conf.d
containers:
- name: mysql
image: mysql:5.7
env:
- name: MYSQL_ALLOW_EMPTY_PASSWORD
value: "1"
ports:
- name: mysql
containerPort: 3306
volumeMounts:
- name: data
mountPath: /var/lib/mysql
subPath: mysql
- name: conf
mountPath: /etc/mysql/conf.d
resources:
requests:
cpu: 500m
memory: 1Gi
livenessProbe:
exec:
command: ["mysqladmin", "ping"]
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
readinessProbe:
exec:
# Check we can execute queries over TCP (skip-networking is off).
command: ["mysql", "-h", "127.0.0.1", "-e", "SELECT 1"]
initialDelaySeconds: 5
periodSeconds: 2
timeoutSeconds: 1
- name: xtrabackup
image: gcr.io/google-samples/xtrabackup:1.0
ports:
- name: xtrabackup
containerPort: 3307
command:
- bash
- "-c"
- |
set -ex
cd /var/lib/mysql
# Determine binlog position of cloned data, if any.
if [[ -f xtrabackup_slave_info && "x$(<xtrabackup_slave_info)" != "x" ]]; then
# XtraBackup already generated a partial "CHANGE MASTER TO" query
# because we're cloning from an existing replica. (Need to remove the trailing semicolon!)
cat xtrabackup_slave_info | sed -E 's/;$//g' > change_master_to.sql.in
# Ignore xtrabackup_binlog_info in this case (it's useless).
rm -f xtrabackup_slave_info xtrabackup_binlog_info
elif [[ -f xtrabackup_binlog_info ]]; then
# We're cloning directly from primary. Parse binlog position.
[[ `cat xtrabackup_binlog_info` =~ ^(.*?)[[:space:]]+(.*?)$ ]] || exit 1
rm -f xtrabackup_binlog_info xtrabackup_slave_info
echo "CHANGE MASTER TO MASTER_LOG_FILE='${BASH_REMATCH[1]}',\
MASTER_LOG_POS=${BASH_REMATCH[2]}" > change_master_to.sql.in
fi
# Check if we need to complete a clone by starting replication.
if [[ -f change_master_to.sql.in ]]; then
echo "Waiting for mysqld to be ready (accepting connections)"
until mysql -h 127.0.0.1 -e "SELECT 1"; do sleep 1; done
echo "Initializing replication from clone position"
mysql -h 127.0.0.1 \
-e "$(<change_master_to.sql.in), \
MASTER_HOST='mysql-0.mysql', \
MASTER_USER='root', \
MASTER_PASSWORD='', \
MASTER_CONNECT_RETRY=10; \
START SLAVE;" || exit 1
# In case of container restart, attempt this at-most-once.
mv change_master_to.sql.in change_master_to.sql.orig
fi
# Start a server to send backups when requested by peers.
exec ncat --listen --keep-open --send-only --max-conns=1 3307 -c \
"xtrabackup --backup --slave-info --stream=xbstream --host=127.0.0.1 --user=root"
volumeMounts:
- name: data
mountPath: /var/lib/mysql
subPath: mysql
- name: conf
mountPath: /etc/mysql/conf.d
resources:
requests:
cpu: 100m
memory: 100Mi
volumes:
- name: conf
emptyDir: {}
- name: config-map
configMap:
name: mysql
volumeClaimTemplates:
- metadata:
name: data
spec:
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 10Gi
kubectl apply -f https://k8s.io/examples/application/mysql/mysql-statefulset.yaml
You can watch the startup progress by running:
kubectl get pods -l app=mysql --watch
After a while, you should see all 3 Pods become Running:
NAME READY STATUS RESTARTS AGE
mysql-0 2/2 Running 0 2m
mysql-1 2/2 Running 0 1m
mysql-2 2/2 Running 0 1m
Press Ctrl+C to cancel the watch. If you don't see any progress, make sure you have a dynamic PersistentVolume provisioner enabled as mentioned in the prerequisites.
This manifest uses a variety of techniques for managing stateful Pods as part of a StatefulSet. The next section highlights some of these techniques to explain what happens as the StatefulSet creates Pods.
Understanding stateful Pod initialization
The StatefulSet controller starts Pods one at a time, in order by their ordinal index. It waits until each Pod reports being Ready before starting the next one.
In addition, the controller assigns each Pod a unique, stable name of the form <statefulset-name>-<ordinal-index>, which results in Pods named mysql-0, mysql-1, and mysql-2.
The Pod template in the above StatefulSet manifest takes advantage of these properties to perform orderly startup of MySQL replication.
Generating configuration
Before starting any of the containers in the Pod spec, the Pod first runs any Init Containers in the order defined.
The first Init Container, named init-mysql, generates special MySQL config files based on the ordinal index. The script determines its own ordinal index by extracting it from the end of the Pod name, which is returned by the hostname command. Then it saves the ordinal (with a numeric offset to avoid reserved values) into a file called server-id.cnf in the MySQL conf.d directory. This translates the unique, stable identity provided by the StatefulSet controller into the domain of MySQL server IDs, which require the same properties.
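For example, for the Pod named mysql-1 (ordinal 1), the script above would generate a /mnt/conf.d/server-id.cnf containing:
[mysqld]
server-id=101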
The script in the init-mysql container also applies either primary.cnf or replica.cnf from the ConfigMap by copying the contents into conf.d. Because the example topology consists of a single primary MySQL server and any number of replicas, the script assigns ordinal 0 to be the primary server, and everyone else to be replicas.
Combined with the StatefulSet controller's deployment order guarantee, this ensures the primary MySQL server is Ready before creating replicas, so they can begin replicating.
Cloning existing data
In general, when a new Pod joins the set as a replica, it must assume the primary MySQL server might already have data on it. It also must assume that the replication logs might not go all the way back to the beginning of time. These conservative assumptions are the key to allow a running StatefulSet to scale up and down over time, rather than being fixed at its initial size.
The second Init Container, named clone-mysql, performs a clone operation on a replica Pod the first time it starts up on an empty PersistentVolume. That means it copies all existing data from another running Pod, so its local state is consistent enough to begin replicating from the primary server. MySQL itself does not provide a mechanism to do this, so the example uses a popular open-source tool called Percona XtraBackup.
During the clone, the source MySQL server might suffer reduced performance. To minimize impact on the primary MySQL server, the script instructs each Pod to clone from the Pod whose ordinal index is one lower. This works because the StatefulSet controller always ensures Pod N is Ready before starting Pod N+1.
Starting replication
After the Init Containers complete successfully, the regular containers run. The MySQL Pods consist of a mysql container that runs the actual mysqld server, and an xtrabackup container that acts as a sidecar.
The xtrabackup sidecar looks at the cloned data files and determines if it's necessary to initialize MySQL replication on the replica. If so, it waits for mysqld to be ready and then executes the CHANGE MASTER TO and START SLAVE commands with replication parameters extracted from the XtraBackup clone files.
Once a replica begins replication, it remembers its primary MySQL server and reconnects automatically if the server restarts or the connection dies. Also, because replicas look for the primary server at its stable DNS name (mysql-0.mysql), they automatically find the primary server even if it gets a new Pod IP due to being rescheduled.
Lastly, after starting replication, the xtrabackup container listens for connections from other Pods requesting a data clone. This server remains up indefinitely in case the StatefulSet scales up, or in case the next Pod loses its PersistentVolumeClaim and needs to redo the clone.
Sending client traffic
You can send test queries to the primary MySQL server (hostname mysql-0.mysql) by running a temporary container with the mysql:5.7 image and running the mysql client binary.
kubectl run mysql-client --image=mysql:5.7 -i --rm --restart=Never --\
mysql -h mysql-0.mysql <<EOF
CREATE DATABASE test;
CREATE TABLE test.messages (message VARCHAR(250));
INSERT INTO test.messages VALUES ('hello');
EOF
Use the hostname mysql-read to send test queries to any server that reports being Ready:
kubectl run mysql-client --image=mysql:5.7 -i -t --rm --restart=Never --\
mysql -h mysql-read -e "SELECT * FROM test.messages"
You should get output like this:
Waiting for pod default/mysql-client to be running, status is Pending, pod ready: false
+---------+
| message |
+---------+
| hello |
+---------+
pod "mysql-client" deleted
To demonstrate that the mysql-read Service distributes connections across servers, you can run SELECT @@server_id in a loop:
kubectl run mysql-client-loop --image=mysql:5.7 -i -t --rm --restart=Never --\
bash -ic "while sleep 1; do mysql -h mysql-read -e 'SELECT @@server_id,NOW()'; done"
You should see the reported @@server_id change randomly, because a different endpoint might be selected upon each connection attempt:
+-------------+---------------------+
| @@server_id | NOW() |
+-------------+---------------------+
| 100 | 2006-01-02 15:04:05 |
+-------------+---------------------+
+-------------+---------------------+
| @@server_id | NOW() |
+-------------+---------------------+
| 102 | 2006-01-02 15:04:06 |
+-------------+---------------------+
+-------------+---------------------+
| @@server_id | NOW() |
+-------------+---------------------+
| 101 | 2006-01-02 15:04:07 |
+-------------+---------------------+
You can press Ctrl+C when you want to stop the loop, but it's useful to keep it running in another window so you can see the effects of the following steps.
Simulating Pod and Node downtime
To demonstrate the increased availability of reading from the pool of replicas instead of a single server, keep the SELECT @@server_id loop from above running while you force a Pod out of the Ready state.
Break the Readiness Probe
The readiness probe for the mysql container runs the command mysql -h 127.0.0.1 -e 'SELECT 1' to make sure the server is up and able to execute queries.
One way to force this readiness probe to fail is to break that command:
kubectl exec mysql-2 -c mysql -- mv /usr/bin/mysql /usr/bin/mysql.off
This reaches into the actual container's filesystem for Pod mysql-2 and renames the mysql command so the readiness probe can't find it.
After a few seconds, the Pod should report one of its containers as not Ready,
which you can check by running:
kubectl get pod mysql-2
Look for 1/2 in the READY column:
NAME READY STATUS RESTARTS AGE
mysql-2 1/2 Running 0 3m
At this point, you should see your SELECT @@server_id loop continue to run, although it never reports 102 anymore. Recall that the init-mysql script defined server-id as 100 + $ordinal, so server ID 102 corresponds to Pod mysql-2.
Now repair the Pod and it should reappear in the loop output after a few seconds:
kubectl exec mysql-2 -c mysql -- mv /usr/bin/mysql.off /usr/bin/mysql
Delete Pods
The StatefulSet also recreates Pods if they're deleted, similar to what a ReplicaSet does for stateless Pods.
kubectl delete pod mysql-2
The StatefulSet controller notices that no mysql-2 Pod exists anymore, and creates a new one with the same name and linked to the same PersistentVolumeClaim.
You should see server ID 102
disappear from the loop output for a while
and then return on its own.
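If you want to confirm that the replacement Pod really is bound to the same storage, you can list the PersistentVolumeClaims the Pod mounts; a minimal sketch (for this tutorial, the claim should be data-mysql-2):
# Print the claim name of each PersistentVolumeClaim-backed volume in the Pod
kubectl get pod mysql-2 -o jsonpath='{.spec.volumes[*].persistentVolumeClaim.claimName}'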
Drain a Node
If your Kubernetes cluster has multiple Nodes, you can simulate Node downtime (such as when Nodes are upgraded) by issuing a drain.
First determine which Node one of the MySQL Pods is on:
kubectl get pod mysql-2 -o wide
The Node name should show up in the last column:
NAME READY STATUS RESTARTS AGE IP NODE
mysql-2 2/2 Running 0 15m 10.244.5.27 kubernetes-node-9l2t
Then drain the Node by running the following command, which cordons it so
no new Pods may schedule there, and then evicts any existing Pods.
Replace <node-name> with the name of the Node you found in the last step.
This might impact other applications on the Node, so it's best to only do this in a test cluster.
kubectl drain <node-name> --force --delete-emptydir-data --ignore-daemonsets
Now you can watch as the Pod reschedules on a different Node:
kubectl get pod mysql-2 -o wide --watch
It should look something like this:
NAME READY STATUS RESTARTS AGE IP NODE
mysql-2 2/2 Terminating 0 15m 10.244.1.56 kubernetes-node-9l2t
[...]
mysql-2 0/2 Pending 0 0s <none> kubernetes-node-fjlm
mysql-2 0/2 Init:0/2 0 0s <none> kubernetes-node-fjlm
mysql-2 0/2 Init:1/2 0 20s 10.244.5.32 kubernetes-node-fjlm
mysql-2 0/2 PodInitializing 0 21s 10.244.5.32 kubernetes-node-fjlm
mysql-2 1/2 Running 0 22s 10.244.5.32 kubernetes-node-fjlm
mysql-2 2/2 Running 0 30s 10.244.5.32 kubernetes-node-fjlm
And again, you should see server ID 102 disappear from the SELECT @@server_id loop output for a while and then return.
Now uncordon the Node to return it to a normal state:
kubectl uncordon <node-name>
Scaling the number of replicas
With MySQL replication, you can scale your read query capacity by adding replicas. With StatefulSet, you can do this with a single command:
kubectl scale statefulset mysql --replicas=5
Watch the new Pods come up by running:
kubectl get pods -l app=mysql --watch
Once they're up, you should see server IDs 103 and 104 start appearing in the SELECT @@server_id loop output.
You can also verify that these new servers have the data you added before they existed:
kubectl run mysql-client --image=mysql:5.7 -i -t --rm --restart=Never --\
mysql -h mysql-3.mysql -e "SELECT * FROM test.messages"
Waiting for pod default/mysql-client to be running, status is Pending, pod ready: false
+---------+
| message |
+---------+
| hello |
+---------+
pod "mysql-client" deleted
Scaling back down is also seamless:
kubectl scale statefulset mysql --replicas=3
Note, however, that while scaling up creates new PersistentVolumeClaims automatically, scaling down does not automatically delete these PVCs. This gives you the choice to keep those initialized PVCs around to make scaling back up quicker, or to extract data before deleting them.
You can see this by running:
kubectl get pvc -l app=mysql
The output shows that all 5 PVCs still exist, despite having scaled the StatefulSet down to 3:
NAME STATUS VOLUME CAPACITY ACCESSMODES AGE
data-mysql-0 Bound pvc-8acbf5dc-b103-11e6-93fa-42010a800002 10Gi RWO 20m
data-mysql-1 Bound pvc-8ad39820-b103-11e6-93fa-42010a800002 10Gi RWO 20m
data-mysql-2 Bound pvc-8ad69a6d-b103-11e6-93fa-42010a800002 10Gi RWO 20m
data-mysql-3 Bound pvc-50043c45-b1c5-11e6-93fa-42010a800002 10Gi RWO 2m
data-mysql-4 Bound pvc-500a9957-b1c5-11e6-93fa-42010a800002 10Gi RWO 2m
If you don't intend to reuse the extra PVCs, you can delete them:
kubectl delete pvc data-mysql-3
kubectl delete pvc data-mysql-4
Cleaning up
- Cancel the SELECT @@server_id loop by pressing Ctrl+C in its terminal, or by running the following from another terminal:
kubectl delete pod mysql-client-loop --now
- Delete the StatefulSet. This also begins terminating the Pods.
kubectl delete statefulset mysql
- Verify that the Pods disappear. They might take some time to finish terminating.
kubectl get pods -l app=mysql
You'll know the Pods have terminated when the above returns:
No resources found.
- Delete the ConfigMap, Services, and PersistentVolumeClaims.
kubectl delete configmap,service,pvc -l app=mysql
- If you manually provisioned PersistentVolumes, you also need to manually delete them, as well as release the underlying resources. If you used a dynamic provisioner, it automatically deletes the PersistentVolumes when it sees that you deleted the PersistentVolumeClaims. Some dynamic provisioners (such as those for EBS and PD) also release the underlying resources upon deleting the PersistentVolumes.
What's next
- Learn more about scaling a StatefulSet.
- Learn more about debugging a StatefulSet.
- Learn more about deleting a StatefulSet.
- Learn more about force deleting StatefulSet Pods.
- Look in the Helm Charts repository for other stateful application examples.
4.7.4 - Scale a StatefulSet
This task shows how to scale a StatefulSet. Scaling a StatefulSet refers to increasing or decreasing the number of replicas.
Before you begin
- StatefulSets are only available in Kubernetes version 1.5 or later. To check your version of Kubernetes, run kubectl version.
- Not all stateful applications scale nicely. If you are unsure about whether to scale your StatefulSets, see StatefulSet concepts or the StatefulSet tutorial for further information.
- You should perform scaling only when you are confident that your stateful application cluster is completely healthy.
Scaling StatefulSets
Use kubectl to scale StatefulSets
First, find the StatefulSet you want to scale.
kubectl get statefulsets <stateful-set-name>
Change the number of replicas of your StatefulSet:
kubectl scale statefulsets <stateful-set-name> --replicas=<new-replicas>
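For example, to scale a hypothetical StatefulSet named web to five replicas:
kubectl scale statefulsets web --replicas=5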
Make in-place updates on your StatefulSets
Alternatively, you can do in-place updates on your StatefulSets.
If your StatefulSet was initially created with kubectl apply, update .spec.replicas of the StatefulSet manifests, and then do a kubectl apply:
kubectl apply -f <stateful-set-file-updated>
Otherwise, edit that field with kubectl edit:
kubectl edit statefulsets <stateful-set-name>
Or use kubectl patch:
kubectl patch statefulsets <stateful-set-name> -p '{"spec":{"replicas":<new-replicas>}}'
Troubleshooting
Scaling down does not work right
You cannot scale down a StatefulSet when any of the stateful Pods it manages is unhealthy. Scaling down only takes place after those stateful Pods become running and ready.
If spec.replicas > 1, Kubernetes cannot determine the reason for an unhealthy Pod. It might be the result of a permanent fault or of a transient fault. A transient fault can be caused by a restart required by upgrading or maintenance.
If the Pod is unhealthy due to a permanent fault, scaling without correcting the fault may lead to a state where the StatefulSet membership drops below a certain minimum number of replicas that are needed to function correctly. This may cause your StatefulSet to become unavailable.
If the Pod is unhealthy due to a transient fault and the Pod might become available again, the transient error may interfere with your scale-up or scale-down operation. Some distributed databases have issues when nodes join and leave at the same time. It is better to reason about scaling operations at the application level in these cases, and perform scaling only when you are sure that your stateful application cluster is completely healthy.
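One way to check this at the Kubernetes level, sketched here under the assumption that your stateful Pods carry a label such as app=myapp (substitute your own), is to confirm every Pod is running and fully ready before you issue a scale operation:
# Every Pod should show STATUS Running and a full READY count (for example 2/2)
kubectl get pods -l app=myapp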
What's next
- Learn more about deleting a StatefulSet.
4.7.5 - Delete a StatefulSet
This task shows you how to delete a StatefulSet.
Before you begin
- This task assumes you have an application running on your cluster represented by a StatefulSet.
Deleting a StatefulSet
You can delete a StatefulSet in the same way you delete other resources in Kubernetes: use the kubectl delete command, and specify the StatefulSet either by file or by name.
kubectl delete -f <file.yaml>
kubectl delete statefulsets <statefulset-name>
You may need to delete the associated headless service separately after the StatefulSet itself is deleted.
kubectl delete service <service-name>
When deleting a StatefulSet through kubectl, the StatefulSet scales down to 0. All Pods that are part of this workload are also deleted. If you want to delete only the StatefulSet and not the Pods, use --cascade=orphan.
For example:
kubectl delete -f <file.yaml> --cascade=orphan
By passing --cascade=orphan to kubectl delete, the Pods managed by the StatefulSet are left behind even after the StatefulSet object itself is deleted. If the pods have a label app=myapp, you can then delete them as follows:
kubectl delete pods -l app=myapp
Persistent Volumes
Deleting the Pods in a StatefulSet will not delete the associated volumes. This is to ensure that you have the chance to copy data off the volume before deleting it. Deleting the PVC after the pods have terminated might trigger deletion of the backing Persistent Volumes depending on the storage class and reclaim policy. You should never assume ability to access a volume after claim deletion.
Complete deletion of a StatefulSet
To delete everything in a StatefulSet, including the associated pods, you can run a series of commands similar to the following:
# Capture the termination grace period of one of the StatefulSet's Pods
grace=$(kubectl get pods <stateful-set-pod> --template '{{.spec.terminationGracePeriodSeconds}}')
# Delete the StatefulSet (and, by cascade, its Pods)
kubectl delete statefulset -l app=myapp
# Wait for the Pods to finish their graceful shutdown
sleep $grace
# Then delete the PersistentVolumeClaims
kubectl delete pvc -l app=myapp
In the example above, the Pods have the label app=myapp; substitute your own label as appropriate.
Force deletion of StatefulSet pods
If you find that some pods in your StatefulSet are stuck in the 'Terminating' or 'Unknown' states for an extended period of time, you may need to manually intervene to forcefully delete the pods from the apiserver. This is a potentially dangerous task. Refer to Force Delete StatefulSet Pods for details.
What's next
Learn more about force deleting StatefulSet Pods.
4.7.6 - Force Delete StatefulSet Pods
This page shows how to delete Pods which are part of a stateful set, and explains the considerations to keep in mind when doing so.
Before you begin
- This is a fairly advanced task and has the potential to violate some of the properties inherent to StatefulSet.
- Before proceeding, make yourself familiar with the considerations enumerated below.
StatefulSet considerations
In normal operation of a StatefulSet, there is never a need to force delete a StatefulSet Pod. The StatefulSet controller is responsible for creating, scaling and deleting members of the StatefulSet. It tries to ensure that the specified number of Pods from ordinal 0 through N-1 are alive and ready. StatefulSet ensures that, at any time, there is at most one Pod with a given identity running in a cluster. This is referred to as at most one semantics provided by a StatefulSet.
Manual force deletion should be undertaken with caution, as it has the potential to violate the at most one semantics inherent to StatefulSet. StatefulSets may be used to run distributed and clustered applications which have a need for a stable network identity and stable storage. These applications often have configuration which relies on an ensemble of a fixed number of members with fixed identities. Having multiple members with the same identity can be disastrous and may lead to data loss (e.g. split brain scenario in quorum-based systems).
Delete Pods
You can perform a graceful pod deletion with the following command:
kubectl delete pods <pod>
For the above to lead to graceful termination, the Pod must not specify a pod.Spec.TerminationGracePeriodSeconds of 0. The practice of setting a pod.Spec.TerminationGracePeriodSeconds of 0 seconds is unsafe and strongly discouraged for StatefulSet Pods. Graceful deletion is safe and will ensure that the Pod shuts down gracefully before the kubelet deletes the name from the apiserver.
A Pod is not deleted automatically when a node is unreachable. The Pods running on an unreachable Node enter the 'Terminating' or 'Unknown' state after a timeout. Pods may also enter these states when the user attempts graceful deletion of a Pod on an unreachable Node. The only ways in which a Pod in such a state can be removed from the apiserver are as follows:
- The Node object is deleted (either by you, or by the Node Controller).
- The kubelet on the unresponsive Node starts responding, kills the Pod and removes the entry from the apiserver.
- Force deletion of the Pod by the user.
The recommended best practice is to use the first or second approach. If a Node is confirmed to be dead (e.g. permanently disconnected from the network, powered down, etc), then delete the Node object. If the Node is suffering from a network partition, then try to resolve this or wait for it to resolve. When the partition heals, the kubelet will complete the deletion of the Pod and free up its name in the apiserver.
Normally, the system completes the deletion once the Pod is no longer running on a Node, or the Node is deleted by an administrator. You may override this by force deleting the Pod.
Force Deletion
Force deletions do not wait for confirmation from the kubelet that the Pod has been terminated. Irrespective of whether a force deletion is successful in killing a Pod, it will immediately free up the name from the apiserver. This would let the StatefulSet controller create a replacement Pod with that same identity; this can lead to the duplication of a still-running Pod, and if said Pod can still communicate with the other members of the StatefulSet, will violate the at most one semantics that StatefulSet is designed to guarantee.
When you force delete a StatefulSet pod, you are asserting that the Pod in question will never again make contact with other Pods in the StatefulSet and its name can be safely freed up for a replacement to be created.
If you want to delete a Pod forcibly using kubectl version >= 1.5, do the following:
kubectl delete pods <pod> --grace-period=0 --force
If you're using any version of kubectl <= 1.4, you should omit the --force option and use:
kubectl delete pods <pod> --grace-period=0
If even after these commands the pod is stuck in the Unknown state, use the following command to remove the pod from the cluster:
kubectl patch pod <pod> -p '{"metadata":{"finalizers":null}}'
Always perform force deletion of StatefulSet Pods carefully and with complete knowledge of the risks involved.
What's next
Learn more about debugging a StatefulSet.
4.7.7 - Horizontal Pod Autoscaling
In Kubernetes, a HorizontalPodAutoscaler automatically updates a workload resource (such as a Deployment or StatefulSet), with the aim of automatically scaling the workload to match demand.
Horizontal scaling means that the response to increased load is to deploy more Pods. This is different from vertical scaling, which for Kubernetes would mean assigning more resources (for example: memory or CPU) to the Pods that are already running for the workload.
If the load decreases, and the number of Pods is above the configured minimum, the HorizontalPodAutoscaler instructs the workload resource (the Deployment, StatefulSet, or other similar resource) to scale back down.
Horizontal pod autoscaling does not apply to objects that can't be scaled (for example: a DaemonSet).
The HorizontalPodAutoscaler is implemented as a Kubernetes API resource and a controller. The resource determines the behavior of the controller. The horizontal pod autoscaling controller, running within the Kubernetes control plane, periodically adjusts the desired scale of its target (for example, a Deployment) to match observed metrics such as average CPU utilization, average memory utilization, or any other custom metric you specify.
There is a walkthrough example of using horizontal pod autoscaling.
How does a HorizontalPodAutoscaler work?
Kubernetes implements horizontal pod autoscaling as a control loop that runs intermittently (it is not a continuous process). The interval is set by the --horizontal-pod-autoscaler-sync-period parameter to the kube-controller-manager (and the default interval is 15 seconds).
Once during each period, the controller manager queries the resource utilization against the metrics specified in each HorizontalPodAutoscaler definition. The controller manager finds the target resource defined by the scaleTargetRef, then selects the pods based on the target resource's .spec.selector labels, and obtains the metrics from either the resource metrics API (for per-pod resource metrics), or the custom metrics API (for all other metrics).
- For per-pod resource metrics (like CPU), the controller fetches the metrics from the resource metrics API for each Pod targeted by the HorizontalPodAutoscaler. Then, if a target utilization value is set, the controller calculates the utilization value as a percentage of the equivalent resource request on the containers in each Pod. If a target raw value is set, the raw metric values are used directly. The controller then takes the mean of the utilization or the raw value (depending on the type of target specified) across all targeted Pods, and produces a ratio used to scale the number of desired replicas.
Please note that if some of the Pod's containers do not have the relevant resource request set, CPU utilization for the Pod will not be defined and the autoscaler will not take any action for that metric. See the algorithm details section below for more information about how the autoscaling algorithm works.
- For per-pod custom metrics, the controller functions similarly to per-pod resource metrics, except that it works with raw values, not utilization values.
- For object metrics and external metrics, a single metric is fetched, which describes the object in question. This metric is compared to the target value, to produce a ratio as above. In the autoscaling/v2 API version, this value can optionally be divided by the number of Pods before the comparison is made.
The common use for HorizontalPodAutoscaler is to configure it to fetch metrics from aggregated APIs (metrics.k8s.io, custom.metrics.k8s.io, or external.metrics.k8s.io). The metrics.k8s.io API is usually provided by an add-on named Metrics Server, which needs to be launched separately. For more information about resource metrics, see Metrics Server.
Support for metrics APIs explains the stability guarantees and support status for these different APIs.
The HorizontalPodAutoscaler controller accesses corresponding workload resources that support scaling (such as Deployments and StatefulSets). These resources each have a subresource named scale, an interface that allows you to dynamically set the number of replicas and examine each of their current states. For general information about subresources in the Kubernetes API, see Kubernetes API Concepts.
Algorithm details
From the most basic perspective, the HorizontalPodAutoscaler controller operates on the ratio between desired metric value and current metric value:
desiredReplicas = ceil[currentReplicas * ( currentMetricValue / desiredMetricValue )]
For example, if the current metric value is 200m, and the desired value is 100m, the number of replicas will be doubled, since 200.0 / 100.0 == 2.0. If the current value is instead 50m, you'll halve the number of replicas, since 50.0 / 100.0 == 0.5. The control plane skips any scaling action if the ratio is sufficiently close to 1.0 (within a globally-configurable tolerance, 0.1 by default); for instance, a ratio of 1.05 falls within the default tolerance, so no scaling occurs.
When a targetAverageValue or targetAverageUtilization is specified, the currentMetricValue is computed by taking the average of the given metric across all Pods in the HorizontalPodAutoscaler's scale target.
Before checking the tolerance and deciding on the final values, the control plane also considers whether any metrics are missing, and how many Pods are Ready.
All Pods with a deletion timestamp set (objects with a deletion timestamp are
in the process of being shut down / removed) are ignored, and all failed Pods
are discarded.
If a particular Pod is missing metrics, it is set aside for later; Pods with missing metrics will be used to adjust the final scaling amount.
When scaling on CPU, if any pod has yet to become ready (it's still initializing, or possibly is unhealthy) or the most recent metric point for the pod was before it became ready, that pod is set aside as well.
Due to technical constraints, the HorizontalPodAutoscaler controller cannot exactly determine the first time a pod becomes ready when determining whether to set aside certain CPU metrics. Instead, it considers a Pod "not yet ready" if it's unready and transitioned to unready within a short, configurable window of time since it started. This value is configured with the --horizontal-pod-autoscaler-initial-readiness-delay flag, and its default is 30 seconds. Once a pod has become ready, it considers any transition to ready to be the first if it occurred within a longer, configurable time since it started. This value is configured with the --horizontal-pod-autoscaler-cpu-initialization-period flag, and its default is 5 minutes.
The currentMetricValue / desiredMetricValue base scale ratio is then calculated using the remaining pods not set aside or discarded from above.
If there were any missing metrics, the control plane recomputes the average more conservatively, assuming those pods were consuming 100% of the desired value in case of a scale down, and 0% in case of a scale up. This dampens the magnitude of any potential scale.
Furthermore, if any not-yet-ready pods were present, and the workload would have scaled up without factoring in missing metrics or not-yet-ready pods, the controller conservatively assumes that the not-yet-ready pods are consuming 0% of the desired metric, further dampening the magnitude of a scale up.
After factoring in the not-yet-ready pods and missing metrics, the controller recalculates the usage ratio. If the new ratio reverses the scale direction, or is within the tolerance, the controller doesn't take any scaling action. In other cases, the new ratio is used to decide any change to the number of Pods.
Note that the original value for the average utilization is reported back via the HorizontalPodAutoscaler status, without factoring in the not-yet-ready pods or missing metrics, even when the new usage ratio is used.
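To illustrate the dampening with a simplified worked example (assumed numbers, ignoring the tolerance check and not-yet-ready Pods): suppose the target is an average of 100m CPU across 4 Pods, three Pods report 200m each, and one Pod's metrics are missing.
rawRatio = 200m / 100m = 2.0 (the average of the 3 reporting Pods suggests a scale up)
conservativeAverage = ((3 * 200m) + (1 * 0m)) / 4 = 150m (the missing Pod is assumed at 0% for a scale up)
newRatio = 150m / 100m = 1.5 (same direction and outside the tolerance, so it is used)
desiredReplicas = ceil[4 * 1.5] = 6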
If multiple metrics are specified in a HorizontalPodAutoscaler, this calculation is done for each metric, and then the largest of the desired replica counts is chosen. If any of these metrics cannot be converted into a desired replica count (e.g. due to an error fetching the metrics from the metrics APIs) and a scale down is suggested by the metrics which can be fetched, scaling is skipped. This means that the HPA is still capable of scaling up if one or more metrics give a desiredReplicas greater than the current value.
Finally, right before HPA scales the target, the scale recommendation is recorded. The controller considers all recommendations within a configurable window, choosing the highest recommendation from within that window. This value can be configured using the --horizontal-pod-autoscaler-downscale-stabilization flag, which defaults to 5 minutes. This means that scaledowns will occur gradually, smoothing out the impact of rapidly fluctuating metric values.
API Object
The Horizontal Pod Autoscaler is an API resource in the Kubernetes autoscaling API group. The current stable version can be found in the autoscaling/v2 API version, which includes support for scaling on memory and custom metrics. The new fields introduced in autoscaling/v2 are preserved as annotations when working with autoscaling/v1.
When you create a HorizontalPodAutoscaler API object, make sure the name specified is a valid DNS subdomain name. More details about the API object can be found at HorizontalPodAutoscaler Object.
Stability of workload scale
When managing the scale of a group of replicas using the HorizontalPodAutoscaler, it is possible that the number of replicas keeps fluctuating frequently due to the dynamic nature of the metrics evaluated. This is sometimes referred to as thrashing, or flapping. It's similar to the concept of hysteresis in cybernetics.
Autoscaling during rolling update
Kubernetes lets you perform a rolling update on a Deployment. In that
case, the Deployment manages the underlying ReplicaSets for you.
When you configure autoscaling for a Deployment, you bind a
HorizontalPodAutoscaler to a single Deployment. The HorizontalPodAutoscaler
manages the replicas
field of the Deployment. The deployment controller is responsible
for setting the replicas
of the underlying ReplicaSets so that they add up to a suitable
number during the rollout and also afterwards.
If you perform a rolling update of a StatefulSet that has an autoscaled number of replicas, the StatefulSet directly manages its set of Pods (there is no intermediate resource similar to ReplicaSet).
Support for resource metrics
Any HPA target can be scaled based on the resource usage of the pods in the scaling target. When defining the pod specification, the resource requests like cpu and memory should be specified. This is used to determine the resource utilization and used by the HPA controller to scale the target up or down. To use resource utilization based scaling, specify a metric source like this:
type: Resource
resource:
  name: cpu
  target:
    type: Utilization
    averageUtilization: 60
With this metric the HPA controller will keep the average utilization of the pods in the scaling target at 60%. Utilization is the ratio between the current usage of resource to the requested resources of the pod. See Algorithm for more details about how the utilization is calculated and averaged.
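For context, here is a sketch of a complete HorizontalPodAutoscaler manifest that embeds such a metric source; the name and scale target (web-app) are hypothetical placeholders:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app              # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app            # hypothetical Deployment to scale
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60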
Container resource metrics
Kubernetes v1.20 [alpha]
The HorizontalPodAutoscaler API also supports a container metric source where the HPA can track the resource usage of individual containers across a set of Pods, in order to scale the target resource. This lets you configure scaling thresholds for the containers that matter most in a particular Pod. For example, if you have a web application and a logging sidecar, you can scale based on the resource use of the web application, ignoring the sidecar container and its resource use.
If you revise the target resource to have a new Pod specification with a different set of containers, you should revise the HPA spec if that newly added container should also be used for scaling. If the specified container in the metric source is not present or only present in a subset of the pods then those pods are ignored and the recommendation is recalculated. See Algorithm for more details about the calculation. To use container resources for autoscaling define a metric source as follows:
type: ContainerResource
containerResource:
  name: cpu
  container: application
  target:
    type: Utilization
    averageUtilization: 60
In the above example the HPA controller scales the target such that the average utilization of the cpu in the application container of all the pods is 60%.
If you change the name of a container that a HorizontalPodAutoscaler is tracking, you can make that change in a specific order to ensure scaling remains available and effective whilst the change is being applied. Before you update the resource that defines the container (such as a Deployment), you should update the associated HPA to track both the new and old container names. This way, the HPA is able to calculate a scaling recommendation throughout the update process.
Once you have rolled out the container name change to the workload resource, tidy up by removing the old container name from the HPA specification.
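As a sketch of that order of operations, assuming the container is being renamed from application to web (both names hypothetical), the HPA's metrics list would temporarily carry both entries during the rollout:
metrics:
- type: ContainerResource
  containerResource:
    name: cpu
    container: application   # old name; remove once the rollout completes
    target:
      type: Utilization
      averageUtilization: 60
- type: ContainerResource
  containerResource:
    name: cpu
    container: web           # new name
    target:
      type: Utilization
      averageUtilization: 60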
Scaling on custom metrics
Kubernetes v1.23 [stable]
(the autoscaling/v2beta2 API version previously provided this ability as a beta feature)
Provided that you use the autoscaling/v2 API version, you can configure a HorizontalPodAutoscaler to scale based on a custom metric (that is not built in to Kubernetes or any Kubernetes component). The HorizontalPodAutoscaler controller then queries for these custom metrics from the Kubernetes API.
See Support for metrics APIs for the requirements.
Scaling on multiple metrics
Kubernetes v1.23 [stable]
(the autoscaling/v2beta2 API version previously provided this ability as a beta feature)
Provided that you use the autoscaling/v2 API version, you can specify multiple metrics for a HorizontalPodAutoscaler to scale on. Then, the HorizontalPodAutoscaler controller evaluates each metric, and proposes a new scale based on that metric. The HorizontalPodAutoscaler takes the maximum scale recommended for each metric and sets the workload to that size (provided that this isn't larger than the overall maximum that you configured).
Support for metrics APIs
By default, the HorizontalPodAutoscaler controller retrieves metrics from a series of APIs. In order for it to access these APIs, cluster administrators must ensure that:
- The API aggregation layer is enabled.
- The corresponding APIs are registered:
  - For resource metrics, this is the metrics.k8s.io API, generally provided by metrics-server. It can be launched as a cluster add-on.
  - For custom metrics, this is the custom.metrics.k8s.io API. It's provided by "adapter" API servers provided by metrics solution vendors. Check with your metrics pipeline to see if there is a Kubernetes metrics adapter available.
  - For external metrics, this is the external.metrics.k8s.io API. It may be provided by the custom metrics adapters provided above.
For more information on these different metrics paths and how they differ, please see the relevant design proposals for the HPA V2, custom.metrics.k8s.io and external.metrics.k8s.io.
For examples of how to use them see the walkthrough for using custom metrics and the walkthrough for using external metrics.
Configurable scaling behavior
Kubernetes v1.23 [stable]
(the autoscaling/v2beta2 API version previously provided this ability as a beta feature)
If you use the v2 HorizontalPodAutoscaler API, you can use the behavior field (see the API reference) to configure separate scale-up and scale-down behaviors. You specify these behaviours by setting scaleUp and / or scaleDown under the behavior field.
You can specify a stabilization window that prevents flapping the replica count for a scaling target. Scaling policies also let you control the rate of change of replicas while scaling.
Scaling policies
One or more scaling policies can be specified in the behavior section of the spec. When multiple policies are specified, the policy which allows the highest amount of change is the policy which is selected by default. The following example shows this behavior while scaling down:
behavior:
  scaleDown:
    policies:
    - type: Pods
      value: 4
      periodSeconds: 60
    - type: Percent
      value: 10
      periodSeconds: 60
periodSeconds indicates the length of time in the past for which the policy must hold true. The first policy (Pods) allows at most 4 replicas to be scaled down in one minute. The second policy (Percent) allows at most 10% of the current replicas to be scaled down in one minute.
Since by default the policy which allows the highest amount of change is selected, the second policy will only be used when the number of pod replicas is more than 40. With 40 or fewer replicas, the first policy will be applied. For instance, if there are 80 replicas and the target has to be scaled down to 10 replicas, then during the first step 8 replicas will be reduced. In the next iteration, when the number of replicas is 72, 10% of the pods is 7.2 but the number is rounded up to 8. On each loop of the autoscaler controller, the number of pods to change is re-calculated based on the number of current replicas. When the number of replicas falls below 40, the first policy (Pods) is applied and 4 replicas will be reduced at a time.
The policy selection can be changed by specifying the selectPolicy field for a scaling direction. Setting the value to Min selects the policy which allows the smallest change in the replica count. Setting the value to Disabled completely disables scaling in that direction.
Stabilization window
The stabilization window is used to restrict the flapping of replicas count when the metrics used for scaling keep fluctuating. The autoscaling algorithm uses this window to infer a previous desired state and avoid unwanted changes to workload scale.
For example, in the following example snippet, a stabilization window is specified for scaleDown.
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300
When the metrics indicate that the target should be scaled down the algorithm looks into previously computed desired states, and uses the highest value from the specified interval. In the above example, all desired states from the past 5 minutes will be considered.
This approximates a rolling maximum, and avoids having the scaling algorithm frequently remove Pods only to trigger recreating an equivalent Pod just moments later.
Default Behavior
To use custom scaling, not all fields have to be specified. Only values which need to be customized can be specified. These custom values are merged with default values. The default values match the existing behavior in the HPA algorithm.
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
    - type: Percent
      value: 100
      periodSeconds: 15
  scaleUp:
    stabilizationWindowSeconds: 0
    policies:
    - type: Percent
      value: 100
      periodSeconds: 15
    - type: Pods
      value: 4
      periodSeconds: 15
    selectPolicy: Max
For scaling down, the stabilization window is 300 seconds (or the value of the --horizontal-pod-autoscaler-downscale-stabilization flag if provided). There is only a single policy for scaling down, which allows 100% of the currently running replicas to be removed; this means the scaling target can be scaled down to the minimum allowed replicas.
For scaling up there is no stabilization window. When the metrics indicate that the target should be scaled up, the target is scaled up immediately. There are 2 policies: either 4 pods or 100% of the currently running replicas will be added every 15 seconds until the HPA reaches its steady state.
Example: change downscale stabilization window
To provide a custom downscale stabilization window of 1 minute, the following behavior would be added to the HPA:
behavior:
  scaleDown:
    stabilizationWindowSeconds: 60
Example: limit scale down rate
To limit the rate at which pods are removed by the HPA to 10% per minute, the following behavior would be added to the HPA:
behavior:
  scaleDown:
    policies:
    - type: Percent
      value: 10
      periodSeconds: 60
To ensure that no more than 5 Pods are removed per minute, you can add a second scale-down policy with a fixed size of 5, and set selectPolicy to minimum. Setting selectPolicy to Min means that the autoscaler chooses the policy that affects the smallest number of Pods:
behavior:
  scaleDown:
    policies:
    - type: Percent
      value: 10
      periodSeconds: 60
    - type: Pods
      value: 5
      periodSeconds: 60
    selectPolicy: Min
Example: disable scale down
The selectPolicy value of Disabled turns off scaling in the given direction. So to prevent downscaling, the following policy would be used:
behavior:
  scaleDown:
    selectPolicy: Disabled
Support for HorizontalPodAutoscaler in kubectl
HorizontalPodAutoscaler, like every API resource, is supported in a standard way by kubectl. You can create a new autoscaler using the kubectl create command. You can list autoscalers with kubectl get hpa or get a detailed description with kubectl describe hpa. Finally, you can delete an autoscaler using kubectl delete hpa.
In addition, there is a special kubectl autoscale command for creating a HorizontalPodAutoscaler object. For instance, executing kubectl autoscale rs foo --min=2 --max=5 --cpu-percent=80 will create an autoscaler for ReplicaSet foo, with target CPU utilization set to 80% and the number of replicas between 2 and 5.
Implicit maintenance-mode deactivation
You can implicitly deactivate the HPA for a target without the need to change the HPA configuration itself. If the target's desired replica count is set to 0, and the HPA's minimum replica count is greater than 0, the HPA stops adjusting the target (and sets the ScalingActive Condition on itself to false) until you reactivate it by manually adjusting the target's desired replica count or HPA's minimum replica count.
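A minimal sketch of this flow, assuming a Deployment named web-app (a hypothetical name) with an HPA attached to it:
# Deactivate the HPA by scaling the target itself to zero
kubectl scale deployment/web-app --replicas=0
# The HPA's ScalingActive condition is now false; you can inspect it with:
kubectl describe hpa web-app
# Reactivate by restoring a non-zero desired replica count on the target
kubectl scale deployment/web-app --replicas=1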
Migrating Deployments and StatefulSets to horizontal autoscaling
When an HPA is enabled, it is recommended that the value of spec.replicas of the Deployment and / or StatefulSet be removed from their manifest(s). If this isn't done, any time a change to that object is applied, for example via kubectl apply -f deployment.yaml, this will instruct Kubernetes to scale the current number of Pods to the value of the spec.replicas key. This may not be desired and could be troublesome when an HPA is active.
Keep in mind that the removal of spec.replicas may incur a one-time degradation of Pod counts as the default value of this key is 1 (reference Deployment Replicas).
Upon the update, all Pods except 1 will begin their termination procedures. Any deployment application afterwards will behave as normal and respect a rolling update configuration as desired. You can avoid this degradation by choosing one of the following two methods based on how you are modifying your deployments:
- Run kubectl apply edit-last-applied deployment/<deployment_name>
- In the editor, remove spec.replicas. When you save and exit the editor, kubectl applies the update. No changes to Pod counts happen at this step.
- You can now remove spec.replicas from the manifest. If you use source code management, also commit your changes or take whatever other steps for revising the source code are appropriate for how you track updates.
- From here on out you can run kubectl apply -f deployment.yaml
When using Server-Side Apply you can follow the transferring ownership guidelines, which cover this exact use case.
What's next
If you configure autoscaling in your cluster, you may also want to consider running a cluster-level autoscaler such as Cluster Autoscaler.
For more information on HorizontalPodAutoscaler:
- Read a walkthrough example for horizontal pod autoscaling.
- Read documentation for kubectl autoscale.
- If you would like to write your own custom metrics adapter, check out the boilerplate to get started.
- Read the API reference for HorizontalPodAutoscaler.
4.7.8 - HorizontalPodAutoscaler Walkthrough
A HorizontalPodAutoscaler (HPA for short) automatically updates a workload resource (such as a Deployment or StatefulSet), with the aim of automatically scaling the workload to match demand.
Horizontal scaling means that the response to increased load is to deploy more Pods. This is different from vertical scaling, which for Kubernetes would mean assigning more resources (for example: memory or CPU) to the Pods that are already running for the workload.
If the load decreases, and the number of Pods is above the configured minimum, the HorizontalPodAutoscaler instructs the workload resource (the Deployment, StatefulSet, or other similar resource) to scale back down.
This document walks you through an example of enabling HorizontalPodAutoscaler to automatically manage scale for an example web app. This example workload is Apache httpd running some PHP code.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
Your Kubernetes server must be at or later than version 1.23. To check the version, enter kubectl version.
If you're running an older release of Kubernetes, refer to the version of the documentation for that release (see available documentation versions).
To follow this walkthrough, you also need to use a cluster that has a Metrics Server deployed and configured. The Kubernetes Metrics Server collects resource metrics from the kubelets in your cluster, and exposes those metrics through the Kubernetes API, using an APIService to add new kinds of resource that represent metric readings.
To learn how to deploy the Metrics Server, see the metrics-server documentation.
Run and expose php-apache server
To demonstrate a HorizontalPodAutoscaler, you will first make a custom container image that uses the php-apache image from Docker Hub as its starting point. The Dockerfile is ready-made for you, and has the following content:
FROM php:5-apache
COPY index.php /var/www/html/index.php
RUN chmod a+rx index.php
This code defines a simple index.php page that performs some CPU intensive computations, in order to simulate load in your cluster.
<?php
$x = 0.0001;
for ($i = 0; $i <= 1000000; $i++) {
  $x += sqrt($x);
}
echo "OK!";
?>
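If you prefer to build this image yourself instead of using the prebuilt k8s.gcr.io/hpa-example image referenced in the manifest below, a sketch of the build (the registry name is a placeholder):
# Build from the Dockerfile and index.php shown above, then push to a
# registry your cluster can pull images from
docker build -t <your-registry>/hpa-example .
docker push <your-registry>/hpa-example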
Once you have made that container image, start a Deployment that runs a container using the image you made, and expose it as a Service using the following manifest:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: php-apache
spec:
  selector:
    matchLabels:
      run: php-apache
  replicas: 1
  template:
    metadata:
      labels:
        run: php-apache
    spec:
      containers:
      - name: php-apache
        image: k8s.gcr.io/hpa-example
        ports:
        - containerPort: 80
        resources:
          limits:
            cpu: 500m
          requests:
            cpu: 200m
---
apiVersion: v1
kind: Service
metadata:
  name: php-apache
  labels:
    run: php-apache
spec:
  ports:
  - port: 80
  selector:
    run: php-apache
To do so, run the following command:
kubectl apply -f https://k8s.io/examples/application/php-apache.yaml
deployment.apps/php-apache created
service/php-apache created
Create the HorizontalPodAutoscaler
Now that the server is running, create the autoscaler using kubectl. There is a kubectl autoscale subcommand, part of kubectl, that helps you do this.
You will shortly run a command that creates a HorizontalPodAutoscaler that maintains between 1 and 10 replicas of the Pods controlled by the php-apache Deployment that you created in the first step of these instructions.
Roughly speaking, the HPA controller will increase and decrease the number of replicas (by updating the Deployment) to maintain an average CPU utilization across all Pods of 50%. The Deployment then updates the ReplicaSet - this is part of how all Deployments work in Kubernetes - and then the ReplicaSet either adds or removes Pods based on the change to its .spec. Since each pod requests 200 milli-cores (as set in the manifest above), this means an average CPU usage of 100 milli-cores. See Algorithm details for more details on the algorithm.
Create the HorizontalPodAutoscaler:
kubectl autoscale deployment php-apache --cpu-percent=50 --min=1 --max=10
horizontalpodautoscaler.autoscaling/php-apache autoscaled
You can check the current status of the newly-made HorizontalPodAutoscaler, by running:
# You can use "hpa" or "horizontalpodautoscaler"; either name works OK.
kubectl get hpa
The output is similar to:
NAME REFERENCE TARGET MINPODS MAXPODS REPLICAS AGE
php-apache Deployment/php-apache/scale 0% / 50% 1 10 1 18s
(if you see other HorizontalPodAutoscalers with different names, that means they already existed, and isn't usually a problem).
Please note that the current CPU consumption is 0% as there are no clients sending requests to the server (the TARGET column shows the average across all the Pods controlled by the corresponding deployment).
Increase the load
Next, see how the autoscaler reacts to increased load. To do this, you'll start a different Pod to act as a client. The container within the client Pod runs in an infinite loop, sending queries to the php-apache service.
# Run this in a separate terminal
# so that the load generation continues and you can carry on with the rest of the steps
kubectl run -i --tty load-generator --rm --image=busybox:1.28 --restart=Never -- /bin/sh -c "while sleep 0.01; do wget -q -O- http://php-apache; done"
Now run:
# type Ctrl+C to end the watch when you're ready
kubectl get hpa php-apache --watch
Within a minute or so, you should see the higher CPU load; for example:
NAME REFERENCE TARGET MINPODS MAXPODS REPLICAS AGE
php-apache Deployment/php-apache/scale 305% / 50% 1 10 1 3m
and then, more replicas. For example:
NAME REFERENCE TARGET MINPODS MAXPODS REPLICAS AGE
php-apache Deployment/php-apache/scale 305% / 50% 1 10 7 3m
Here, CPU consumption has increased to 305% of the request. As a result, the Deployment was resized to 7 replicas:
kubectl get deployment php-apache
You should see the replica count matching the figure from the HorizontalPodAutoscaler:
NAME READY UP-TO-DATE AVAILABLE AGE
php-apache 7/7 7 7 19m
Stop generating load
To finish the example, stop sending the load.
In the terminal where you created the Pod that runs a busybox image, terminate the load generation by typing <Ctrl> + C.
Then verify the result state (after a minute or so):
# type Ctrl+C to end the watch when you're ready
kubectl get hpa php-apache --watch
The output is similar to:
NAME REFERENCE TARGET MINPODS MAXPODS REPLICAS AGE
php-apache Deployment/php-apache/scale 0% / 50% 1 10 1 11m
and the Deployment also shows that it has scaled down:
kubectl get deployment php-apache
NAME READY UP-TO-DATE AVAILABLE AGE
php-apache 1/1 1 1 27m
Once CPU utilization dropped to 0, the HPA automatically scaled the number of replicas back down to 1.
Autoscaling the replicas may take a few minutes.
Autoscaling on multiple metrics and custom metrics
You can introduce additional metrics to use when autoscaling the php-apache Deployment by making use of the autoscaling/v2 API version.
First, get the YAML of your HorizontalPodAutoscaler in the autoscaling/v2 form:
kubectl get hpa php-apache -o yaml > /tmp/hpa-v2.yaml
Open the /tmp/hpa-v2.yaml file in an editor, and you should see YAML which looks like this:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: php-apache
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: php-apache
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
status:
  observedGeneration: 1
  lastScaleTime: <some-time>
  currentReplicas: 1
  desiredReplicas: 1
  currentMetrics:
  - type: Resource
    resource:
      name: cpu
      current:
        averageUtilization: 0
        averageValue: 0
Notice that the targetCPUUtilizationPercentage field has been replaced with an array called metrics. The CPU utilization metric is a resource metric, since it is represented as a percentage of a resource specified on pod containers. Notice that you can specify other resource metrics besides CPU. By default, the only other supported resource metric is memory. These resources do not change names from cluster to cluster, and should always be available, as long as the metrics.k8s.io API is available.
You can also specify resource metrics in terms of direct values, instead of as percentages of the requested value, by using a target.type of AverageValue instead of Utilization, and setting the corresponding target.averageValue field instead of the target.averageUtilization.
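For example, the following sketch of a metric source targets an average of 500m CPU per Pod directly, rather than as a percentage of the Pods' requests (the 500m figure is an arbitrary illustration):
type: Resource
resource:
  name: cpu
  target:
    type: AverageValue
    averageValue: 500m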
There are two other types of metrics, both of which are considered custom metrics: pod metrics and object metrics. These metrics may have names which are cluster specific, and require a more advanced cluster monitoring setup.
The first of these alternative metric types is pod metrics. These metrics describe Pods, and are averaged together across Pods and compared with a target value to determine the replica count. They work much like resource metrics, except that they only support a target type of AverageValue.
Pod metrics are specified using a metric block like this:
type: Pods
pods:
  metric:
    name: packets-per-second
  target:
    type: AverageValue
    averageValue: 1k
The second alternative metric type is object metrics. These metrics describe a different object in the same namespace, instead of describing Pods. The metrics are not necessarily fetched from the object; they only describe it. Object metrics support target types of both Value and AverageValue. With Value, the target is compared directly to the returned metric from the API. With AverageValue, the value returned from the custom metrics API is divided by the number of Pods before being compared to the target. The following example is the YAML representation of the requests-per-second metric.
type: Object
object:
  metric:
    name: requests-per-second
  describedObject:
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    name: main-route
  target:
    type: Value
    value: 2k
If you provide multiple such metric blocks, the HorizontalPodAutoscaler will consider each metric in turn. The HorizontalPodAutoscaler will calculate proposed replica counts for each metric, and then choose the one with the highest replica count.
For example, if you had your monitoring system collecting metrics about network traffic, you could update the definition above using kubectl edit to look like this:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: php-apache
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: php-apache
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
  - type: Pods
    pods:
      metric:
        name: packets-per-second
      target:
        type: AverageValue
        averageValue: 1k
  - type: Object
    object:
      metric:
        name: requests-per-second
      describedObject:
        apiVersion: networking.k8s.io/v1
        kind: Ingress
        name: main-route
      target:
        type: Value
        value: 10k
status:
  observedGeneration: 1
  lastScaleTime: <some-time>
  currentReplicas: 1
  desiredReplicas: 1
  currentMetrics:
  - type: Resource
    resource:
      name: cpu
      current:
        averageUtilization: 0
        averageValue: 0
  - type: Object
    object:
      metric:
        name: requests-per-second
      describedObject:
        apiVersion: networking.k8s.io/v1
        kind: Ingress
        name: main-route
      current:
        value: 10k
Then, your HorizontalPodAutoscaler would attempt to ensure that each pod was consuming roughly 50% of its requested CPU, serving 1000 packets per second, and that all pods behind the main-route Ingress were serving a total of 10000 requests per second.
Autoscaling on more specific metrics
Many metrics pipelines allow you to describe metrics either by name or by a set of additional descriptors called labels. For all non-resource metric types (pod, object, and external, described below), you can specify an additional label selector which is passed to your metric pipeline. For instance, if you collect a metric http_requests with the verb label, you can specify the following metric block to scale only on GET requests:
type: Object
object:
  metric:
    name: http_requests
    selector: {matchLabels: {verb: GET}}
This selector uses the same syntax as the full Kubernetes label selectors. The monitoring pipeline determines how to collapse multiple series into a single value, if the name and selector match multiple series. The selector is additive, and cannot select metrics that describe objects that are not the target object (the target pods in the case of the Pods type, and the described object in the case of the Object type).
Autoscaling on metrics not related to Kubernetes objects
Applications running on Kubernetes may need to autoscale based on metrics that don't have an obvious relationship to any object in the Kubernetes cluster, such as metrics describing a hosted service with no direct correlation to Kubernetes namespaces. In Kubernetes 1.10 and later, you can address this use case with external metrics.
Using external metrics requires knowledge of your monitoring system; the setup is similar to that required when using custom metrics. External metrics allow you to autoscale your cluster based on any metric available in your monitoring system. Provide a metric block with a name and selector, as above, and use the External metric type instead of Object.
If multiple time series are matched by the metricSelector, the sum of their values is used by the HorizontalPodAutoscaler. External metrics support both the Value and AverageValue target types, which function exactly the same as when you use the Object type.
For example if your application processes tasks from a hosted queue service, you could add the following section to your HorizontalPodAutoscaler manifest to specify that you need one worker per 30 outstanding tasks.
- type: External
  external:
    metric:
      name: queue_messages_ready
      selector:
        matchLabels:
          queue: "worker_tasks"
    target:
      type: AverageValue
      averageValue: 30
When possible, it's preferable to use the custom metric target types instead of external metrics, since it's easier for cluster administrators to secure the custom metrics API. The external metrics API potentially allows access to any metric, so cluster administrators should take care when exposing it.
Appendix: Horizontal Pod Autoscaler Status Conditions
When using the autoscaling/v2 form of the HorizontalPodAutoscaler, you will be able to see status conditions set by Kubernetes on the HorizontalPodAutoscaler. These status conditions indicate whether or not the HorizontalPodAutoscaler is able to scale, and whether or not it is currently restricted in any way.
The conditions appear in the status.conditions field. To see the conditions affecting a HorizontalPodAutoscaler, we can use kubectl describe hpa:
kubectl describe hpa cm-test
Name: cm-test
Namespace: prom
Labels: <none>
Annotations: <none>
CreationTimestamp: Fri, 16 Jun 2017 18:09:22 +0000
Reference: ReplicationController/cm-test
Metrics: ( current / target )
"http_requests" on pods: 66m / 500m
Min replicas: 1
Max replicas: 4
ReplicationController pods: 1 current / 1 desired
Conditions:
Type Status Reason Message
---- ------ ------ -------
AbleToScale True ReadyForNewScale the last scale time was sufficiently old as to warrant a new scale
ScalingActive True ValidMetricFound the HPA was able to successfully calculate a replica count from pods metric http_requests
ScalingLimited False DesiredWithinRange the desired replica count is within the acceptable range
Events:
For this HorizontalPodAutoscaler, you can see several conditions in a healthy state. The first, AbleToScale, indicates whether or not the HPA is able to fetch and update scales, as well as whether or not any backoff-related conditions would prevent scaling. The second, ScalingActive, indicates whether or not the HPA is enabled (i.e. the replica count of the target is not zero) and is able to calculate desired scales. When it is False, it generally indicates problems with fetching metrics. Finally, the last condition, ScalingLimited, indicates that the desired scale was capped by the maximum or minimum of the HorizontalPodAutoscaler. This is an indication that you may wish to raise or lower the minimum or maximum replica count constraints on your HorizontalPodAutoscaler.
Quantities
All metrics in the HorizontalPodAutoscaler and metrics APIs are specified using a special whole-number notation known in Kubernetes as a quantity. For example, the quantity 10500m would be written as 10.5 in decimal notation. The metrics APIs will return whole numbers without a suffix when possible, and will generally return quantities in milli-units otherwise. This means you might see your metric value fluctuate between 1 and 1500m, or 1 and 1.5 when written in decimal notation.
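As an illustration of this notation, the following quantities are equivalent ways of writing the same values:
500m == 0.5
1500m == 1.5
10500m == 10.5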
Other possible scenarios
Creating the autoscaler declaratively
Instead of using the kubectl autoscale command to create a HorizontalPodAutoscaler imperatively, you can use the following manifest to create it declaratively:
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: php-apache
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: php-apache
  minReplicas: 1
  maxReplicas: 10
  targetCPUUtilizationPercentage: 50
Then, create the autoscaler by executing the following command:
kubectl create -f https://k8s.io/examples/application/hpa/php-apache.yaml
horizontalpodautoscaler.autoscaling/php-apache created
4.7.9 - Specifying a Disruption Budget for your Application
Kubernetes v1.21 [stable]
This page shows how to limit the number of concurrent disruptions that your application experiences, allowing for higher availability while permitting the cluster administrator to manage the cluster's nodes.
Before you begin
Your Kubernetes server must be at or later than version v1.21. To check the version, enter kubectl version.
- You are the owner of an application running on a Kubernetes cluster that requires high availability.
- You should know how to deploy Replicated Stateless Applications and/or Replicated Stateful Applications.
- You should have read about Pod Disruptions.
- You should confirm with your cluster owner or service provider that they respect Pod Disruption Budgets.
Protecting an Application with a PodDisruptionBudget
- Identify what application you want to protect with a PodDisruptionBudget (PDB).
- Think about how your application reacts to disruptions.
- Create a PDB definition as a YAML file.
- Create the PDB object from the YAML file.
Identify an Application to Protect
The most common use case is when you want to protect an application specified by one of the built-in Kubernetes controllers:
- Deployment
- ReplicationController
- ReplicaSet
- StatefulSet
In this case, make a note of the controller's .spec.selector; the same selector goes into the PDB's .spec.selector.
From version 1.15 PDBs support custom controllers where the scale subresource is enabled.
You can also use PDBs with pods which are not controlled by one of the above controllers, or arbitrary groups of pods, but there are some restrictions, described in Arbitrary Controllers and Selectors.
Think about how your application reacts to disruptions
Decide how many instances can be down at the same time for a short period due to a voluntary disruption.
- Stateless frontends:
- Concern: don't reduce serving capacity by more than 10%.
- Solution: use PDB with minAvailable 90% for example.
- Single-instance Stateful Application:
- Concern: do not terminate this application without talking to me.
- Possible Solution 1: Do not use a PDB and tolerate occasional downtime.
- Possible Solution 2: Set PDB with maxUnavailable=0. Have an understanding (outside of Kubernetes) that the cluster operator needs to consult you before termination. When the cluster operator contacts you, prepare for downtime, and then delete the PDB to indicate readiness for disruption. Recreate afterwards.
- Multiple-instance Stateful application such as Consul, ZooKeeper, or etcd:
- Concern: Do not reduce number of instances below quorum, otherwise writes fail.
- Possible Solution 1: set maxUnavailable to 1 (works with varying scale of application).
- Possible Solution 2: set minAvailable to quorum-size (e.g. 3 when scale is 5). (Allows more disruptions at once).
- Restartable Batch Job:
- Concern: Job needs to complete in case of voluntary disruption.
- Possible solution: Do not create a PDB. The Job controller will create a replacement pod.
Rounding logic when specifying percentages
Values for minAvailable or maxUnavailable can be expressed as integers or as a percentage.
- When you specify an integer, it represents a number of Pods. For instance, if you set minAvailable to 10, then 10 Pods must always be available, even during a disruption.
- When you specify a percentage by setting the value to a string representation of a percentage (e.g. "50%"), it represents a percentage of total Pods. For instance, if you set maxUnavailable to "50%", then only 50% of the Pods can be unavailable during a disruption.
When you specify the value as a percentage, it may not map to an exact number of Pods. For example, if you have 7 Pods and you set minAvailable to "50%", it's not immediately obvious whether that means 3 Pods or 4 Pods must be available. Kubernetes rounds up to the nearest integer, so in this case, 4 Pods must be available. You can examine the code that controls this behavior.
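As a quick worked example of this rounding (plain arithmetic, not a Kubernetes command):
# 7 matching Pods, minAvailable: "50%"
#   7 * 0.50 = 3.5  ->  rounded up  ->  4 Pods must remain available
# so at most 7 - 4 = 3 Pods may be evicted voluntarily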
Specifying a PodDisruptionBudget
A PodDisruptionBudget has three fields:
- A label selector .spec.selector to specify the set of pods to which it applies. This field is required.
- .spec.minAvailable, which is a description of the number of pods from that set that must still be available after the eviction, even in the absence of the evicted pod. minAvailable can be either an absolute number or a percentage.
- .spec.maxUnavailable (available in Kubernetes 1.7 and higher), which is a description of the number of pods from that set that can be unavailable after the eviction. It can be either an absolute number or a percentage.
You can specify only one of maxUnavailable and minAvailable in a single PodDisruptionBudget.
maxUnavailable can only be used to control the eviction of pods that have an associated controller managing them. In the examples below, "desired replicas" is the scale of the controller managing the pods being selected by the PodDisruptionBudget.
Example 1: With a minAvailable of 5, evictions are allowed as long as they leave behind 5 or more healthy pods among those selected by the PodDisruptionBudget's selector.
Example 2: With a minAvailable of 30%, evictions are allowed as long as at least 30% of the number of desired replicas are healthy.
Example 3: With a maxUnavailable of 5, evictions are allowed as long as there are at most 5 unhealthy replicas among the total number of desired replicas.
Example 4: With a maxUnavailable of 30%, evictions are allowed as long as no more than 30% of the desired replicas are unhealthy.
In typical usage, a single budget would be used for a collection of pods managed by a controller—for example, the pods in a single ReplicaSet or StatefulSet.
If you set maxUnavailable to 0% or 0, or you set minAvailable to 100% or the number of replicas, you are requiring zero voluntary evictions. When you set zero voluntary evictions for a workload object such as ReplicaSet, then you cannot successfully drain a Node running one of those Pods. If you try to drain a Node where an unevictable Pod is running, the drain never completes. This is permitted as per the semantics of PodDisruptionBudget.
You can find examples of pod disruption budgets defined below. They match pods with the label app: zookeeper.
Example PDB Using minAvailable:
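A manifest along the following lines expresses the minAvailable form; the values are consistent with the zk-pdb status output shown later in this section:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: zk-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: zookeeper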
Example PDB Using maxUnavailable:
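An equivalent budget expressed with maxUnavailable might look like this:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: zk-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: zookeeper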
For example, if the above zk-pdb object selects the pods of a StatefulSet of size 3, both specifications have the exact same meaning. The use of maxUnavailable is recommended as it automatically responds to changes in the number of replicas of the corresponding controller.
Create the PDB object
You can create or update the PDB object using kubectl.
kubectl apply -f mypdb.yaml
Check the status of the PDB
Use kubectl to check that your PDB is created.
Assuming you don't actually have pods matching app: zookeeper in your namespace, then you'll see something like this:
kubectl get poddisruptionbudgets
NAME MIN AVAILABLE MAX UNAVAILABLE ALLOWED DISRUPTIONS AGE
zk-pdb 2 N/A 0 7s
If there are matching pods (say, 3), then you would see something like this:
kubectl get poddisruptionbudgets
NAME MIN AVAILABLE MAX UNAVAILABLE ALLOWED DISRUPTIONS AGE
zk-pdb 2 N/A 1 7s
The non-zero value for ALLOWED DISRUPTIONS means that the disruption controller has seen the pods, counted the matching pods, and updated the status of the PDB.
You can get more information about the status of a PDB with this command:
kubectl get poddisruptionbudgets zk-pdb -o yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
annotations:
…
creationTimestamp: "2020-03-04T04:22:56Z"
generation: 1
name: zk-pdb
…
status:
currentHealthy: 3
desiredHealthy: 2
disruptionsAllowed: 1
expectedPods: 3
observedGeneration: 1
Arbitrary Controllers and Selectors
You can skip this section if you only use PDBs with the built-in application controllers (Deployment, ReplicationController, ReplicaSet, and StatefulSet), with the PDB selector matching the controller's selector.
You can use a PDB with pods controlled by another type of controller, by an "operator", or bare pods, but with these restrictions:
- only .spec.minAvailable can be used, not .spec.maxUnavailable.
- only an integer value can be used with .spec.minAvailable, not a percentage.
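For example, a budget covering an arbitrary group of pods might look like the sketch below; the bare-pods-pdb name and app: my-bare-pods label are illustrative assumptions, not taken from this document:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: bare-pods-pdb       # hypothetical name
spec:
  minAvailable: 3           # must be an integer here; percentages are not allowed
  selector:
    matchLabels:
      app: my-bare-pods     # hypothetical label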
You can use a selector which selects a subset or superset of the pods belonging to a built-in controller. The eviction API will disallow eviction of any pod covered by multiple PDBs, so most users will want to avoid overlapping selectors. One reasonable use of overlapping PDBs is when pods are being transitioned from one PDB to another.
4.7.10 - Accessing the Kubernetes API from a Pod
This guide demonstrates how to access the Kubernetes API from within a pod.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of the Kubernetes playgrounds.
Accessing the API from within a Pod
When accessing the API from within a Pod, locating and authenticating to the API server are slightly different to the external client case.
The easiest way to use the Kubernetes API from a Pod is to use one of the official client libraries. These libraries can automatically discover the API server and authenticate.
Using Official Client Libraries
From within a Pod, the recommended ways to connect to the Kubernetes API are:
- For a Go client, use the official Go client library. The rest.InClusterConfig() function handles API host discovery and authentication automatically. See an example here.
- For a Python client, use the official Python client library. The config.load_incluster_config() function handles API host discovery and authentication automatically. See an example here.
- There are a number of other libraries available; please refer to the Client Libraries page.
In each case, the service account credentials of the Pod are used to communicate securely with the API server.
Directly accessing the REST API
While running in a Pod, the Kubernetes apiserver is accessible via a Service named kubernetes in the default namespace. Therefore, Pods can use the kubernetes.default.svc hostname to query the API server. Official client libraries do this automatically.
The recommended way to authenticate to the API server is with a service account credential. By default, a Pod is associated with a service account, and a credential (token) for that service account is placed into the filesystem tree of each container in that Pod, at /var/run/secrets/kubernetes.io/serviceaccount/token.
If available, a certificate bundle is placed into the filesystem tree of each container at /var/run/secrets/kubernetes.io/serviceaccount/ca.crt, and should be used to verify the serving certificate of the API server.
Finally, the default namespace to be used for namespaced API operations is placed in a file at /var/run/secrets/kubernetes.io/serviceaccount/namespace in each container.
Using kubectl proxy
If you would like to query the API without an official client library, you can run kubectl proxy as the command of a new sidecar container in the Pod. This way, kubectl proxy will authenticate to the API and expose it on the localhost interface of the Pod, so that other containers in the Pod can use it directly.
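A minimal sketch of such a Pod follows; the container names and the bitnami/kubectl image are illustrative assumptions, not prescribed by this document:
apiVersion: v1
kind: Pod
metadata:
  name: api-proxy-demo
spec:
  containers:
  - name: main
    image: busybox:1.28
    # This container can reach the API via the sidecar at http://localhost:8001
    command: ["sh", "-c", "sleep 3600"]
  - name: kubectl-proxy
    image: bitnami/kubectl    # any image that provides kubectl would do
    command: ["kubectl", "proxy", "--port=8001"]
The main container could then, for example, fetch http://localhost:8001/api without handling tokens or certificates itself.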
Without using a proxy
It is possible to avoid using the kubectl proxy by passing the authentication token directly to the API server. The internal certificate secures the connection.
# Point to the internal API server hostname
APISERVER=https://kubernetes.default.svc
# Path to ServiceAccount token
SERVICEACCOUNT=/var/run/secrets/kubernetes.io/serviceaccount
# Read this Pod's namespace
NAMESPACE=$(cat ${SERVICEACCOUNT}/namespace)
# Read the ServiceAccount bearer token
TOKEN=$(cat ${SERVICEACCOUNT}/token)
# Reference the internal certificate authority (CA)
CACERT=${SERVICEACCOUNT}/ca.crt
# Explore the API with TOKEN
curl --cacert ${CACERT} --header "Authorization: Bearer ${TOKEN}" -X GET ${APISERVER}/api
The output will be similar to this:
{
"kind": "APIVersions",
"versions": [
"v1"
],
"serverAddressByClientCIDRs": [
{
"clientCIDR": "0.0.0.0/0",
"serverAddress": "10.0.1.149:443"
}
]
}
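Building on the variables set above, you can also query namespaced resources. The request below lists the Pods in this Pod's own namespace; it is an illustration, not part of the original walkthrough, and succeeds only if the Pod's service account has RBAC permission to list pods:
# List Pods in this Pod's namespace (requires suitable RBAC)
curl --cacert ${CACERT} --header "Authorization: Bearer ${TOKEN}" -X GET ${APISERVER}/api/v1/namespaces/${NAMESPACE}/pods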
4.8 - Run Jobs
4.8.1 - Running Automated Tasks with a CronJob
CronJob was promoted to general availability in Kubernetes v1.21. If you are using an older version of Kubernetes, please refer to the documentation for the version of Kubernetes that you are using, so that you see accurate information. Older Kubernetes versions do not support the batch/v1 CronJob API.
You can use a CronJob to run Jobs on a time-based schedule. These automated jobs run like Cron tasks on a Linux or UNIX system.
Cron jobs are useful for creating periodic and recurring tasks, like running backups or sending emails. Cron jobs can also schedule individual tasks for a specific time, such as if you want to schedule a job for a low activity period.
Cron jobs have limitations and idiosyncrasies. For example, in certain circumstances, a single cron job can create multiple jobs. Therefore, jobs should be idempotent.
For more limitations, see CronJobs.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of the Kubernetes playgrounds.
Creating a Cron Job
Cron jobs require a config file.
This example cron job config .spec file prints the current time and a hello message every minute:
apiVersion: batch/v1
kind: CronJob
metadata:
name: hello
spec:
schedule: "* * * * *"
jobTemplate:
spec:
template:
spec:
containers:
- name: hello
image: busybox:1.28
imagePullPolicy: IfNotPresent
command:
- /bin/sh
- -c
- date; echo Hello from the Kubernetes cluster
restartPolicy: OnFailure
Run the example CronJob by using this command:
kubectl create -f https://k8s.io/examples/application/job/cronjob.yaml
The output is similar to this:
cronjob.batch/hello created
After creating the cron job, get its status using this command:
kubectl get cronjob hello
The output is similar to this:
NAME SCHEDULE SUSPEND ACTIVE LAST SCHEDULE AGE
hello */1 * * * * False 0 <none> 10s
As you can see from the results of the command, the cron job has not scheduled or run any jobs yet. Watch for the job to be created in around one minute:
kubectl get jobs --watch
The output is similar to this:
NAME COMPLETIONS DURATION AGE
hello-4111706356 0/1 0s
hello-4111706356 0/1 0s 0s
hello-4111706356 1/1 5s 5s
Now you've seen one running job scheduled by the "hello" cron job. You can stop watching the job and view the cron job again to see that it scheduled the job:
kubectl get cronjob hello
The output is similar to this:
NAME SCHEDULE SUSPEND ACTIVE LAST SCHEDULE AGE
hello */1 * * * * False 0 50s 75s
You should see that the cron job hello successfully scheduled a job at the time specified in LAST SCHEDULE. There are currently 0 active jobs, meaning that the job has completed or failed.
Now, find the pods that the last scheduled job created and view the standard output of one of the pods.
# Replace "hello-4111706356" with the job name in your system
pods=$(kubectl get pods --selector=job-name=hello-4111706356 --output=jsonpath={.items[*].metadata.name})
Show pod log:
kubectl logs $pods
The output is similar to this:
Fri Feb 22 11:02:09 UTC 2019
Hello from the Kubernetes cluster
Deleting a Cron Job
When you don't need a cron job any more, delete it with kubectl delete cronjob <cronjob name>:
kubectl delete cronjob hello
Deleting the cron job removes all the jobs and pods it created and stops it from creating additional jobs. You can read more about removing jobs in garbage collection.
Writing a Cron Job Spec
As with all other Kubernetes configs, a cron job needs apiVersion, kind, and metadata fields. For general information about working with config files, see the deploying applications and using kubectl to manage resources documents.
A cron job config also needs a .spec section.
Note: All modifications to a cron job, especially its .spec, are applied only to the following runs.
Schedule
The .spec.schedule is a required field of the .spec. It takes a Cron format string, such as 0 * * * * or @hourly, as the schedule time of its jobs to be created and executed.
The format also includes extended "Vixie cron" step values. As explained in the FreeBSD manual:
Step values can be used in conjunction with ranges. Following a range with /<number> specifies skips of the number's value through the range. For example, 0-23/2 can be used in the hours field to specify command execution every other hour (the alternative in the V7 standard is 0,2,4,6,8,10,12,14,16,18,20,22). Steps are also permitted after an asterisk, so if you want to say "every two hours", just use */2.
Note: A question mark (?) in the schedule has the same meaning as an asterisk *, that is, it stands for any available value for a given field.
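A few illustrative schedule strings (standard cron semantics; these examples are not from the original page):
# "*/5 * * * *"  - every five minutes
# "0 */2 * * *"  - at minute 0 of every second hour
# "0 0 * * 0"    - at midnight every Sunday
# "@hourly"      - once an hour, at the beginning of the hour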
Job Template
The .spec.jobTemplate is the template for the job, and it is required. It has exactly the same schema as a Job, except that it is nested and does not have an apiVersion or kind.
For information about writing a job .spec, see Writing a Job Spec.
Starting Deadline
The .spec.startingDeadlineSeconds field is optional. It stands for the deadline in seconds for starting the job if it misses its scheduled time for any reason. After the deadline, the cron job does not start the job. Jobs that do not meet their deadline in this way count as failed jobs. If this field is not specified, the jobs have no deadline.
If the .spec.startingDeadlineSeconds field is set (not null), the CronJob controller measures the time between when a job is expected to be created and now. If the difference is higher than that limit, it will skip this execution.
For example, if it is set to 200, it allows a job to be created for up to 200 seconds after the actual schedule.
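For instance, assuming the hello CronJob created earlier on this page still exists, you could set the field in place with a patch; the 200-second value mirrors the example above:
kubectl patch cronjob hello -p '{"spec":{"startingDeadlineSeconds":200}}'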
Concurrency Policy
The .spec.concurrencyPolicy field is also optional. It specifies how to treat concurrent executions of a job that is created by this cron job. The spec may specify only one of the following concurrency policies:
- Allow (default): The cron job allows concurrently running jobs
- Forbid: The cron job does not allow concurrent runs; if it is time for a new job run and the previous job run hasn't finished yet, the cron job skips the new job run
- Replace: If it is time for a new job run and the previous job run hasn't finished yet, the cron job replaces the currently running job run with a new job run
Note that concurrency policy only applies to the jobs created by the same cron job. If there are multiple cron jobs, their respective jobs are always allowed to run concurrently.
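As an illustration, assuming the hello CronJob from earlier, you could switch it to the Forbid policy with a patch:
kubectl patch cronjob hello -p '{"spec":{"concurrencyPolicy":"Forbid"}}'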
Suspend
The .spec.suspend field is also optional. If it is set to true, all subsequent executions are suspended. This setting does not apply to already started executions. Defaults to false.
Note: Executions that are suspended during their scheduled time count as missed jobs. When .spec.suspend changes from true to false on an existing cron job without a starting deadline, the missed jobs are scheduled immediately.
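For example, assuming the hello CronJob from earlier, you could pause and later resume it like this:
# Suspend future runs
kubectl patch cronjob hello -p '{"spec":{"suspend":true}}'
# Resume; without a starting deadline, missed runs are then scheduled immediately
kubectl patch cronjob hello -p '{"spec":{"suspend":false}}'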
Jobs History Limits
The .spec.successfulJobsHistoryLimit and .spec.failedJobsHistoryLimit fields are optional. These fields specify how many completed and failed jobs should be kept. By default, they are set to 3 and 1 respectively. Setting a limit to 0 corresponds to keeping none of the corresponding kind of jobs after they finish.
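For example, assuming the hello CronJob from earlier, you could adjust both limits with a patch; the values 5 and 2 are arbitrary examples:
kubectl patch cronjob hello -p '{"spec":{"successfulJobsHistoryLimit":5,"failedJobsHistoryLimit":2}}'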
4.8.2 - Coarse Parallel Processing Using a Work Queue
In this example, we will run a Kubernetes Job with multiple parallel worker processes.
In this example, as each pod is created, it picks up one unit of work from a task queue, completes it, deletes it from the queue, and exits.
Here is an overview of the steps in this example:
- Start a message queue service. In this example, we use RabbitMQ, but you could use another one. In practice you would set up a message queue service once and reuse it for many jobs.
- Create a queue, and fill it with messages. Each message represents one task to be done. In this example, a message is an integer that we will do a lengthy computation on.
- Start a Job that works on tasks from the queue. The Job starts several pods. Each pod takes one task from the message queue, processes it, and repeats until the end of the queue is reached.
Before you begin
Be familiar with the basic, non-parallel, use of Job.
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of the Kubernetes playgrounds.
Starting a message queue service
This example uses RabbitMQ, however, you can adapt the example to use another AMQP-type message service.
In practice you could set up a message queue service once in a cluster and reuse it for many jobs, as well as for long-running services.
Start RabbitMQ as follows:
kubectl create -f https://raw.githubusercontent.com/kubernetes/kubernetes/release-1.3/examples/celery-rabbitmq/rabbitmq-service.yaml
service "rabbitmq-service" created
kubectl create -f https://raw.githubusercontent.com/kubernetes/kubernetes/release-1.3/examples/celery-rabbitmq/rabbitmq-controller.yaml
replicationcontroller "rabbitmq-controller" created
We will only use the rabbitmq part from the celery-rabbitmq example.
Testing the message queue service
Now, we can experiment with accessing the message queue. We will create a temporary interactive pod, install some tools on it, and experiment with queues.
First create a temporary interactive Pod.
# Create a temporary interactive container
kubectl run -i --tty temp --image ubuntu:18.04
Waiting for pod default/temp-loe07 to be running, status is Pending, pod ready: false
... [ previous line repeats several times .. hit return when it stops ] ...
Note that your pod name and command prompt will be different.
Next, install the amqp-tools so we can work with message queues.
# Install some tools
root@temp-loe07:/# apt-get update
.... [ lots of output ] ....
root@temp-loe07:/# apt-get install -y curl ca-certificates amqp-tools python dnsutils
.... [ lots of output ] ....
Later, we will make a docker image that includes these packages.
Next, we will check that we can discover the rabbitmq service:
# Note the rabbitmq-service has a DNS name, provided by Kubernetes:
root@temp-loe07:/# nslookup rabbitmq-service
Server: 10.0.0.10
Address: 10.0.0.10#53
Name: rabbitmq-service.default.svc.cluster.local
Address: 10.0.147.152
# Your address will vary.
If Kube-DNS is not set up correctly, the previous step may not work for you. You can also find the service IP in an env var:
# env | grep RABBIT | grep HOST
RABBITMQ_SERVICE_SERVICE_HOST=10.0.147.152
# Your address will vary.
Next we will verify we can create a queue, and publish and consume messages.
# In the next line, rabbitmq-service is the hostname where the rabbitmq-service
# can be reached. 5672 is the standard port for rabbitmq.
root@temp-loe07:/# export BROKER_URL=amqp://guest:guest@rabbitmq-service:5672
# If you could not resolve "rabbitmq-service" in the previous step,
# then use this command instead:
# root@temp-loe07:/# BROKER_URL=amqp://guest:guest@$RABBITMQ_SERVICE_SERVICE_HOST:5672
# Now create a queue:
root@temp-loe07:/# /usr/bin/amqp-declare-queue --url=$BROKER_URL -q foo -d
foo
# Publish one message to it:
root@temp-loe07:/# /usr/bin/amqp-publish --url=$BROKER_URL -r foo -p -b Hello
# And get it back.
root@temp-loe07:/# /usr/bin/amqp-consume --url=$BROKER_URL -q foo -c 1 cat && echo
Hello
root@temp-loe07:/#
In the last command, the amqp-consume tool takes one message (-c 1) from the queue, and passes that message to the standard input of an arbitrary command. In this case, the program cat prints out the characters read from standard input, and the echo adds a carriage return so the example is readable.
Filling the Queue with tasks
Now let's fill the queue with some "tasks". In our example, our tasks are strings to be printed.
In practice, the content of the messages might be:
- names of files that need to be processed
- extra flags to the program
- ranges of keys in a database table
- configuration parameters to a simulation
- frame numbers of a scene to be rendered
In practice, if there is large data that is needed in a read-only mode by all pods of the Job, you will typically put that in a shared file system like NFS and mount it read-only on all the pods, or the program in the pod will natively read data from a cluster file system like HDFS.
For our example, we will create the queue and fill it using the amqp command line tools. In practice, you might write a program to fill the queue using an amqp client library.
/usr/bin/amqp-declare-queue --url=$BROKER_URL -q job1 -d
job1
for f in apple banana cherry date fig grape lemon melon
do
/usr/bin/amqp-publish --url=$BROKER_URL -r job1 -p -b $f
done
So, we filled the queue with 8 messages.
Create an Image
Now we are ready to create an image that we will run as a job. We will use the amqp-consume utility to read the message from the queue and run our actual program. Here is a very simple example program:
#!/usr/bin/env python
# Just prints the first line of standard input and sleeps for 10 seconds.
import sys
import time
print("Processing " + sys.stdin.readlines()[0])
time.sleep(10)
Give the script execution permission:
chmod +x worker.py
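The Dockerfile mentioned below bundles this script with the amqp-tools packages installed interactively earlier; a minimal sketch of what it might contain (the actual file in the examples tree may differ):
FROM ubuntu:18.04
# Install the AMQP command-line tools and Python, matching the interactive test above
RUN apt-get update && apt-get install -y curl ca-certificates amqp-tools python && rm -rf /var/lib/apt/lists/*
COPY ./worker.py /worker.py
# Consume one message and pass it to the worker on standard input
CMD /usr/bin/amqp-consume --url=$BROKER_URL -q $QUEUE -c 1 /worker.py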
Now, build an image. If you are working in the source tree, then change directory to examples/job/work-queue-1. Otherwise, make a temporary directory, change to it, and download the Dockerfile and worker.py. In either case, build the image with this command:
docker build -t job-wq-1 .
For the Docker Hub, tag your app image with your username and push to the Hub with the below commands. Replace <username> with your Hub username.
docker tag job-wq-1 <username>/job-wq-1
docker push <username>/job-wq-1
If you are using Google Container Registry, tag your app image with your project ID, and push to GCR. Replace <project> with your project ID.
docker tag job-wq-1 gcr.io/<project>/job-wq-1
gcloud docker -- push gcr.io/<project>/job-wq-1
Defining a Job
Here is a job definition. You'll need to make a copy of the Job and edit the image to match the name you used, and call it ./job.yaml.
apiVersion: batch/v1
kind: Job
metadata:
name: job-wq-1
spec:
completions: 8
parallelism: 2
template:
metadata:
name: job-wq-1
spec:
containers:
- name: c
image: gcr.io/<project>/job-wq-1
env:
- name: BROKER_URL
value: amqp://guest:guest@rabbitmq-service:5672
- name: QUEUE
value: job1
restartPolicy: OnFailure
In this example, each pod works on one item from the queue and then exits. So, the completion count of the Job corresponds to the number of work items done, and we set .spec.completions: 8 for the example, since we put 8 items in the queue.
Running the Job
So, now run the Job:
kubectl apply -f ./job.yaml
Now wait a bit, then check on the job.
kubectl describe jobs/job-wq-1
Name: job-wq-1
Namespace: default
Selector: controller-uid=41d75705-92df-11e7-b85e-fa163ee3c11f
Labels: controller-uid=41d75705-92df-11e7-b85e-fa163ee3c11f
job-name=job-wq-1
Annotations: <none>
Parallelism: 2
Completions: 8
Start Time: Wed, 06 Sep 2017 16:42:02 +0800
Pods Statuses: 0 Running / 8 Succeeded / 0 Failed
Pod Template:
Labels: controller-uid=41d75705-92df-11e7-b85e-fa163ee3c11f
job-name=job-wq-1
Containers:
c:
Image: gcr.io/causal-jigsaw-637/job-wq-1
Port:
Environment:
BROKER_URL: amqp://guest:guest@rabbitmq-service:5672
QUEUE: job1
Mounts: <none>
Volumes: <none>
Events:
FirstSeen LastSeen Count From SubobjectPath Type Reason Message
───────── ──────── ───── ──── ───────────── ────── ────── ───────
27s 27s 1 {job } Normal SuccessfulCreate Created pod: job-wq-1-hcobb
27s 27s 1 {job } Normal SuccessfulCreate Created pod: job-wq-1-weytj
27s 27s 1 {job } Normal SuccessfulCreate Created pod: job-wq-1-qaam5
27s 27s 1 {job } Normal SuccessfulCreate Created pod: job-wq-1-b67sr
26s 26s 1 {job } Normal SuccessfulCreate Created pod: job-wq-1-xe5hj
15s 15s 1 {job } Normal SuccessfulCreate Created pod: job-wq-1-w2zqe
14s 14s 1 {job } Normal SuccessfulCreate Created pod: job-wq-1-d6ppa
14s 14s 1 {job } Normal SuccessfulCreate Created pod: job-wq-1-p17e0
All our pods succeeded. Yay.
Alternatives
This approach has the advantage that you do not need to modify your "worker" program to be aware that there is a work queue.
It does require that you run a message queue service. If running a queue service is inconvenient, you may want to consider one of the other job patterns.
This approach creates a pod for every work item. If your work items only take a few seconds, though, creating a Pod for every work item may add a lot of overhead. Consider another example that executes multiple work items per Pod.
In this example, we use the amqp-consume utility to read the message from the queue and run our actual program. This has the advantage that you do not need to modify your program to be aware of the queue.
A different example shows how to communicate with the work queue using a client library.
Caveats
If the number of completions is set to less than the number of items in the queue, then not all items will be processed.
If the number of completions is set to more than the number of items in the queue, then the Job will not appear to be completed, even though all items in the queue have been processed. It will start additional pods which will block waiting for a message.
There is an unlikely race with this pattern. If the container is killed in between the time that the message is acknowledged by the amqp-consume command and the time that the container exits with success, or if the node crashes before the kubelet is able to post the success of the pod back to the api-server, then the Job will not appear to be complete, even though all items in the queue have been processed.
4.8.3 - Fine Parallel Processing Using a Work Queue
In this example, we will run a Kubernetes Job with multiple parallel worker processes in a given pod.
In this example, as each pod is created, it picks up one unit of work from a task queue, processes it, and repeats until the end of the queue is reached.
Here is an overview of the steps in this example:
- Start a storage service to hold the work queue. In this example, we use Redis to store our work items. In the previous example, we used RabbitMQ. In this example, we use Redis and a custom work-queue client library because AMQP does not provide a good way for clients to detect when a finite-length work queue is empty. In practice you would set up a store such as Redis once and reuse it for the work queues of many jobs, and other things.
- Create a queue, and fill it with messages. Each message represents one task to be done. In this example, a message is an integer that we will do a lengthy computation on.
- Start a Job that works on tasks from the queue. The Job starts several pods. Each pod takes one task from the message queue, processes it, and repeats until the end of the queue is reached.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of the Kubernetes playgrounds.
Be familiar with the basic, non-parallel, use of Job.
Starting Redis
For this example, for simplicity, we will start a single instance of Redis. See the Redis Example for an example of deploying Redis scalably and redundantly.
You could also download the following files directly:
Filling the Queue with tasks
Now let's fill the queue with some "tasks". In our example, our tasks are strings to be printed.
Start a temporary interactive pod for running the Redis CLI.
kubectl run -i --tty temp --image redis --command "/bin/sh"
Waiting for pod default/redis2-c7h78 to be running, status is Pending, pod ready: false
Hit enter for command prompt
Now hit enter, start the redis CLI, and create a list with some work items in it.
# redis-cli -h redis
redis:6379> rpush job2 "apple"
(integer) 1
redis:6379> rpush job2 "banana"
(integer) 2
redis:6379> rpush job2 "cherry"
(integer) 3
redis:6379> rpush job2 "date"
(integer) 4
redis:6379> rpush job2 "fig"
(integer) 5
redis:6379> rpush job2 "grape"
(integer) 6
redis:6379> rpush job2 "lemon"
(integer) 7
redis:6379> rpush job2 "melon"
(integer) 8
redis:6379> rpush job2 "orange"
(integer) 9
redis:6379> lrange job2 0 -1
1) "apple"
2) "banana"
3) "cherry"
4) "date"
5) "fig"
6) "grape"
7) "lemon"
8) "melon"
9) "orange"
So, the list with key job2 will be our work queue.
Note: if you do not have Kube DNS set up correctly, you may need to change the first step of the above block to redis-cli -h $REDIS_SERVICE_HOST.
Create an Image
Now we are ready to create an image that we will run.
We will use a python worker program with a redis client to read the messages from the message queue.
A simple Redis work queue client library is provided, called rediswq.py.
The "worker" program in each Pod of the Job uses the work queue client library to get work. Here it is:
#!/usr/bin/env python
import time
import rediswq
host="redis"
# Uncomment next two lines if you do not have Kube-DNS working.
# import os
# host = os.getenv("REDIS_SERVICE_HOST")
q = rediswq.RedisWQ(name="job2", host=host)
print("Worker with sessionID: " + q.sessionID())
print("Initial queue state: empty=" + str(q.empty()))
while not q.empty():
item = q.lease(lease_secs=10, block=True, timeout=2)
if item is not None:
itemstr = item.decode("utf-8")
print("Working on " + itemstr)
time.sleep(10) # Put your actual work here instead of sleep.
q.complete(item)
else:
print("Waiting for work")
print("Queue empty, exiting")
You could also download the worker.py, rediswq.py, and Dockerfile files, then build the image:
docker build -t job-wq-2 .
Push the image
For the Docker Hub, tag your app image with your username and push to the Hub with the below commands. Replace <username> with your Hub username.
docker tag job-wq-2 <username>/job-wq-2
docker push <username>/job-wq-2
You need to push to a public repository or configure your cluster to be able to access your private repository.
If you are using Google Container Registry, tag your app image with your project ID, and push to GCR. Replace <project> with your project ID.
docker tag job-wq-2 gcr.io/<project>/job-wq-2
gcloud docker -- push gcr.io/<project>/job-wq-2
Defining a Job
Here is the job definition:
apiVersion: batch/v1
kind: Job
metadata:
name: job-wq-2
spec:
parallelism: 2
template:
metadata:
name: job-wq-2
spec:
containers:
- name: c
image: gcr.io/myproject/job-wq-2
restartPolicy: OnFailure
Be sure to edit the job template to change gcr.io/myproject to your own path.
In this example, each pod works on several items from the queue and then exits when there are no more items. Since the workers themselves detect when the workqueue is empty, and the Job controller does not know about the workqueue, it relies on the workers to signal when they are done working. The workers signal that the queue is empty by exiting with success. So, as soon as any worker exits with success, the controller knows the work is done, and the Pods will exit soon. So, we set the completion count of the Job to 1. The job controller will wait for the other pods to complete too.
Running the Job
So, now run the Job:
kubectl apply -f ./job.yaml
Now wait a bit, then check on the job.
kubectl describe jobs/job-wq-2
Name: job-wq-2
Namespace: default
Selector: controller-uid=b1c7e4e3-92e1-11e7-b85e-fa163ee3c11f
Labels: controller-uid=b1c7e4e3-92e1-11e7-b85e-fa163ee3c11f
job-name=job-wq-2
Annotations: <none>
Parallelism: 2
Completions: <unset>
Start Time: Mon, 11 Jan 2016 17:07:59 -0800
Pods Statuses: 1 Running / 0 Succeeded / 0 Failed
Pod Template:
Labels: controller-uid=b1c7e4e3-92e1-11e7-b85e-fa163ee3c11f
job-name=job-wq-2
Containers:
c:
Image: gcr.io/exampleproject/job-wq-2
Port:
Environment: <none>
Mounts: <none>
Volumes: <none>
Events:
FirstSeen LastSeen Count From SubobjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
33s 33s 1 {job-controller } Normal SuccessfulCreate Created pod: job-wq-2-lglf8
kubectl logs pods/job-wq-2-7r7b2
Worker with sessionID: bbd72d0a-9e5c-4dd6-abf6-416cc267991f
Initial queue state: empty=False
Working on banana
Working on date
Working on lemon
As you can see, one of our pods worked on several work units.
Alternatives
If running a queue service or modifying your containers to use a work queue is inconvenient, you may want to consider one of the other job patterns.
If you have a continuous stream of background processing work to run, then consider running your background workers with a ReplicaSet instead, and consider running a background processing library such as https://github.com/resque/resque.
4.8.4 - Indexed Job for Parallel Processing with Static Work Assignment
Kubernetes v1.22 [beta]
In this example, you will run a Kubernetes Job that uses multiple parallel worker processes. Each worker is a different container running in its own Pod. The Pods have an index number that the control plane sets automatically, which allows each Pod to identify which part of the overall task to work on.
The pod index is available in the annotation batch.kubernetes.io/job-completion-index as a string representing its decimal value. In order for the containerized task process to obtain this index, you can publish the value of the annotation using the downward API mechanism.
For convenience, the control plane automatically sets the downward API to expose the index in the JOB_COMPLETION_INDEX environment variable.
Here is an overview of the steps in this example:
- Define a Job manifest using indexed completion. The downward API allows you to pass the pod index annotation as an environment variable or file to the container.
- Start an Indexed Job based on that manifest.
Before you begin
You should already be familiar with the basic, non-parallel, use of Job.
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of the Kubernetes playgrounds.
Your Kubernetes server must be at or later than version v1.21. To check the version, enter kubectl version.
Choose an approach
To access the work item from the worker program, you have a few options:
- Read the JOB_COMPLETION_INDEX environment variable. The Job controller automatically links this variable to the annotation containing the completion index.
- Read a file that contains the completion index.
- Assuming that you can't modify the program, you can wrap it with a script that reads the index using any of the methods above and converts it into something that the program can use as input.
For this example, imagine that you chose option 3 and you want to run the rev utility. This program accepts a file as an argument and prints its content reversed.
rev data.txt
You'll use the rev tool from the busybox container image.
As this is only an example, each Pod only does a tiny piece of work (reversing a short string). In a real workload you might, for example, create a Job that represents the task of producing 60 seconds of video based on scene data. Each work item in the video rendering Job would be to render a particular frame of that video clip. Indexed completion would mean that each Pod in the Job knows which frame to render and publish, by counting frames from the start of the clip.
Define an Indexed Job
Here is a sample Job manifest that uses Indexed completion mode:
apiVersion: batch/v1
kind: Job
metadata:
name: 'indexed-job'
spec:
completions: 5
parallelism: 3
completionMode: Indexed
template:
spec:
restartPolicy: Never
initContainers:
- name: 'input'
image: 'docker.io/library/bash'
command:
- "bash"
- "-c"
- |
items=(foo bar baz qux xyz)
echo ${items[$JOB_COMPLETION_INDEX]} > /input/data.txt
volumeMounts:
- mountPath: /input
name: input
containers:
- name: 'worker'
image: 'docker.io/library/busybox'
command:
- "rev"
- "/input/data.txt"
volumeMounts:
- mountPath: /input
name: input
volumes:
- name: input
emptyDir: {}
In the example above, you use the built-in JOB_COMPLETION_INDEX environment variable set by the Job controller for all containers. An init container maps the index to a static value and writes it to a file that is shared with the container running the worker through an emptyDir volume.
Optionally, you can define your own environment variable through the downward API to publish the index to containers. You can also choose to load a list of values from a ConfigMap as an environment variable or file.
Alternatively, you can directly use the downward API to pass the annotation value as a volume file, as shown in the following example:
apiVersion: batch/v1
kind: Job
metadata:
name: 'indexed-job'
spec:
completions: 5
parallelism: 3
completionMode: Indexed
template:
spec:
restartPolicy: Never
containers:
- name: 'worker'
image: 'docker.io/library/busybox'
command:
- "rev"
- "/input/data.txt"
volumeMounts:
- mountPath: /input
name: input
volumes:
- name: input
downwardAPI:
items:
- path: "data.txt"
fieldRef:
fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
Running the Job
Now run the Job:
# This uses the first approach (relying on $JOB_COMPLETION_INDEX)
kubectl apply -f https://kubernetes.io/examples/application/job/indexed-job.yaml
When you create this Job, the control plane creates a series of Pods, one for each index you specified. The value of .spec.parallelism determines how many can run at once, whereas .spec.completions determines how many Pods the Job creates in total.
Because .spec.parallelism is less than .spec.completions, the control plane waits for some of the first Pods to complete before starting more of them.
Once you have created the Job, wait a moment then check on progress:
kubectl describe jobs/indexed-job
The output is similar to:
Name: indexed-job
Namespace: default
Selector: controller-uid=bf865e04-0b67-483b-9a90-74cfc4c3e756
Labels: controller-uid=bf865e04-0b67-483b-9a90-74cfc4c3e756
job-name=indexed-job
Annotations: <none>
Parallelism: 3
Completions: 5
Start Time: Thu, 11 Mar 2021 15:47:34 +0000
Pods Statuses: 2 Running / 3 Succeeded / 0 Failed
Completed Indexes: 0-2
Pod Template:
Labels: controller-uid=bf865e04-0b67-483b-9a90-74cfc4c3e756
job-name=indexed-job
Init Containers:
input:
Image: docker.io/library/bash
Port: <none>
Host Port: <none>
Command:
bash
-c
items=(foo bar baz qux xyz)
echo ${items[$JOB_COMPLETION_INDEX]} > /input/data.txt
Environment: <none>
Mounts:
/input from input (rw)
Containers:
worker:
Image: docker.io/library/busybox
Port: <none>
Host Port: <none>
Command:
rev
/input/data.txt
Environment: <none>
Mounts:
/input from input (rw)
Volumes:
input:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreate 4s job-controller Created pod: indexed-job-njkjj
Normal SuccessfulCreate 4s job-controller Created pod: indexed-job-9kd4h
Normal SuccessfulCreate 4s job-controller Created pod: indexed-job-qjwsz
Normal SuccessfulCreate 1s job-controller Created pod: indexed-job-fdhq5
Normal SuccessfulCreate 1s job-controller Created pod: indexed-job-ncslj
In this example, you run the Job with custom values for each index. You can inspect the output of one of the pods:
kubectl logs indexed-job-fdhq5 # Change this to match the name of a Pod from that Job
The output is similar to:
xuq
4.8.5 - Parallel Processing using Expansions
This task demonstrates running multiple Jobs based on a common template. You can use this approach to process batches of work in parallel.
For this example there are only three items: apple, banana, and cherry. The sample Jobs process each item by printing a string then pausing.
See using Jobs in real workloads to learn about how this pattern fits more realistic use cases.
Before you begin
You should be familiar with the basic, non-parallel, use of Job.
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of the Kubernetes playgrounds.
For basic templating you need the command-line utility sed.
To follow the advanced templating example, you need a working installation of Python, and the Jinja2 template library for Python.
Once you have Python set up, you can install Jinja2 by running:
pip install --user jinja2
Create Jobs based on a template
First, download the following template of a Job to a file called job-tmpl.yaml.
Here's what you'll download:
apiVersion: batch/v1
kind: Job
metadata:
name: process-item-$ITEM
labels:
jobgroup: jobexample
spec:
template:
metadata:
name: jobexample
labels:
jobgroup: jobexample
spec:
containers:
- name: c
image: busybox:1.28
command: ["sh", "-c", "echo Processing item $ITEM && sleep 5"]
restartPolicy: Never
# Use curl to download job-tmpl.yaml
curl -L -s -O https://k8s.io/examples/application/job/job-tmpl.yaml
The file you downloaded is not yet a valid Kubernetes manifest. Instead, that template is a YAML representation of a Job object with some placeholders that need to be filled in before it can be used. The $ITEM syntax is not meaningful to Kubernetes.
Create manifests from the template
The following shell snippet uses sed to replace the string $ITEM with the loop variable, writing into a temporary directory named jobs. Run this now:
# Expand the template into multiple files, one for each item to be processed.
mkdir ./jobs
for i in apple banana cherry
do
cat job-tmpl.yaml | sed "s/\$ITEM/$i/" > ./jobs/job-$i.yaml
done
Check if it worked:
ls jobs/
The output is similar to this:
job-apple.yaml
job-banana.yaml
job-cherry.yaml
You could use any type of template language (for example: Jinja2; ERB), or write a program to generate the Job manifests.
Create Jobs from the manifests
Next, create all the Jobs with one kubectl command:
kubectl create -f ./jobs
The output is similar to this:
job.batch/process-item-apple created
job.batch/process-item-banana created
job.batch/process-item-cherry created
Now, check on the jobs:
kubectl get jobs -l jobgroup=jobexample
The output is similar to this:
NAME COMPLETIONS DURATION AGE
process-item-apple 1/1 14s 22s
process-item-banana 1/1 12s 21s
process-item-cherry 1/1 12s 20s
Using the -l option to kubectl selects only the Jobs that are part of this group of jobs (there might be other unrelated jobs in the system).
You can check on the Pods as well using the same label selector:
kubectl get pods -l jobgroup=jobexample
The output is similar to:
NAME READY STATUS RESTARTS AGE
process-item-apple-kixwv 0/1 Completed 0 4m
process-item-banana-wrsf7 0/1 Completed 0 4m
process-item-cherry-dnfu9 0/1 Completed 0 4m
We can use this single command to check on the output of all jobs at once:
kubectl logs -f -l jobgroup=jobexample
The output should be:
Processing item apple
Processing item banana
Processing item cherry
Clean up
# Remove the Jobs you created
# Your cluster automatically cleans up their Pods
kubectl delete job -l jobgroup=jobexample
Use advanced template parameters
In the first example, each instance of the template had one parameter, and that parameter was also used in the Job's name. However, names are restricted to contain only certain characters.
This slightly more complex example uses the Jinja template language to generate manifests and then objects from those manifests, with multiple parameters for each Job.
For this part of the task, you are going to use a one-line Python script to convert the template to a set of manifests.
First, copy and paste the following template of a Job object, into a file called job.yaml.jinja2
:
{% set params = [{ "name": "apple", "url": "http://dbpedia.org/resource/Apple", },
{ "name": "banana", "url": "http://dbpedia.org/resource/Banana", },
{ "name": "cherry", "url": "http://dbpedia.org/resource/Cherry" }]
%}
{% for p in params %}
{% set name = p["name"] %}
{% set url = p["url"] %}
---
apiVersion: batch/v1
kind: Job
metadata:
name: jobexample-{{ name }}
labels:
jobgroup: jobexample
spec:
template:
metadata:
name: jobexample
labels:
jobgroup: jobexample
spec:
containers:
- name: c
image: busybox:1.28
command: ["sh", "-c", "echo Processing URL {{ url }} && sleep 5"]
restartPolicy: Never
{% endfor %}
The above template defines two parameters for each Job object using a list of Python dicts (lines 1-4). A for loop emits one Job manifest for each set of parameters (remaining lines).
This example relies on a feature of YAML. One YAML file can contain multiple documents (Kubernetes manifests, in this case), separated by --- on a line by itself. You can pipe the output directly to kubectl to create the Jobs.
Next, use this one-line Python program to expand the template:
alias render_template='python -c "from jinja2 import Template; import sys; print(Template(sys.stdin.read()).render());"'
Use render_template to convert the parameters and template into a single YAML file containing Kubernetes manifests:
# This requires the alias you defined earlier
cat job.yaml.jinja2 | render_template > jobs.yaml
You can view jobs.yaml to verify that the render_template script worked correctly.
Once you are happy that render_template is working how you intend, you can pipe its output into kubectl:
cat job.yaml.jinja2 | render_template | kubectl apply -f -
Kubernetes accepts and runs the Jobs you created.
Clean up
# Remove the Jobs you created
# Your cluster automatically cleans up their Pods
kubectl delete job -l jobgroup=jobexample
Using Jobs in real workloads
In a real use case, each Job performs some substantial computation, such as rendering a frame of a movie, or processing a range of rows in a database. If you were rendering a movie you would set $ITEM to the frame number. If you were processing rows from a database table, you would set $ITEM to represent the range of database rows to process.
In the task, you ran a command to collect the output from Pods by fetching their logs. In a real use case, each Pod for a Job writes its output to durable storage before completing. You can use a PersistentVolume for each Job, or an external storage service. For example, if you are rendering frames for a movie, use HTTP to PUT the rendered frame data to a URL, using a different URL for each frame.
Labels on Jobs and Pods
After you create a Job, Kubernetes automatically adds additional labels that distinguish one Job's pods from another Job's pods.
In this example, each Job and its Pod template have a label: jobgroup=jobexample.
Kubernetes itself pays no attention to labels named jobgroup. Setting a label for all the Jobs you create from a template makes it convenient to operate on all those Jobs at once.
In the first example you used a template to
create several Jobs. The template ensures that each Pod also gets the same label, so
you can check on all Pods for these templated Jobs with a single command.
jobgroup is not special or reserved. You can pick your own labelling scheme. There are recommended labels that you can use if you wish.
Alternatives
If you plan to create a large number of Job objects, you may find that:
- Even using labels, managing so many Jobs is cumbersome.
- If you create many Jobs in a batch, you might place high load on the Kubernetes control plane. Alternatively, the Kubernetes API server could rate limit you, temporarily rejecting your requests with a 429 status.
- You are limited by a resource quota on Jobs: the API server permanently rejects some of your requests when you create a great deal of work in one batch.
There are other job patterns that you can use to process large amounts of work without creating very many Job objects.
You could also consider writing your own controller to manage Job objects automatically.
4.9 - Access Applications in a Cluster
4.9.1 - Deploy and Access the Kubernetes Dashboard
Dashboard is a web-based Kubernetes user interface. You can use Dashboard to deploy containerized applications to a Kubernetes cluster, troubleshoot your containerized application, and manage the cluster resources. You can use Dashboard to get an overview of applications running on your cluster, as well as for creating or modifying individual Kubernetes resources (such as Deployments, Jobs, DaemonSets, etc). For example, you can scale a Deployment, initiate a rolling update, restart a pod or deploy new applications using a deploy wizard.
Dashboard also provides information on the state of Kubernetes resources in your cluster and on any errors that may have occurred.
Deploying the Dashboard UI
The Dashboard UI is not deployed by default. To deploy it, run the following command:
kubectl apply -f https://raw.githubusercontent.com/kubernetes/dashboard/v2.5.0/aio/deploy/recommended.yaml
Accessing the Dashboard UI
To protect your cluster data, Dashboard deploys with a minimal RBAC configuration by default. Currently, Dashboard only supports logging in with a Bearer Token. To create a token for this demo, you can follow our guide on creating a sample user.
Command line proxy
You can enable access to the Dashboard using the kubectl command-line tool, by running the following command:
kubectl proxy
Kubectl will make Dashboard available at http://localhost:8001/api/v1/namespaces/kubernetes-dashboard/services/https:kubernetes-dashboard:/proxy/.
The UI can only be accessed from the machine where the command is executed. See kubectl proxy --help for more options.
Welcome view
When you access Dashboard on an empty cluster, you'll see the welcome page.
This page contains a link to this document as well as a button to deploy your first application.
In addition, you can view which system applications are running by default in the kube-system namespace of your cluster, for example the Dashboard itself.
Deploying containerized applications
Dashboard lets you create and deploy a containerized application as a Deployment and optional Service with a simple wizard. You can either manually specify application details, or upload a YAML or JSON manifest file containing application configuration.
Click the CREATE button in the upper right corner of any page to begin.
Specifying application details
The deploy wizard expects that you provide the following information:
-
App name (mandatory): Name for your application. A label with the name will be added to the Deployment and Service, if any, that will be deployed.
The application name must be unique within the selected Kubernetes namespace. It must start with a lowercase character, and end with a lowercase character or a number, and contain only lowercase letters, numbers and dashes (-). It is limited to 24 characters. Leading and trailing spaces are ignored.
-
Container image (mandatory): The URL of a public Docker container image on any registry, or a private image (commonly hosted on the Google Container Registry or Docker Hub). The container image specification must end with a colon.
-
Number of pods (mandatory): The target number of Pods you want your application to be deployed in. The value must be a positive integer.
A Deployment will be created to maintain the desired number of Pods across your cluster.
-
Service (optional): For some parts of your application (e.g. frontends) you may want to expose a Service onto an external, maybe public IP address outside of your cluster (external Service).
Note: For external Services, you may need to open up one or more ports to do so.Other Services that are only visible from inside the cluster are called internal Services.
Irrespective of the Service type, if you choose to create a Service and your container listens on a port (incoming), you need to specify two ports. The Service will be created mapping the port (incoming) to the target port seen by the container. This Service will route to your deployed Pods. Supported protocols are TCP and UDP. The internal DNS name for this Service will be the value you specified as application name above.
If needed, you can expand the Advanced options section where you can specify more settings:
- Description: The text you enter here will be added as an annotation to the Deployment and displayed in the application's details.
- Labels: Default labels to be used for your application are application name and version. You can specify additional labels to be applied to the Deployment, Service (if any), and Pods, such as release, environment, tier, partition, and release track.
Example:
release=1.0 tier=frontend environment=prod track=stable
- Namespace: Kubernetes supports multiple virtual clusters backed by the same physical cluster. These virtual clusters are called namespaces. They let you partition resources into logically named groups.
Dashboard offers all available namespaces in a dropdown list, and allows you to create a new namespace. The namespace name may contain a maximum of 63 alphanumeric characters and dashes (-) but cannot contain capital letters. Namespace names should not consist of only numbers. If the name is set as a number, such as 10, the pod will be put in the default namespace.
If the creation of the namespace is successful, it is selected by default. If the creation fails, the first namespace is selected.
- Image Pull Secret: If the specified Docker container image is private, it may require pull secret credentials.
Dashboard offers all available secrets in a dropdown list, and allows you to create a new secret. The secret name must follow the DNS domain name syntax, for example new.image-pull.secret. The content of a secret must be base64-encoded and specified in a .dockercfg file. The secret name may consist of a maximum of 253 characters.
If the creation of the image pull secret is successful, it is selected by default. If the creation fails, no secret is applied.
- CPU requirement (cores) and Memory requirement (MiB): You can specify the minimum resource limits for the container. By default, Pods run with unbounded CPU and memory limits.
- Run command and Run command arguments: By default, your containers run the specified Docker image's default entrypoint command. You can use the command options and arguments to override the default.
- Run as privileged: This setting determines whether processes in privileged containers are equivalent to processes running as root on the host. Privileged containers can make use of capabilities like manipulating the network stack and accessing devices.
- Environment variables: Kubernetes exposes Services through environment variables. You can compose environment variables or pass arguments to your commands using the values of environment variables. They can be used in applications to find a Service. Values can reference other variables using the $(VAR_NAME) syntax.
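The wizard is UI-driven, but as a minimal sketch of the equivalent manifest behavior (the names env-demo, GREETING, and MESSAGE are hypothetical), the following Pod defines one variable in terms of another:
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: env-demo
spec:
  restartPolicy: Never
  containers:
  - name: demo
    image: busybox
    # Kubernetes expands $(GREETING) and $(MESSAGE) itself; the quoted
    # heredoc ('EOF') keeps your shell from expanding them first
    command: ["sh", "-c", "echo $(MESSAGE)"]
    env:
    - name: GREETING
      value: "hello"
    - name: MESSAGE
      value: "$(GREETING) world"
EOF
The container prints "hello world", because $(GREETING) is resolved before the container starts.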
Uploading a YAML or JSON file
Kubernetes supports declarative configuration. In this style, all configuration is stored in manifests (YAML or JSON configuration files). The manifests use Kubernetes API resource schemas.
As an alternative to specifying application details in the deploy wizard, you can define your application in one or more manifests, and upload the files using Dashboard.
Using Dashboard
The following sections describe views of the Kubernetes Dashboard UI: what they provide and how they can be used.
Navigation
When there are Kubernetes objects defined in the cluster, Dashboard shows them in the initial view. By default, only objects from the default namespace are shown; you can change this using the namespace selector located in the navigation menu.
Dashboard shows most Kubernetes object kinds and groups them in a few menu categories.
Admin overview
For cluster and namespace administrators, Dashboard lists Nodes, Namespaces and PersistentVolumes and has detail views for them. Node list view contains CPU and memory usage metrics aggregated across all Nodes. The details view shows the metrics for a Node, its specification, status, allocated resources, events and pods running on the node.
Workloads
Shows all applications running in the selected namespace. The view lists applications by workload kind (for example: Deployments, ReplicaSets, StatefulSets). Each workload kind can be viewed separately. The lists summarize actionable information about the workloads, such as the number of ready pods for a ReplicaSet or current memory usage for a Pod.
Detail views for workloads show status and specification information and surface relationships between objects. For example, the Pods that a ReplicaSet is controlling, or the new ReplicaSets and HorizontalPodAutoscalers for a Deployment.
Services
Shows Kubernetes resources that allow for exposing services to the external world and discovering them within a cluster. For that reason, Service and Ingress views show Pods targeted by them, internal endpoints for cluster connections and external endpoints for external users.
Storage
Storage view shows PersistentVolumeClaim resources which are used by applications for storing data.
ConfigMaps and Secrets
Shows all Kubernetes resources that are used for live configuration of applications running in clusters. The view allows for editing and managing config objects, and displays secret values, which are hidden by default.
Logs viewer
Pod lists and detail pages link to a logs viewer that is built into Dashboard. The viewer allows for drilling down into logs from containers belonging to a single Pod.
What's next
For more information, see the Kubernetes Dashboard project page.
4.9.2 - Accessing Clusters
This topic discusses multiple ways to interact with clusters.
Accessing for the first time with kubectl
When accessing the Kubernetes API for the first time, we suggest using the Kubernetes CLI, kubectl.
To access a cluster, you need to know the location of the cluster and have credentials to access it. Typically, this is automatically set up when you work through a Getting started guide, or someone else set up the cluster and provided you with credentials and a location.
Check the location and credentials that kubectl knows about with this command:
kubectl config view
Many of the examples provide an introduction to using kubectl, and complete documentation is found in the kubectl reference.
Directly accessing the REST API
Kubectl handles locating and authenticating to the apiserver. If you want to directly access the REST API with an http client like curl or wget, or a browser, there are several ways to locate and authenticate:
- Run kubectl in proxy mode.
- Recommended approach.
- Uses stored apiserver location.
- Verifies identity of apiserver using self-signed cert. No MITM possible.
- Authenticates to apiserver.
- In future, may do intelligent client-side load-balancing and failover.
- Provide the location and credentials directly to the http client.
- Alternate approach.
- Works with some types of client code that are confused by using a proxy.
- Need to import a root cert into your browser to protect against MITM.
Using kubectl proxy
The following command runs kubectl in a mode where it acts as a reverse proxy. It handles locating the apiserver and authenticating. Run it like this:
kubectl proxy --port=8080
See kubectl proxy for more details.
Then you can explore the API with curl, wget, or a browser, replacing localhost with [::1] for IPv6, like so:
curl http://localhost:8080/api/
The output is similar to this:
{
  "kind": "APIVersions",
  "versions": [
    "v1"
  ],
  "serverAddressByClientCIDRs": [
    {
      "clientCIDR": "0.0.0.0/0",
      "serverAddress": "10.0.1.149:443"
    }
  ]
}
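From here you can explore further down the API tree; for example, still assuming the proxy is listening on port 8080, this lists the Pods in the default namespace:
curl http://localhost:8080/api/v1/namespaces/default/pods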
Without kubectl proxy
Use kubectl apply and kubectl describe secret... with grep/cut to create a token for the default service account.
First, create the Secret, requesting a token for the default ServiceAccount:
kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: default-token
  annotations:
    kubernetes.io/service-account.name: default
type: kubernetes.io/service-account-token
EOF
Next, wait for the token controller to populate the Secret with a token:
while ! kubectl describe secret default-token | grep -E '^token' >/dev/null; do
echo "waiting for token..." >&2
sleep 1
done
Capture and use the generated token:
APISERVER=$(kubectl config view --minify | grep server | cut -f 2- -d ":" | tr -d " ")
TOKEN=$(kubectl describe secret default-token | grep -E '^token' | cut -f2 -d':' | tr -d " ")
curl $APISERVER/api --header "Authorization: Bearer $TOKEN" --insecure
The output is similar to this:
{
  "kind": "APIVersions",
  "versions": [
    "v1"
  ],
  "serverAddressByClientCIDRs": [
    {
      "clientCIDR": "0.0.0.0/0",
      "serverAddress": "10.0.1.149:443"
    }
  ]
}
Using jsonpath:
APISERVER=$(kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}')
TOKEN=$(kubectl get secret default-token -o jsonpath='{.data.token}' | base64 --decode)
curl $APISERVER/api --header "Authorization: Bearer $TOKEN" --insecure
The output is similar to this:
{
  "kind": "APIVersions",
  "versions": [
    "v1"
  ],
  "serverAddressByClientCIDRs": [
    {
      "clientCIDR": "0.0.0.0/0",
      "serverAddress": "10.0.1.149:443"
    }
  ]
}
The above examples use the --insecure flag. This leaves the connection subject to MITM attacks. When kubectl accesses the cluster, it uses a stored root certificate and client certificates to access the server (these are installed in the ~/.kube directory). Since cluster certificates are typically self-signed, it may take special configuration to get your http client to use the root certificate.
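As a sketch of a safer alternative to --insecure, you can extract the cluster CA certificate from your kubeconfig and hand it to curl; this assumes your kubeconfig embeds the CA data rather than referencing a file, and reuses the APISERVER and TOKEN variables from the examples above:
kubectl config view --raw --minify -o jsonpath='{.clusters[0].cluster.certificate-authority-data}' | base64 --decode > ca.crt
curl --cacert ca.crt $APISERVER/api --header "Authorization: Bearer $TOKEN"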
On some clusters, the apiserver does not require authentication; it may serve on localhost, or be protected by a firewall. There is not a standard for this. Controlling Access to the API describes how a cluster admin can configure this.
Programmatic access to the API
Kubernetes officially supports Go and Python client libraries.
Go client
- To get the library, run the following command: go get k8s.io/client-go@kubernetes-<kubernetes-version-number>. See INSTALL.md for detailed installation instructions, and see https://github.com/kubernetes/client-go to see which versions are supported.
- Write an application atop of the client-go clients. Note that client-go defines its own API objects, so if needed, import API definitions from client-go rather than from the main repository; for example, import "k8s.io/client-go/kubernetes" is correct.
The Go client can use the same kubeconfig file as the kubectl CLI does to locate and authenticate to the apiserver. See this example.
If the application is deployed as a Pod in the cluster, please refer to the next section.
Python client
To use the Python client, run the following command: pip install kubernetes. See the Python Client Library page for more installation options.
The Python client can use the same kubeconfig file as the kubectl CLI does to locate and authenticate to the apiserver. See this example.
Other languages
There are client libraries for accessing the API from other languages. See documentation for other libraries for how they authenticate.
Accessing the API from a Pod
When accessing the API from a pod, locating and authenticating to the API server are somewhat different.
Please check Accessing the API from within a Pod for more details.
Accessing services running on the cluster
The previous section describes how to connect to the Kubernetes API server. For information about connecting to other services running on a Kubernetes cluster, see Access Cluster Services.
Requesting redirects
The redirect capabilities have been deprecated and removed. Please use a proxy (see below) instead.
So Many Proxies
There are several different proxies you may encounter when using Kubernetes:
- The kubectl proxy:
  - runs on a user's desktop or in a pod
  - proxies from a localhost address to the Kubernetes apiserver
  - client to proxy uses HTTP
  - proxy to apiserver uses HTTPS
  - locates apiserver
  - adds authentication headers
- The apiserver proxy:
  - is a bastion built into the apiserver
  - connects a user outside of the cluster to cluster IPs which otherwise might not be reachable
  - runs in the apiserver processes
  - client to proxy uses HTTPS (or HTTP if the apiserver is so configured)
  - proxy to target may use HTTP or HTTPS as chosen by the proxy using available information
  - can be used to reach a Node, Pod, or Service
  - does load balancing when used to reach a Service
- The kube proxy:
  - runs on each node
  - proxies UDP and TCP
  - does not understand HTTP
  - provides load balancing
  - is only used to reach services
- A Proxy/Load-balancer in front of apiserver(s):
  - existence and implementation varies from cluster to cluster (e.g. nginx)
  - sits between all clients and one or more apiservers
  - acts as a load balancer if there are several apiservers
- Cloud Load Balancers on external services:
  - are provided by some cloud providers (e.g. AWS ELB, Google Cloud Load Balancer)
  - are created automatically when the Kubernetes service has type LoadBalancer
  - use UDP/TCP only
  - implementation varies by cloud provider
Kubernetes users will typically not need to worry about anything other than the first two types. The cluster admin will typically ensure that the latter types are set up correctly.
4.9.3 - Configure Access to Multiple Clusters
This page shows how to configure access to multiple clusters by using
configuration files. After your clusters, users, and contexts are defined in
one or more configuration files, you can quickly switch between clusters by using the
kubectl config use-context
command.
Note: A file that is used to configure access to a cluster is sometimes called a kubeconfig file. This is a generic way of referring to configuration files. It does not mean that there is a file named kubeconfig.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
To check that kubectl is installed,
run kubectl version --client
. The kubectl version should be
within one minor version of your
cluster's API server.
Define clusters, users, and contexts
Suppose you have two clusters, one for development work and one for scratch work.
In the development
cluster, your frontend developers work in a namespace called frontend
,
and your storage developers work in a namespace called storage
. In your scratch
cluster,
developers work in the default namespace, or they create auxiliary namespaces as they
see fit. Access to the development cluster requires authentication by certificate. Access
to the scratch cluster requires authentication by username and password.
Create a directory named config-exercise
. In your
config-exercise
directory, create a file named config-demo
with this content:
apiVersion: v1
kind: Config
preferences: {}

clusters:
- cluster:
  name: development
- cluster:
  name: scratch

users:
- name: developer
- name: experimenter

contexts:
- context:
  name: dev-frontend
- context:
  name: dev-storage
- context:
  name: exp-scratch
A configuration file describes clusters, users, and contexts. Your config-demo
file
has the framework to describe two clusters, two users, and three contexts.
Go to your config-exercise
directory. Enter these commands to add cluster details to
your configuration file:
kubectl config --kubeconfig=config-demo set-cluster development --server=https://1.2.3.4 --certificate-authority=fake-ca-file
kubectl config --kubeconfig=config-demo set-cluster scratch --server=https://5.6.7.8 --insecure-skip-tls-verify
Add user details to your configuration file:
kubectl config --kubeconfig=config-demo set-credentials developer --client-certificate=fake-cert-file --client-key=fake-key-file
kubectl config --kubeconfig=config-demo set-credentials experimenter --username=exp --password=some-password
Note:
- To delete a user, you can run kubectl --kubeconfig=config-demo config unset users.<name>
- To remove a cluster, you can run kubectl --kubeconfig=config-demo config unset clusters.<name>
- To remove a context, you can run kubectl --kubeconfig=config-demo config unset contexts.<name>
Add context details to your configuration file:
kubectl config --kubeconfig=config-demo set-context dev-frontend --cluster=development --namespace=frontend --user=developer
kubectl config --kubeconfig=config-demo set-context dev-storage --cluster=development --namespace=storage --user=developer
kubectl config --kubeconfig=config-demo set-context exp-scratch --cluster=scratch --namespace=default --user=experimenter
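As an optional check, you can list the contexts you just defined:
kubectl config --kubeconfig=config-demo get-contexts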
Open your config-demo
file to see the added details. As an alternative to opening the
config-demo
file, you can use the config view
command.
kubectl config --kubeconfig=config-demo view
The output shows the two clusters, two users, and three contexts:
apiVersion: v1
clusters:
- cluster:
    certificate-authority: fake-ca-file
    server: https://1.2.3.4
  name: development
- cluster:
    insecure-skip-tls-verify: true
    server: https://5.6.7.8
  name: scratch
contexts:
- context:
    cluster: development
    namespace: frontend
    user: developer
  name: dev-frontend
- context:
    cluster: development
    namespace: storage
    user: developer
  name: dev-storage
- context:
    cluster: scratch
    namespace: default
    user: experimenter
  name: exp-scratch
current-context: ""
kind: Config
preferences: {}
users:
- name: developer
  user:
    client-certificate: fake-cert-file
    client-key: fake-key-file
- name: experimenter
  user:
    password: some-password
    username: exp
The fake-ca-file
, fake-cert-file
and fake-key-file
above are the placeholders
for the pathnames of the certificate files. You need to change these to the actual pathnames
of certificate files in your environment.
Sometimes you may want to use Base64-encoded data embedded here instead of separate
certificate files; in that case you need to add the suffix -data
to the keys, for example,
certificate-authority-data
, client-certificate-data
, client-key-data
.
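kubectl can produce the embedded, Base64-encoded form for you: pass --embed-certs=true when setting a cluster or credential entry. A sketch, reusing the placeholder file names above:
kubectl config --kubeconfig=config-demo set-cluster development --server=https://1.2.3.4 --certificate-authority=fake-ca-file --embed-certs=true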
Each context is a triple (cluster, user, namespace). For example, the
dev-frontend
context says, "Use the credentials of the developer
user to access the frontend
namespace of the development
cluster".
Set the current context:
kubectl config --kubeconfig=config-demo use-context dev-frontend
Now whenever you enter a kubectl command, the action will apply to the cluster and namespace listed in the dev-frontend context. And the command will use the credentials of the user listed in the dev-frontend context.
To see only the configuration information associated with
the current context, use the --minify
flag.
kubectl config --kubeconfig=config-demo view --minify
The output shows configuration information associated with the dev-frontend
context:
apiVersion: v1
clusters:
- cluster:
    certificate-authority: fake-ca-file
    server: https://1.2.3.4
  name: development
contexts:
- context:
    cluster: development
    namespace: frontend
    user: developer
  name: dev-frontend
current-context: dev-frontend
kind: Config
preferences: {}
users:
- name: developer
  user:
    client-certificate: fake-cert-file
    client-key: fake-key-file
Now suppose you want to work for a while in the scratch cluster.
Change the current context to exp-scratch
:
kubectl config --kubeconfig=config-demo use-context exp-scratch
Now any kubectl
command you give will apply to the default namespace of
the scratch
cluster. And the command will use the credentials of the user
listed in the exp-scratch
context.
View configuration associated with the new current context, exp-scratch
.
kubectl config --kubeconfig=config-demo view --minify
Finally, suppose you want to work for a while in the storage
namespace of the
development
cluster.
Change the current context to dev-storage
:
kubectl config --kubeconfig=config-demo use-context dev-storage
View configuration associated with the new current context, dev-storage
.
kubectl config --kubeconfig=config-demo view --minify
Create a second configuration file
In your config-exercise
directory, create a file named config-demo-2
with this content:
apiVersion: v1
kind: Config
preferences: {}
contexts:
- context:
    cluster: development
    namespace: ramp
    user: developer
  name: dev-ramp-up
The preceding configuration file defines a new context named dev-ramp-up
.
Set the KUBECONFIG environment variable
See whether you have an environment variable named KUBECONFIG
. If so, save the
current value of your KUBECONFIG
environment variable, so you can restore it later.
For example:
Linux
export KUBECONFIG_SAVED=$KUBECONFIG
Windows PowerShell
$Env:KUBECONFIG_SAVED=$ENV:KUBECONFIG
The KUBECONFIG
environment variable is a list of paths to configuration files. The list is
colon-delimited for Linux and Mac, and semicolon-delimited for Windows. If you have
a KUBECONFIG
environment variable, familiarize yourself with the configuration files
in the list.
Temporarily append two paths to your KUBECONFIG
environment variable. For example:
Linux
export KUBECONFIG=$KUBECONFIG:config-demo:config-demo-2
Windows PowerShell
$Env:KUBECONFIG=("config-demo;config-demo-2")
In your config-exercise
directory, enter this command:
kubectl config view
The output shows merged information from all the files listed in your KUBECONFIG
environment variable. In particular, notice that the merged information has the
dev-ramp-up
context from the config-demo-2
file and the three contexts from
the config-demo
file:
contexts:
- context:
    cluster: development
    namespace: frontend
    user: developer
  name: dev-frontend
- context:
    cluster: development
    namespace: ramp
    user: developer
  name: dev-ramp-up
- context:
    cluster: development
    namespace: storage
    user: developer
  name: dev-storage
- context:
    cluster: scratch
    namespace: default
    user: experimenter
  name: exp-scratch
For more information about how kubeconfig files are merged, see Organizing Cluster Access Using kubeconfig Files.
Explore the $HOME/.kube directory
If you already have a cluster, and you can use kubectl
to interact with
the cluster, then you probably have a file named config
in the $HOME/.kube
directory.
Go to $HOME/.kube
, and see what files are there. Typically, there is a file named
config
. There might also be other configuration files in this directory. Briefly
familiarize yourself with the contents of these files.
Append $HOME/.kube/config to your KUBECONFIG environment variable
If you have a $HOME/.kube/config
file, and it's not already listed in your
KUBECONFIG
environment variable, append it to your KUBECONFIG
environment variable now.
For example:
Linux
export KUBECONFIG=$KUBECONFIG:$HOME/.kube/config
Windows PowerShell
$Env:KUBECONFIG="$Env:KUBECONFIG;$HOME\.kube\config"
View configuration information merged from all the files that are now listed
in your KUBECONFIG
environment variable. In your config-exercise directory, enter:
kubectl config view
Clean up
Return your KUBECONFIG
environment variable to its original value. For example:
Linux
export KUBECONFIG=$KUBECONFIG_SAVED
Windows PowerShell
$Env:KUBECONFIG=$ENV:KUBECONFIG_SAVED
What's next
4.9.4 - Use Port Forwarding to Access Applications in a Cluster
This page shows how to use kubectl port-forward
to connect to a MongoDB
server running in a Kubernetes cluster. This type of connection can be useful
for database debugging.
Before you begin
- You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
Your Kubernetes server must be at or later than version v1.10. To check the version, enter kubectl version.
- Install MongoDB Shell.
Creating MongoDB deployment and service
- Create a Deployment that runs MongoDB:
kubectl apply -f https://k8s.io/examples/application/mongodb/mongo-deployment.yaml
The output of a successful command verifies that the deployment was created:
deployment.apps/mongo created
View the pod status to check that it is ready:
kubectl get pods
The output displays the pod created:
NAME                     READY   STATUS    RESTARTS   AGE
mongo-75f59d57f4-4nd6q   1/1     Running   0          2m4s
View the Deployment's status:
kubectl get deployment
The output displays that the Deployment was created:
NAME    READY   UP-TO-DATE   AVAILABLE   AGE
mongo   1/1     1            1           2m21s
The Deployment automatically manages a ReplicaSet. View the ReplicaSet status using:
kubectl get replicaset
The output displays that the ReplicaSet was created:
NAME               DESIRED   CURRENT   READY   AGE
mongo-75f59d57f4   1         1         1       3m12s
- Create a Service to expose MongoDB on the network:
kubectl apply -f https://k8s.io/examples/application/mongodb/mongo-service.yaml
The output of a successful command verifies that the Service was created:
service/mongo created
Check the Service created:
kubectl get service mongo
The output displays the service created:
NAME    TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)     AGE
mongo   ClusterIP   10.96.41.183   <none>        27017/TCP   11s
- Verify that the MongoDB server is running in the Pod, and listening on port 27017:
# Change mongo-75f59d57f4-4nd6q to the name of the Pod
kubectl get pod mongo-75f59d57f4-4nd6q --template='{{(index (index .spec.containers 0).ports 0).containerPort}}{{"\n"}}'
The output displays the port for MongoDB in that Pod:
27017
(this is the well-known TCP port assigned to MongoDB).
Forward a local port to a port on the Pod
- kubectl port-forward allows using a resource name, such as a pod name, to select a matching pod to port forward to.
# Change mongo-75f59d57f4-4nd6q to the name of the Pod
kubectl port-forward mongo-75f59d57f4-4nd6q 28015:27017
which is the same as
kubectl port-forward pods/mongo-75f59d57f4-4nd6q 28015:27017
or
kubectl port-forward deployment/mongo 28015:27017
or
kubectl port-forward replicaset/mongo-75f59d57f4 28015:27017
or
kubectl port-forward service/mongo 28015:27017
Any of the above commands works. The output is similar to this:
Forwarding from 127.0.0.1:28015 -> 27017
Forwarding from [::1]:28015 -> 27017
Note: kubectl port-forward does not return. To continue with the exercises, you will need to open another terminal.
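Alternatively, as a sketch using ordinary shell job control (not required by this tutorial), you can run the forwarder in the background and stop it when you are done:
kubectl port-forward deployment/mongo 28015:27017 &
# ...use the tunnel, then stop the background forwarder:
kill %1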
- Start the MongoDB command line interface:
mongosh --port 28015
- At the MongoDB command line prompt, enter the ping command:
db.runCommand( { ping: 1 } )
A successful ping request returns:
{ ok: 1 }
Optionally let kubectl choose the local port
If you don't need a specific local port, you can let kubectl
choose and allocate
the local port and thus relieve you from having to manage local port conflicts, with
the slightly simpler syntax:
kubectl port-forward deployment/mongo :27017
The kubectl tool finds a local port number that is not in use (avoiding low port numbers, because these might be used by other applications). The output is similar to:
Forwarding from 127.0.0.1:63753 -> 27017
Forwarding from [::1]:63753 -> 27017
Discussion
Connections made to local port 28015 are forwarded to port 27017 of the Pod that is running the MongoDB server. With this connection in place, you can use your local workstation to debug the database that is running in the Pod.
kubectl port-forward
is implemented for TCP ports only.
The support for UDP protocol is tracked in
issue 47862.
What's next
Learn more about kubectl port-forward.
4.9.5 - Use a Service to Access an Application in a Cluster
This page shows how to create a Kubernetes Service object that external clients can use to access an application running in a cluster. The Service provides load balancing for an application that has two running instances.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
To check the version, enter kubectl version.
Objectives
- Run two instances of a Hello World application.
- Create a Service object that exposes a node port.
- Use the Service object to access the running application.
Creating a service for an application running in two pods
Here is the configuration file for the application Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-world
spec:
  selector:
    matchLabels:
      run: load-balancer-example
  replicas: 2
  template:
    metadata:
      labels:
        run: load-balancer-example
    spec:
      containers:
        - name: hello-world
          image: gcr.io/google-samples/node-hello:1.0
          ports:
            - containerPort: 8080
              protocol: TCP
- Run a Hello World application in your cluster: Create the application Deployment using the file above:
kubectl apply -f https://k8s.io/examples/service/access/hello-application.yaml
The preceding command creates a Deployment and an associated ReplicaSet. The ReplicaSet has two Pods each of which runs the Hello World application.
- Display information about the Deployment:
kubectl get deployments hello-world
kubectl describe deployments hello-world
- Display information about your ReplicaSet objects:
kubectl get replicasets
kubectl describe replicasets
- Create a Service object that exposes the deployment:
kubectl expose deployment hello-world --type=NodePort --name=example-service
- Display information about the Service:
kubectl describe services example-service
The output is similar to this:
Name:                   example-service
Namespace:              default
Labels:                 run=load-balancer-example
Annotations:            <none>
Selector:               run=load-balancer-example
Type:                   NodePort
IP:                     10.32.0.16
Port:                   <unset>  8080/TCP
TargetPort:             8080/TCP
NodePort:               <unset>  31496/TCP
Endpoints:              10.200.1.4:8080,10.200.2.5:8080
Session Affinity:       None
Events:                 <none>
Make a note of the NodePort value for the service. For example, in the preceding output, the NodePort value is 31496.
- List the pods that are running the Hello World application:
kubectl get pods --selector="run=load-balancer-example" --output=wide
The output is similar to this:
NAME                           READY   STATUS    ...   IP           NODE
hello-world-2895499144-bsbk5   1/1     Running   ...   10.200.1.4   worker1
hello-world-2895499144-m1pwt   1/1     Running   ...   10.200.2.5   worker2
- Get the public IP address of one of your nodes that is running a Hello World pod. How you get this address depends on how you set up your cluster. For example, if you are using Minikube, you can see the node address by running kubectl cluster-info. If you are using Google Compute Engine instances, you can use the gcloud compute instances list command to see the public addresses of your nodes.
- On your chosen node, create a firewall rule that allows TCP traffic on your node port. For example, if your Service has a NodePort value of 31568, create a firewall rule that allows TCP traffic on port 31568. Different cloud providers offer different ways of configuring firewall rules.
- Use the node address and node port to access the Hello World application:
curl http://<public-node-ip>:<node-port>
where <public-node-ip> is the public IP address of your node, and <node-port> is the NodePort value for your service. The response to a successful request is a hello message:
Hello Kubernetes!
Using a service configuration file
As an alternative to using kubectl expose
, you can use a
service configuration file
to create a Service.
Cleaning up
To delete the Service, enter this command:
kubectl delete services example-service
To delete the Deployment, the ReplicaSet, and the Pods that are running the Hello World application, enter this command:
kubectl delete deployment hello-world
What's next
Learn more about connecting applications with services.
4.9.6 - Connect a Frontend to a Backend Using Services
This task shows how to create a frontend and a backend microservice. The backend microservice is a hello greeter. The frontend exposes the backend using nginx and a Kubernetes Service object.
Objectives
- Create and run a sample hello backend microservice using a Deployment object.
- Use a Service object to send traffic to the backend microservice's multiple replicas.
- Create and run a nginx frontend microservice, also using a Deployment object.
- Configure the frontend microservice to send traffic to the backend microservice.
- Use a Service object of type=LoadBalancer to expose the frontend microservice outside the cluster.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
To check the version, enter kubectl version.
This task uses Services with external load balancers, which require a supported environment. If your environment does not support this, you can use a Service of type NodePort instead.
Creating the backend using a Deployment
The backend is a simple hello greeter microservice. Here is the configuration file for the backend Deployment:
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend
spec:
  selector:
    matchLabels:
      app: hello
      tier: backend
      track: stable
  replicas: 3
  template:
    metadata:
      labels:
        app: hello
        tier: backend
        track: stable
    spec:
      containers:
        - name: hello
          image: "gcr.io/google-samples/hello-go-gke:1.0"
          ports:
            - name: http
              containerPort: 80
...
Create the backend Deployment:
kubectl apply -f https://k8s.io/examples/service/access/backend-deployment.yaml
View information about the backend Deployment:
kubectl describe deployment backend
The output is similar to this:
Name: backend
Namespace: default
CreationTimestamp: Mon, 24 Oct 2016 14:21:02 -0700
Labels: app=hello
tier=backend
track=stable
Annotations: deployment.kubernetes.io/revision=1
Selector: app=hello,tier=backend,track=stable
Replicas: 3 desired | 3 updated | 3 total | 3 available | 0 unavailable
StrategyType: RollingUpdate
MinReadySeconds: 0
RollingUpdateStrategy: 1 max unavailable, 1 max surge
Pod Template:
Labels: app=hello
tier=backend
track=stable
Containers:
hello:
Image: "gcr.io/google-samples/hello-go-gke:1.0"
Port: 80/TCP
Environment: <none>
Mounts: <none>
Volumes: <none>
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing True NewReplicaSetAvailable
OldReplicaSets: <none>
NewReplicaSet: hello-3621623197 (3/3 replicas created)
Events:
...
Creating the hello Service object
The key to sending requests from a frontend to a backend is the backend Service. A Service creates a persistent IP address and DNS name entry so that the backend microservice can always be reached. A Service uses selectors to find the Pods that it routes traffic to.
First, explore the Service configuration file:
---
apiVersion: v1
kind: Service
metadata:
  name: hello
spec:
  selector:
    app: hello
    tier: backend
  ports:
  - protocol: TCP
    port: 80
    targetPort: http
...
In the configuration file, you can see that the Service, named hello, routes traffic to Pods that have the labels app: hello and tier: backend.
Create the backend Service:
kubectl apply -f https://k8s.io/examples/service/access/backend-service.yaml
At this point, you have a backend
Deployment running three replicas of your hello
application, and you have a Service that can route traffic to them. However, this
service is neither available nor resolvable outside the cluster.
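As an optional sanity check (a sketch, not part of the original task; the Pod name tmp is arbitrary), you can confirm in-cluster DNS resolution and routing from a throwaway Pod:
kubectl run tmp --rm -it --restart=Never --image=busybox:1.36 -- wget -qO- http://hello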
Creating the frontend
Now that you have your backend running, you can create a frontend that is accessible outside the cluster, and connects to the backend by proxying requests to it.
The frontend sends requests to the backend worker Pods by using the DNS name
given to the backend Service. The DNS name is hello
, which is the value
of the name
field in the examples/service/access/backend-service.yaml
configuration file.
The Pods in the frontend Deployment run a nginx image that is configured
to proxy requests to the hello
backend Service. Here is the nginx configuration file:
# The identifier Backend is internal to nginx, and used to name this specific upstream
upstream Backend {
    # hello is the internal DNS name used by the backend Service inside Kubernetes
    server hello;
}

server {
    listen 80;

    location / {
        # The following statement will proxy traffic to the upstream named Backend
        proxy_pass http://Backend;
    }
}
Similar to the backend, the frontend has a Deployment and a Service. An important difference to notice between the backend and frontend services is that the configuration for the frontend Service has type: LoadBalancer, which means that the Service uses a load balancer provisioned by your cloud provider and will be accessible from outside the cluster.
---
apiVersion: v1
kind: Service
metadata:
  name: frontend
spec:
  selector:
    app: hello
    tier: frontend
  ports:
  - protocol: "TCP"
    port: 80
    targetPort: 80
  type: LoadBalancer
...
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
spec:
  selector:
    matchLabels:
      app: hello
      tier: frontend
      track: stable
  replicas: 1
  template:
    metadata:
      labels:
        app: hello
        tier: frontend
        track: stable
    spec:
      containers:
        - name: nginx
          image: "gcr.io/google-samples/hello-frontend:1.0"
          lifecycle:
            preStop:
              exec:
                command: ["/usr/sbin/nginx","-s","quit"]
...
Create the frontend Deployment and Service:
kubectl apply -f https://k8s.io/examples/service/access/frontend-deployment.yaml
kubectl apply -f https://k8s.io/examples/service/access/frontend-service.yaml
The output verifies that both resources were created:
deployment.apps/frontend created
service/frontend created
Interact with the frontend Service
Once you've created a Service of type LoadBalancer, you can use this command to find the external IP:
kubectl get service frontend --watch
This displays the configuration for the frontend
Service and watches for
changes. Initially, the external IP is listed as <pending>
:
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
frontend LoadBalancer 10.51.252.116 <pending> 80/TCP 10s
As soon as an external IP is provisioned, however, the configuration updates
to include the new IP under the EXTERNAL-IP
heading:
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
frontend LoadBalancer 10.51.252.116 XXX.XXX.XXX.XXX 80/TCP 1m
That IP can now be used to interact with the frontend
service from outside the
cluster.
Send traffic through the frontend
The frontend and backend are now connected. You can hit the endpoint by using the curl command on the external IP of your frontend Service.
curl http://${EXTERNAL_IP} # replace this with the EXTERNAL-IP you saw earlier
The output shows the message generated by the backend:
{"message":"Hello"}
Cleaning up
To delete the Services, enter this command:
kubectl delete services frontend backend
To delete the Deployments, the ReplicaSets and the Pods that are running the backend and frontend applications, enter this command:
kubectl delete deployment frontend backend
What's next
- Learn more about Services
- Learn more about ConfigMaps
- Learn more about DNS for Service and Pods
4.9.7 - Create an External Load Balancer
This page shows how to create an external load balancer.
When creating a Service, you have the option of automatically creating a cloud load balancer. This provides an externally-accessible IP address that sends traffic to the correct port on your cluster nodes, provided your cluster runs in a supported environment and is configured with the correct cloud load balancer provider package.
You can also use an Ingress in place of Service. For more information, check the Ingress documentation.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
Your cluster must be running in a cloud or other environment that already has support for configuring external load balancers.
Create a Service
Create a Service from a manifest
To create an external load balancer, add the following line to your Service manifest:
type: LoadBalancer
Your manifest might then look like:
apiVersion: v1
kind: Service
metadata:
  name: example-service
spec:
  selector:
    app: example
  ports:
    - port: 8765
      targetPort: 9376
  type: LoadBalancer
Create a Service using kubectl
You can alternatively create the service with the kubectl expose
command and
its --type=LoadBalancer
flag:
kubectl expose deployment example --port=8765 --target-port=9376 \
--name=example-service --type=LoadBalancer
This command creates a new Service using the same selectors as the referenced
resource (in the case of the example above, a
Deployment named example
).
For more information, including optional flags, refer to the
kubectl expose
reference.
Finding your IP address
You can find the IP address created for your service by getting the service
information through kubectl
:
kubectl describe services example-service
which should produce output similar to:
Name: example-service
Namespace: default
Labels: app=example
Annotations: <none>
Selector: app=example
Type: LoadBalancer
IP Families: <none>
IP: 10.3.22.96
IPs: 10.3.22.96
LoadBalancer Ingress: 192.0.2.89
Port: <unset> 8765/TCP
TargetPort: 9376/TCP
NodePort: <unset> 30593/TCP
Endpoints: 172.17.0.3:9376
Session Affinity: None
External Traffic Policy: Cluster
Events: <none>
The load balancer's IP address is listed next to LoadBalancer Ingress
.
If you are running your service on Minikube, you can find the assigned IP address and port with:
minikube service example-service --url
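For scripting, you can extract just the address with jsonpath; as an assumption about your provider, the field is .ip for IP-based load balancers and .hostname on providers such as AWS:
kubectl get service example-service -o jsonpath='{.status.loadBalancer.ingress[0].ip}'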
Preserving the client source IP
By default, the source IP seen in the target container is not the original
source IP of the client. To enable preservation of the client IP, the following
fields can be configured in the .spec
of the Service:
- .spec.externalTrafficPolicy - denotes if this Service desires to route external traffic to node-local or cluster-wide endpoints. There are two available options: Cluster (default) and Local. Cluster obscures the client source IP and may cause a second hop to another node, but should have good overall load-spreading. Local preserves the client source IP and avoids a second hop for LoadBalancer and NodePort type Services, but risks potentially imbalanced traffic spreading.
- .spec.healthCheckNodePort - specifies the health check node port (numeric port number) for the service. If you don't specify healthCheckNodePort, the service controller allocates a port from your cluster's NodePort range. You can configure that range by setting an API server command line option, --service-node-port-range. The Service will use the user-specified healthCheckNodePort value if you specify it, provided that the Service type is set to LoadBalancer and externalTrafficPolicy is set to Local.
Setting externalTrafficPolicy
to Local in the Service manifest
activates this feature. For example:
apiVersion: v1
kind: Service
metadata:
  name: example-service
spec:
  selector:
    app: example
  ports:
    - port: 8765
      targetPort: 9376
  externalTrafficPolicy: Local
  type: LoadBalancer
Caveats and limitations when preserving source IPs
Load balancing services from some cloud providers do not let you configure different weights for each target.
With each target weighted equally in terms of sending traffic to Nodes, external traffic is not equally load balanced across different Pods. The external load balancer is unaware of the number of Pods on each node that are used as a target.
Where NumServicePods << NumNodes or NumServicePods >> NumNodes, a fairly close-to-equal distribution will be seen, even without weights.
Internal pod to pod traffic should behave similarly to ClusterIP services, with equal probability across all pods.
Garbage collecting load balancers
Kubernetes v1.17 [stable]
In the usual case, the correlating load balancer resources in the cloud provider should be cleaned up soon after a LoadBalancer type Service is deleted. But it is known that there are various corner cases where cloud resources are orphaned after the associated Service is deleted. Finalizer Protection for Service LoadBalancers was introduced to prevent this from happening. By using finalizers, a Service resource will never be deleted until the correlating load balancer resources are also deleted.
Specifically, if a Service has type
LoadBalancer, the service controller will attach
a finalizer named service.kubernetes.io/load-balancer-cleanup
.
The finalizer will only be removed after the load balancer resource is cleaned up.
This prevents dangling load balancer resources even in corner cases such as the
service controller crashing.
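You can observe the finalizer yourself on a Service of type LoadBalancer; for example, using the example-service above:
kubectl get service example-service -o jsonpath='{.metadata.finalizers}'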
External load balancer providers
It is important to note that the datapath for this functionality is provided by a load balancer external to the Kubernetes cluster.
When the Service type
is set to LoadBalancer, Kubernetes provides functionality equivalent to type
equals ClusterIP to pods
within the cluster and extends it by programming the (external to Kubernetes) load balancer with entries for the nodes
hosting the relevant Kubernetes pods. The Kubernetes control plane automates the creation of the external load balancer,
health checks (if needed), and packet filtering rules (if needed). Once the cloud provider allocates an IP address for the load
balancer, the control plane looks up that external IP address and populates it into the Service object.
What's next
- Read about Service
- Read about Ingress
- Read Connecting Applications with Services
4.9.8 - List All Container Images Running in a Cluster
This page shows how to use kubectl to list all of the Container images for Pods running in a cluster.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
To check the version, enter kubectl version.
In this exercise you will use kubectl to fetch all of the Pods running in a cluster, and format the output to pull out the list of Containers for each.
List all Container images in all namespaces
- Fetch all Pods in all namespaces using kubectl get pods --all-namespaces
- Format the output to include only the list of Container image names using -o jsonpath={.items[*].spec.containers[*].image}. This will recursively parse out the image field from the returned json.
  - See the jsonpath reference for further information on how to use jsonpath.
- Format the output using standard tools: tr, sort, uniq
  - Use tr to replace spaces with newlines
  - Use sort to sort the results
  - Use uniq to aggregate image counts
kubectl get pods --all-namespaces -o jsonpath="{.items[*].spec.containers[*].image}" |\
tr -s '[[:space:]]' '\n' |\
sort |\
uniq -c
The above command will recursively return all fields named image
for all items returned.
As an alternative, it is possible to use the absolute path to the image
field within the Pod. This ensures the correct field is retrieved
even when the field name is repeated,
e.g. many fields are called name
within a given item:
kubectl get pods --all-namespaces -o jsonpath="{.items[*].spec.containers[*].image}"
The jsonpath is interpreted as follows:
- .items[*]: for each returned value
- .spec: get the spec
- .containers[*]: for each container
- .image: get the image
Note: When fetching a single Pod, for example kubectl get pod nginx, the .items[*] portion of the path should be omitted because a single Pod is returned instead of a list of items.
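For instance, the single-Pod form of the query looks like this (assuming a Pod named nginx exists):
kubectl get pod nginx -o jsonpath="{.spec.containers[*].image}"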
List Container images by Pod
The formatting can be controlled further by using the range
operation to
iterate over elements individually.
kubectl get pods --all-namespaces -o jsonpath='{range .items[*]}{"\n"}{.metadata.name}{":\t"}{range .spec.containers[*]}{.image}{", "}{end}{end}' |\
sort
List Container images filtering by Pod label
To target only Pods matching a specific label, use the -l flag. The
following matches only Pods with labels matching app=nginx
.
kubectl get pods --all-namespaces -o jsonpath="{.items[*].spec.containers[*].image}" -l app=nginx
List Container images filtering by Pod namespace
To target only pods in a specific namespace, use the namespace flag. The
following matches only Pods in the kube-system
namespace.
kubectl get pods --namespace kube-system -o jsonpath="{.items[*].spec.containers[*].image}"
List Container images using a go-template instead of jsonpath
As an alternative to jsonpath, Kubectl supports using go-templates for formatting the output:
kubectl get pods --all-namespaces -o go-template --template="{{range .items}}{{range .spec.containers}}{{.image}} {{end}}{{end}}"
What's next
Reference
- Jsonpath reference guide
- Go template reference guide
4.9.9 - Set up Ingress on Minikube with the NGINX Ingress Controller
An Ingress is an API object that defines rules which allow external access to services in a cluster. An Ingress controller fulfills the rules set in the Ingress.
This page shows you how to set up a simple Ingress which routes requests to Service web or web2 depending on the HTTP URI.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
Your Kubernetes server must be at or later than version 1.19. To check the version, enter kubectl version.
If you are using an older Kubernetes version, switch to the documentation
for that version.
Create a Minikube cluster
- Using Katacoda
- Locally
If you already installed Minikube locally, run minikube start to create a cluster.
Enable the Ingress controller
- To enable the NGINX Ingress controller, run the following command:
minikube addons enable ingress
- Verify that the NGINX Ingress controller is running:
kubectl get pods -n ingress-nginx
Note: It can take up to a minute before you see these pods running OK.
The output is similar to:
NAME                                        READY   STATUS      RESTARTS   AGE
ingress-nginx-admission-create-g9g49        0/1     Completed   0          11m
ingress-nginx-admission-patch-rqp78         0/1     Completed   1          11m
ingress-nginx-controller-59b45fb494-26npt   1/1     Running     0          11m
kubectl get pods -n kube-system
Note: It can take up to a minute before you see these pods running OK.
The output is similar to:
NAME                                        READY   STATUS    RESTARTS   AGE
default-http-backend-59868b7dd6-xb8tq       1/1     Running   0          1m
kube-addon-manager-minikube                 1/1     Running   0          3m
kube-dns-6dcb57bcc8-n4xd4                   3/3     Running   0          2m
kubernetes-dashboard-5498ccf677-b8p5h       1/1     Running   0          2m
nginx-ingress-controller-5984b97644-rnkrg   1/1     Running   0          1m
storage-provisioner                         1/1     Running   0          2m
Make sure that you see a Pod with a name that starts with nginx-ingress-controller-.
Deploy a hello, world app
- Create a Deployment using the following command:
kubectl create deployment web --image=gcr.io/google-samples/hello-app:1.0
The output should be:
deployment.apps/web created
- Expose the Deployment:
kubectl expose deployment web --type=NodePort --port=8080
The output should be:
service/web exposed
- Verify the Service is created and is available on a node port:
kubectl get service web
The output is similar to:
NAME   TYPE       CLUSTER-IP       EXTERNAL-IP   PORT(S)          AGE
web    NodePort   10.104.133.249   <none>        8080:31637/TCP   12m
- Visit the Service via NodePort:
minikube service web --url
The output is similar to:
http://172.17.0.15:31637
Note: Katacoda environment only: at the top of the terminal panel, click the plus sign, and then click Select port to view on Host 1. Enter the NodePort, in this case 31637, and then click Display Port.
The output is similar to:
Hello, world!
Version: 1.0.0
Hostname: web-55b8c6998d-8k564
You can now access the sample app via the Minikube IP address and NodePort. The next step lets you access the app using the Ingress resource.
Create an Ingress
The following manifest defines an Ingress that sends traffic to your Service via hello-world.info.
- Create example-ingress.yaml from the following file:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-ingress
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /$1
spec:
  rules:
    - host: hello-world.info
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web
                port:
                  number: 8080
- Create the Ingress object by running the following command:
kubectl apply -f https://k8s.io/examples/service/networking/example-ingress.yaml
The output should be:
ingress.networking.k8s.io/example-ingress created
- Verify the IP address is set:
kubectl get ingress
Note: This can take a couple of minutes.
You should see an IPv4 address in the ADDRESS column; for example:
NAME              CLASS    HOSTS              ADDRESS       PORTS   AGE
example-ingress   <none>   hello-world.info   172.17.0.15   80      38s
- Add the following line to the bottom of the /etc/hosts file on your computer (you will need administrator access):
172.17.0.15 hello-world.info
Note: If you are running Minikube locally, use minikube ip to get the external IP. The IP address displayed within the ingress list will be the internal IP.
After you make this change, your web browser sends requests for hello-world.info URLs to Minikube.
- Verify that the Ingress controller is directing traffic:
curl hello-world.info
You should see:
Hello, world!
Version: 1.0.0
Hostname: web-55b8c6998d-8k564
Note: If you are running Minikube locally, you can visit hello-world.info from your browser.
Create a second Deployment
- Create another Deployment using the following command:
kubectl create deployment web2 --image=gcr.io/google-samples/hello-app:2.0
The output should be:
deployment.apps/web2 created
- Expose the second Deployment:
kubectl expose deployment web2 --port=8080 --type=NodePort
The output should be:
service/web2 exposed
Edit the existing Ingress
- Edit the existing example-ingress.yaml manifest, and add the following lines at the end:
- path: /v2
  pathType: Prefix
  backend:
    service:
      name: web2
      port:
        number: 8080
- Apply the changes:
kubectl apply -f example-ingress.yaml
You should see:
ingress.networking/example-ingress configured
Test your Ingress
- Access the 1st version of the Hello World app.
curl hello-world.info
The output is similar to:
Hello, world!
Version: 1.0.0
Hostname: web-55b8c6998d-8k564
- Access the 2nd version of the Hello World app.
curl hello-world.info/v2
The output is similar to:
Hello, world!
Version: 2.0.0
Hostname: web2-75cd47646f-t8cjk
Note: If you are running Minikube locally, you can visit hello-world.info and hello-world.info/v2 from your browser.
What's next
- Read more about Ingress
- Read more about Ingress Controllers
- Read more about Services
4.9.10 - Communicate Between Containers in the Same Pod Using a Shared Volume
This page shows how to use a Volume to communicate between two Containers running in the same Pod. See also how to allow processes to communicate by sharing process namespace between containers.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
To check the version, enter kubectl version.
Creating a Pod that runs two Containers
In this exercise, you create a Pod that runs two Containers. The two containers share a Volume that they can use to communicate. Here is the configuration file for the Pod:
apiVersion: v1
kind: Pod
metadata:
  name: two-containers
spec:
  restartPolicy: Never
  volumes:
    - name: shared-data
      emptyDir: {}
  containers:
    - name: nginx-container
      image: nginx
      volumeMounts:
        - name: shared-data
          mountPath: /usr/share/nginx/html
    - name: debian-container
      image: debian
      volumeMounts:
        - name: shared-data
          mountPath: /pod-data
      command: ["/bin/sh"]
      args: ["-c", "echo Hello from the debian container > /pod-data/index.html"]
In the configuration file, you can see that the Pod has a Volume named
shared-data
.
The first container listed in the configuration file runs an nginx server. The
mount path for the shared Volume is /usr/share/nginx/html
.
The second container is based on the debian image, and has a mount path of
/pod-data
. The second container runs the following command and then terminates.
echo Hello from the debian container > /pod-data/index.html
Notice that the second container writes the index.html
file in the root
directory of the nginx server.
Create the Pod and the two Containers:
kubectl apply -f https://k8s.io/examples/pods/two-container-pod.yaml
View information about the Pod and the Containers:
kubectl get pod two-containers --output=yaml
Here is a portion of the output:
apiVersion: v1
kind: Pod
metadata:
  ...
  name: two-containers
  namespace: default
  ...
spec:
  ...
  containerStatuses:
  - containerID: docker://c1d8abd1 ...
    image: debian
    ...
    lastState:
      terminated:
        ...
    name: debian-container
    ...
  - containerID: docker://96c1ff2c5bb ...
    image: nginx
    ...
    name: nginx-container
    ...
    state:
      running:
        ...
You can see that the debian Container has terminated, and the nginx Container is still running.
Get a shell to the nginx Container:
kubectl exec -it two-containers -c nginx-container -- /bin/bash
In your shell, verify that nginx is running:
root@two-containers:/# apt-get update
root@two-containers:/# apt-get install curl procps
root@two-containers:/# ps aux
The output is similar to this:
USER PID ... STAT START TIME COMMAND
root 1 ... Ss 21:12 0:00 nginx: master process nginx -g daemon off;
Recall that the debian Container created the index.html
file in the nginx root
directory. Use curl
to send a GET request to the nginx server:
root@two-containers:/# curl localhost
The output shows that nginx serves a web page written by the debian container:
Hello from the debian container
Discussion
The primary reason that Pods can have multiple containers is to support helper applications that assist a primary application. Typical examples of helper applications are data pullers, data pushers, and proxies. Helper and primary applications often need to communicate with each other. Typically this is done through a shared filesystem, as shown in this exercise, or through the loopback network interface, localhost. An example of this pattern is a web server along with a helper program that polls a Git repository for new updates.
The Volume in this exercise provides a way for Containers to communicate during the life of the Pod. If the Pod is deleted and recreated, any data stored in the shared Volume is lost.
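When you're finished with the exercise, you can remove the example Pod (a standard cleanup step; the Pod name matches the manifest above):
kubectl delete pod two-containers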
What's next
- Learn more about patterns for composite containers.
- Learn about composite containers for modular architecture.
- See Configure a Pod to share process namespace between containers in a Pod.
- See Volume.
- See Pod.
4.9.11 - Configure DNS for a Cluster
Kubernetes offers a DNS cluster addon, which most of the supported environments enable by default. In Kubernetes version 1.11 and later, CoreDNS is recommended and is installed by default with kubeadm.
For more information on how to configure CoreDNS for a Kubernetes cluster, see Customizing DNS Service. For an example demonstrating how to use Kubernetes DNS with kube-dns, see the Kubernetes DNS sample plugin.
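To verify that the DNS addon is running in your own cluster, you can list its Pods in the kube-system namespace (a sketch; kubeadm-based clusters typically label the CoreDNS Pods k8s-app=kube-dns for backward compatibility, but your cluster's labels may differ):
kubectl get pods --namespace=kube-system -l k8s-app=kube-dns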
4.9.12 - Access Services Running on Clusters
This page shows how to connect to services running on the Kubernetes cluster.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
To check the version, enter kubectl version.
Accessing services running on the cluster
In Kubernetes, nodes, pods and services all have their own IPs. In many cases, the node IPs, pod IPs, and some service IPs on a cluster will not be routable, so they will not be reachable from a machine outside the cluster, such as your desktop machine.
Ways to connect
You have several options for connecting to nodes, pods and services from outside the cluster:
- Access services through public IPs.
  - Use a service with type NodePort or LoadBalancer to make the service reachable outside the cluster. See the services and kubectl expose documentation.
  - Depending on your cluster environment, this may only expose the service to your corporate network, or it may expose it to the internet. Think about whether the service being exposed is secure. Does it do its own authentication?
  - Place pods behind services. To access one specific pod from a set of replicas, such as for debugging, place a unique label on the pod and create a new service which selects this label.
  - In most cases, it should not be necessary for application developers to directly access nodes via their node IPs.
- Access services, nodes, or pods using the Proxy Verb.
  - Does apiserver authentication and authorization prior to accessing the remote service. Use this if the services are not secure enough to expose to the internet, or to gain access to ports on the node IP, or for debugging.
  - Proxies may cause problems for some web applications.
  - Only works for HTTP/HTTPS.
  - Described here.
- Access from a node or pod in the cluster.
  - Run a pod, and then connect to a shell in it using kubectl exec. Connect to other nodes, pods, and services from that shell.
  - Some clusters may allow you to ssh to a node in the cluster. From there you may be able to access cluster services. This is a non-standard method, and will work on some clusters but not others. Browsers and other tools may or may not be installed. Cluster DNS may not work.
Discovering builtin services
Typically, there are several services which are started on a cluster by kube-system. Get a list of these with the kubectl cluster-info command:
kubectl cluster-info
The output is similar to this:
Kubernetes master is running at https://104.197.5.247
elasticsearch-logging is running at https://104.197.5.247/api/v1/namespaces/kube-system/services/elasticsearch-logging/proxy
kibana-logging is running at https://104.197.5.247/api/v1/namespaces/kube-system/services/kibana-logging/proxy
kube-dns is running at https://104.197.5.247/api/v1/namespaces/kube-system/services/kube-dns/proxy
grafana is running at https://104.197.5.247/api/v1/namespaces/kube-system/services/monitoring-grafana/proxy
heapster is running at https://104.197.5.247/api/v1/namespaces/kube-system/services/monitoring-heapster/proxy
This shows the proxy-verb URL for accessing each service.
For example, this cluster has cluster-level logging enabled (using Elasticsearch), which can be reached at https://104.197.5.247/api/v1/namespaces/kube-system/services/elasticsearch-logging/proxy/ if suitable credentials are passed, or through a kubectl proxy at, for example: http://localhost:8080/api/v1/namespaces/kube-system/services/elasticsearch-logging/proxy/.
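To set up such a local proxy and request the same path through it, you might run (a sketch; the port number is arbitrary, and the service URL matches the cluster-info output above):
kubectl proxy --port=8080
curl http://localhost:8080/api/v1/namespaces/kube-system/services/elasticsearch-logging/proxy/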
Manually constructing apiserver proxy URLs
As mentioned above, you use the kubectl cluster-info command to retrieve the service's proxy URL. To create proxy URLs that include service endpoints, suffixes, and parameters, you append to the service's proxy URL:
http://kubernetes_master_address/api/v1/namespaces/namespace_name/services/[https:]service_name[:port_name]/proxy
If you haven't specified a name for your port, you don't have to specify port_name in the URL. You can also use the port number in place of the port_name for both named and unnamed ports.
By default, the API server proxies to your service using HTTP. To use HTTPS, prefix the service name with https::
http://<kubernetes_master_address>/api/v1/namespaces/<namespace_name>/services/<service_name>/proxy
The supported formats for the <service_name> segment of the URL are:
- <service_name> - proxies to the default or unnamed port using http
- <service_name>:<port_name> - proxies to the specified port name or port number using http
- https:<service_name>: - proxies to the default or unnamed port using https (note the trailing colon)
- https:<service_name>:<port_name> - proxies to the specified port name or port number using https
Examples
- To access the Elasticsearch service endpoint _search?q=user:kimchy, you would use:
  http://104.197.5.247/api/v1/namespaces/kube-system/services/elasticsearch-logging/proxy/_search?q=user:kimchy
- To access the Elasticsearch cluster health information _cluster/health?pretty=true, you would use:
  https://104.197.5.247/api/v1/namespaces/kube-system/services/elasticsearch-logging/proxy/_cluster/health?pretty=true
  The health information is similar to this:
  {
    "cluster_name" : "kubernetes_logging",
    "status" : "yellow",
    "timed_out" : false,
    "number_of_nodes" : 1,
    "number_of_data_nodes" : 1,
    "active_primary_shards" : 5,
    "active_shards" : 5,
    "relocating_shards" : 0,
    "initializing_shards" : 0,
    "unassigned_shards" : 5
  }
- To access the https Elasticsearch service health information _cluster/health?pretty=true, you would use:
  https://104.197.5.247/api/v1/namespaces/kube-system/services/https:elasticsearch-logging/proxy/_cluster/health?pretty=true
Using web browsers to access services running on the cluster
You may be able to put an apiserver proxy URL into the address bar of a browser. However:
- Web browsers cannot usually pass tokens, so you may need to use basic (password) auth. Apiserver can be configured to accept basic auth, but your cluster may not be configured to accept basic auth.
- Some web apps may not work, particularly those with client side javascript that construct URLs in a way that is unaware of the proxy path prefix.
4.10 - Monitoring, Logging, and Debugging
4.10.1 - Application Introspection and Debugging
Once your application is running, you'll inevitably need to debug problems with it.
Earlier we described how you can use kubectl get pods to retrieve simple status information about your pods. But there are a number of ways to get even more information about your application.
Using kubectl describe pod to fetch details about pods
For this example we'll use a Deployment to create two pods, similar to the earlier example.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 2
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        resources:
          limits:
            memory: "128Mi"
            cpu: "500m"
        ports:
        - containerPort: 80
Create the Deployment by running the following command:
kubectl apply -f https://k8s.io/examples/application/nginx-with-request.yaml
deployment.apps/nginx-deployment created
Check the Pod status with the following command:
kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx-deployment-67d4bdd6f5-cx2nz 1/1 Running 0 13s
nginx-deployment-67d4bdd6f5-w6kd7 1/1 Running 0 13s
We can retrieve a lot more information about each of these pods using kubectl describe pod. For example:
kubectl describe pod nginx-deployment-67d4bdd6f5-w6kd7
Name: nginx-deployment-67d4bdd6f5-w6kd7
Namespace: default
Priority: 0
Node: kube-worker-1/192.168.0.113
Start Time: Thu, 17 Feb 2022 16:51:01 -0500
Labels: app=nginx
pod-template-hash=67d4bdd6f5
Annotations: <none>
Status: Running
IP: 10.88.0.3
IPs:
IP: 10.88.0.3
IP: 2001:db8::1
Controlled By: ReplicaSet/nginx-deployment-67d4bdd6f5
Containers:
nginx:
Container ID: containerd://5403af59a2b46ee5a23fb0ae4b1e077f7ca5c5fb7af16e1ab21c00e0e616462a
Image: nginx
Image ID: docker.io/library/nginx@sha256:2834dc507516af02784808c5f48b7cbe38b8ed5d0f4837f16e78d00deb7e7767
Port: 80/TCP
Host Port: 0/TCP
State: Running
Started: Thu, 17 Feb 2022 16:51:05 -0500
Ready: True
Restart Count: 0
Limits:
cpu: 500m
memory: 128Mi
Requests:
cpu: 500m
memory: 128Mi
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bgsgp (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
kube-api-access-bgsgp:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Guaranteed
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 34s default-scheduler Successfully assigned default/nginx-deployment-67d4bdd6f5-w6kd7 to kube-worker-1
Normal Pulling 31s kubelet Pulling image "nginx"
Normal Pulled 30s kubelet Successfully pulled image "nginx" in 1.146417389s
Normal Created 30s kubelet Created container nginx
Normal Started 30s kubelet Started container nginx
Here you can see configuration information about the container(s) and Pod (labels, resource requirements, etc.), as well as status information about the container(s) and Pod (state, readiness, restart count, events, etc.).
The container state is one of Waiting, Running, or Terminated. Depending on the state, additional information will be provided -- here you can see that for a container in Running state, the system tells you when the container started.
Ready tells you whether the container passed its last readiness probe. (In this case, the container does not have a readiness probe configured; the container is assumed to be ready if no readiness probe is configured.)
Restart Count tells you how many times the container has been restarted; this information can be useful for detecting crash loops in containers that are configured with a restart policy of 'always.'
Currently the only Condition associated with a Pod is the binary Ready condition, which indicates that the pod is able to service requests and should be added to the load balancing pools of all matching services.
Lastly, you see a log of recent events related to your Pod. The system compresses multiple identical events by indicating the first and last time it was seen and the number of times it was seen. "From" indicates the component that is logging the event, "SubobjectPath" tells you which object (e.g. container within the pod) is being referred to, and "Reason" and "Message" tell you what happened.
Example: debugging Pending Pods
A common scenario that you can detect using events is when you've created a Pod that won't fit on any node. For example, the Pod might request more resources than are free on any node, or it might specify a label selector that doesn't match any nodes. Let's say we created the previous Deployment with 5 replicas (instead of 2) and requesting 600 millicores instead of 500, on a four-node cluster where each (virtual) machine has 1 CPU. In that case one of the Pods will not be able to schedule. (Note that because of the cluster addon pods such as fluentd, skydns, etc., that run on each node, if we requested 1000 millicores then none of the Pods would be able to schedule.)
kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx-deployment-1006230814-6winp 1/1 Running 0 7m
nginx-deployment-1006230814-fmgu3 1/1 Running 0 7m
nginx-deployment-1370807587-6ekbw 1/1 Running 0 1m
nginx-deployment-1370807587-fg172 0/1 Pending 0 1m
nginx-deployment-1370807587-fz9sd 0/1 Pending 0 1m
To find out why the nginx-deployment-1370807587-fz9sd pod is not running, we can use kubectl describe pod on the pending Pod and look at its events:
kubectl describe pod nginx-deployment-1370807587-fz9sd
Name: nginx-deployment-1370807587-fz9sd
Namespace: default
Node: /
Labels: app=nginx,pod-template-hash=1370807587
Status: Pending
IP:
Controllers: ReplicaSet/nginx-deployment-1370807587
Containers:
nginx:
Image: nginx
Port: 80/TCP
QoS Tier:
memory: Guaranteed
cpu: Guaranteed
Limits:
cpu: 1
memory: 128Mi
Requests:
cpu: 1
memory: 128Mi
Environment Variables:
Volumes:
default-token-4bcbi:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-4bcbi
Events:
FirstSeen LastSeen Count From SubobjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
1m 48s 7 {default-scheduler } Warning FailedScheduling pod (nginx-deployment-1370807587-fz9sd) failed to fit in any node
fit failure on node (kubernetes-node-6ta5): Node didn't have enough resource: CPU, requested: 1000, used: 1420, capacity: 2000
fit failure on node (kubernetes-node-wul5): Node didn't have enough resource: CPU, requested: 1000, used: 1100, capacity: 2000
Here you can see the event generated by the scheduler saying that the Pod failed to schedule for reason FailedScheduling (and possibly others). The message tells us that there were not enough resources for the Pod on any of the nodes.
To correct this situation, you can use kubectl scale to update your Deployment to specify four or fewer replicas. (Or you could leave the one Pod pending, which is harmless.)
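For example, a command along these lines brings the Deployment back within the cluster's capacity (a sketch; it assumes the Deployment name used in this example):
kubectl scale deployment nginx-deployment --replicas=4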
Events such as the ones you saw at the end of kubectl describe pod are persisted in etcd and provide high-level information on what is happening in the cluster. To list all events you can use
kubectl get events
but you have to remember that events are namespaced. This means that if you're interested in events for some namespaced object (e.g. what happened with Pods in namespace my-namespace) you need to explicitly provide a namespace to the command:
kubectl get events --namespace=my-namespace
To see events from all namespaces, you can use the --all-namespaces argument.
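Because the default ordering of events is not guaranteed, it can also help to sort them by creation time (a sketch using a standard kubectl flag):
kubectl get events --namespace=my-namespace --sort-by=.metadata.creationTimestamp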
In addition to kubectl describe pod, another way to get extra information about a pod (beyond what is provided by kubectl get pod) is to pass the -o yaml output format flag to kubectl get pod. This will give you, in YAML format, even more information than kubectl describe pod -- essentially all of the information the system has about the Pod. Here you will see things like annotations (which are key-value metadata without the label restrictions, used internally by Kubernetes system components), restart policy, ports, and volumes.
kubectl get pod nginx-deployment-1006230814-6winp -o yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2022-02-17T21:51:01Z"
  generateName: nginx-deployment-67d4bdd6f5-
  labels:
    app: nginx
    pod-template-hash: 67d4bdd6f5
  name: nginx-deployment-67d4bdd6f5-w6kd7
  namespace: default
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: nginx-deployment-67d4bdd6f5
    uid: 7d41dfd4-84c0-4be4-88ab-cedbe626ad82
  resourceVersion: "1364"
  uid: a6501da1-0447-4262-98eb-c03d4002222e
spec:
  containers:
  - image: nginx
    imagePullPolicy: Always
    name: nginx
    ports:
    - containerPort: 80
      protocol: TCP
    resources:
      limits:
        cpu: 500m
        memory: 128Mi
      requests:
        cpu: 500m
        memory: 128Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-bgsgp
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: kube-worker-1
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: kube-api-access-bgsgp
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-02-17T21:51:01Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2022-02-17T21:51:06Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2022-02-17T21:51:06Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2022-02-17T21:51:01Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://5403af59a2b46ee5a23fb0ae4b1e077f7ca5c5fb7af16e1ab21c00e0e616462a
    image: docker.io/library/nginx:latest
    imageID: docker.io/library/nginx@sha256:2834dc507516af02784808c5f48b7cbe38b8ed5d0f4837f16e78d00deb7e7767
    lastState: {}
    name: nginx
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2022-02-17T21:51:05Z"
  hostIP: 192.168.0.113
  phase: Running
  podIP: 10.88.0.3
  podIPs:
  - ip: 10.88.0.3
  - ip: 2001:db8::1
  qosClass: Guaranteed
  startTime: "2022-02-17T21:51:01Z"
Example: debugging a down/unreachable node
Sometimes when debugging it can be useful to look at the status of a node -- for example, because you've noticed strange behavior of a Pod that's running on the node, or to find out why a Pod won't schedule onto the node. As with Pods, you can use kubectl describe node and kubectl get node -o yaml to retrieve detailed information about nodes. For example, here's what you'll see if a node is down (disconnected from the network, or kubelet dies and won't restart, etc.). Notice the events that show the node is NotReady, and also notice that the pods are no longer running (they are evicted after five minutes of NotReady status).
kubectl get nodes
NAME STATUS ROLES AGE VERSION
kube-worker-1 NotReady <none> 1h v1.23.3
kubernetes-node-bols Ready <none> 1h v1.23.3
kubernetes-node-st6x Ready <none> 1h v1.23.3
kubernetes-node-unaj Ready <none> 1h v1.23.3
kubectl describe node kube-worker-1
Name: kube-worker-1
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/arch=amd64
kubernetes.io/hostname=kube-worker-1
kubernetes.io/os=linux
Annotations: kubeadm.alpha.kubernetes.io/cri-socket: /run/containerd/containerd.sock
node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Thu, 17 Feb 2022 16:46:30 -0500
Taints: node.kubernetes.io/unreachable:NoExecute
node.kubernetes.io/unreachable:NoSchedule
Unschedulable: false
Lease:
HolderIdentity: kube-worker-1
AcquireTime: <unset>
RenewTime: Thu, 17 Feb 2022 17:13:09 -0500
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Thu, 17 Feb 2022 17:09:13 -0500 Thu, 17 Feb 2022 17:09:13 -0500 WeaveIsUp Weave pod has set this
MemoryPressure Unknown Thu, 17 Feb 2022 17:12:40 -0500 Thu, 17 Feb 2022 17:13:52 -0500 NodeStatusUnknown Kubelet stopped posting node status.
DiskPressure Unknown Thu, 17 Feb 2022 17:12:40 -0500 Thu, 17 Feb 2022 17:13:52 -0500 NodeStatusUnknown Kubelet stopped posting node status.
PIDPressure Unknown Thu, 17 Feb 2022 17:12:40 -0500 Thu, 17 Feb 2022 17:13:52 -0500 NodeStatusUnknown Kubelet stopped posting node status.
Ready Unknown Thu, 17 Feb 2022 17:12:40 -0500 Thu, 17 Feb 2022 17:13:52 -0500 NodeStatusUnknown Kubelet stopped posting node status.
Addresses:
InternalIP: 192.168.0.113
Hostname: kube-worker-1
Capacity:
cpu: 2
ephemeral-storage: 15372232Ki
hugepages-2Mi: 0
memory: 2025188Ki
pods: 110
Allocatable:
cpu: 2
ephemeral-storage: 14167048988
hugepages-2Mi: 0
memory: 1922788Ki
pods: 110
System Info:
Machine ID: 9384e2927f544209b5d7b67474bbf92b
System UUID: aa829ca9-73d7-064d-9019-df07404ad448
Boot ID: 5a295a03-aaca-4340-af20-1327fa5dab5c
Kernel Version: 5.13.0-28-generic
OS Image: Ubuntu 21.10
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://1.5.9
Kubelet Version: v1.23.3
Kube-Proxy Version: v1.23.3
Non-terminated Pods: (4 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
default nginx-deployment-67d4bdd6f5-cx2nz 500m (25%) 500m (25%) 128Mi (6%) 128Mi (6%) 23m
default nginx-deployment-67d4bdd6f5-w6kd7 500m (25%) 500m (25%) 128Mi (6%) 128Mi (6%) 23m
kube-system kube-proxy-dnxbz 0 (0%) 0 (0%) 0 (0%) 0 (0%) 28m
kube-system weave-net-gjxxp 100m (5%) 0 (0%) 200Mi (10%) 0 (0%) 28m
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 1100m (55%) 1 (50%)
memory 456Mi (24%) 256Mi (13%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
Events:
...
kubectl get node kube-worker-1 -o yaml
apiVersion: v1
kind: Node
metadata:
  annotations:
    kubeadm.alpha.kubernetes.io/cri-socket: /run/containerd/containerd.sock
    node.alpha.kubernetes.io/ttl: "0"
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: "2022-02-17T21:46:30Z"
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/os: linux
    kubernetes.io/arch: amd64
    kubernetes.io/hostname: kube-worker-1
    kubernetes.io/os: linux
  name: kube-worker-1
  resourceVersion: "4026"
  uid: 98efe7cb-2978-4a0b-842a-1a7bf12c05f8
spec: {}
status:
  addresses:
  - address: 192.168.0.113
    type: InternalIP
  - address: kube-worker-1
    type: Hostname
  allocatable:
    cpu: "2"
    ephemeral-storage: "14167048988"
    hugepages-2Mi: "0"
    memory: 1922788Ki
    pods: "110"
  capacity:
    cpu: "2"
    ephemeral-storage: 15372232Ki
    hugepages-2Mi: "0"
    memory: 2025188Ki
    pods: "110"
  conditions:
  - lastHeartbeatTime: "2022-02-17T22:20:32Z"
    lastTransitionTime: "2022-02-17T22:20:32Z"
    message: Weave pod has set this
    reason: WeaveIsUp
    status: "False"
    type: NetworkUnavailable
  - lastHeartbeatTime: "2022-02-17T22:20:15Z"
    lastTransitionTime: "2022-02-17T22:13:25Z"
    message: kubelet has sufficient memory available
    reason: KubeletHasSufficientMemory
    status: "False"
    type: MemoryPressure
  - lastHeartbeatTime: "2022-02-17T22:20:15Z"
    lastTransitionTime: "2022-02-17T22:13:25Z"
    message: kubelet has no disk pressure
    reason: KubeletHasNoDiskPressure
    status: "False"
    type: DiskPressure
  - lastHeartbeatTime: "2022-02-17T22:20:15Z"
    lastTransitionTime: "2022-02-17T22:13:25Z"
    message: kubelet has sufficient PID available
    reason: KubeletHasSufficientPID
    status: "False"
    type: PIDPressure
  - lastHeartbeatTime: "2022-02-17T22:20:15Z"
    lastTransitionTime: "2022-02-17T22:15:15Z"
    message: kubelet is posting ready status. AppArmor enabled
    reason: KubeletReady
    status: "True"
    type: Ready
  daemonEndpoints:
    kubeletEndpoint:
      Port: 10250
  nodeInfo:
    architecture: amd64
    bootID: 22333234-7a6b-44d4-9ce1-67e31dc7e369
    containerRuntimeVersion: containerd://1.5.9
    kernelVersion: 5.13.0-28-generic
    kubeProxyVersion: v1.23.3
    kubeletVersion: v1.23.3
    machineID: 9384e2927f544209b5d7b67474bbf92b
    operatingSystem: linux
    osImage: Ubuntu 21.10
    systemUUID: aa829ca9-73d7-064d-9019-df07404ad448
What's next
Learn about additional debugging tools.
4.10.2 - Auditing
Kubernetes auditing provides a security-relevant, chronological set of records documenting the sequence of actions in a cluster. The cluster audits the activities generated by users, by applications that use the Kubernetes API, and by the control plane itself.
Auditing allows cluster administrators to answer the following questions:
- what happened?
- when did it happen?
- who initiated it?
- on what did it happen?
- where was it observed?
- from where was it initiated?
- to where was it going?
Audit records begin their lifecycle inside the kube-apiserver component. Each request on each stage of its execution generates an audit event, which is then pre-processed according to a certain policy and written to a backend. The policy determines what's recorded and the backends persist the records. The current backend implementations include log files and webhooks.
Each request can be recorded with an associated stage. The defined stages are:
RequestReceived
- The stage for events generated as soon as the audit handler receives the request, and before it is delegated down the handler chain.ResponseStarted
- Once the response headers are sent, but before the response body is sent. This stage is only generated for long-running requests (e.g. watch).ResponseComplete
- The response body has been completed and no more bytes will be sent.Panic
- Events generated when a panic occurred.
The audit logging feature increases the memory consumption of the API server because some context required for auditing is stored for each request. Memory consumption depends on the audit logging configuration.
Audit policy
Audit policy defines rules about what events should be recorded and what data they should include. The audit policy object structure is defined in the audit.k8s.io API group.
When an event is processed, it's compared against the list of rules in order. The first matching rule sets the audit level of the event. The defined audit levels are:
- None - don't log events that match this rule.
- Metadata - log request metadata (requesting user, timestamp, resource, verb, etc.) but not request or response body.
- Request - log event metadata and request body but not response body. This does not apply for non-resource requests.
- RequestResponse - log event metadata, request and response bodies. This does not apply for non-resource requests.
You can pass a file with the policy to kube-apiserver using the --audit-policy-file flag. If the flag is omitted, no events are logged. Note that the rules field must be provided in the audit policy file. A policy with no (0) rules is treated as illegal.
Below is an example audit policy file:
apiVersion: audit.k8s.io/v1 # This is required.
kind: Policy
# Don't generate audit events for all requests in RequestReceived stage.
omitStages:
  - "RequestReceived"
rules:
  # Log pod changes at RequestResponse level
  - level: RequestResponse
    resources:
    - group: ""
      # Resource "pods" doesn't match requests to any subresource of pods,
      # which is consistent with the RBAC policy.
      resources: ["pods"]
  # Log "pods/log", "pods/status" at Metadata level
  - level: Metadata
    resources:
    - group: ""
      resources: ["pods/log", "pods/status"]
  # Don't log requests to a configmap called "controller-leader"
  - level: None
    resources:
    - group: ""
      resources: ["configmaps"]
      resourceNames: ["controller-leader"]
  # Don't log watch requests by the "system:kube-proxy" on endpoints or services
  - level: None
    users: ["system:kube-proxy"]
    verbs: ["watch"]
    resources:
    - group: "" # core API group
      resources: ["endpoints", "services"]
  # Don't log authenticated requests to certain non-resource URL paths.
  - level: None
    userGroups: ["system:authenticated"]
    nonResourceURLs:
    - "/api*" # Wildcard matching.
    - "/version"
  # Log the request body of configmap changes in kube-system.
  - level: Request
    resources:
    - group: "" # core API group
      resources: ["configmaps"]
    # This rule only applies to resources in the "kube-system" namespace.
    # The empty string "" can be used to select non-namespaced resources.
    namespaces: ["kube-system"]
  # Log configmap and secret changes in all other namespaces at the Metadata level.
  - level: Metadata
    resources:
    - group: "" # core API group
      resources: ["secrets", "configmaps"]
  # Log all other resources in core and extensions at the Request level.
  - level: Request
    resources:
    - group: "" # core API group
    - group: "extensions" # Version of group should NOT be included.
  # A catch-all rule to log all other requests at the Metadata level.
  - level: Metadata
    # Long-running requests like watches that fall under this rule will not
    # generate an audit event in RequestReceived.
    omitStages:
      - "RequestReceived"
You can use a minimal audit policy file to log all requests at the Metadata level:
# Log all requests at the Metadata level.
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
- level: Metadata
If you're crafting your own audit profile, you can use the audit profile for Google Container-Optimized OS as a starting point. You can check the configure-helper.sh script, which generates an audit policy file. You can see most of the audit policy file by looking directly at the script.
You can also refer to the Policy configuration reference for details about the fields defined.
Audit backends
Audit backends persist audit events to an external storage. Out of the box, the kube-apiserver provides two backends:
- Log backend, which writes events into the filesystem
- Webhook backend, which sends events to an external HTTP API
In all cases, audit events follow a structure defined by the Kubernetes API in the audit.k8s.io API group.
In the case of patches, the request body is a JSON array with patch operations, not a JSON object with an appropriate Kubernetes API object. For example, the following request body is a valid patch request to /apis/batch/v1/namespaces/some-namespace/jobs/some-job-name:
[
  {
    "op": "replace",
    "path": "/spec/parallelism",
    "value": 0
  },
  {
    "op": "remove",
    "path": "/spec/template/spec/containers/0/terminationMessagePolicy"
  }
]
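For reference, a request body like this is what the API server sees when a client issues a JSON patch; for example, a command along these lines would generate the first operation (a sketch; the Job name and namespace mirror the path above):
kubectl patch job some-job-name --namespace=some-namespace --type=json -p='[{"op": "replace", "path": "/spec/parallelism", "value": 0}]'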
Log backend
The log backend writes audit events to a file in JSONlines format.
You can configure the log audit backend using the following kube-apiserver flags:
- --audit-log-path specifies the log file path that the log backend uses to write audit events. Not specifying this flag disables the log backend. - means standard out
- --audit-log-maxage defines the maximum number of days to retain old audit log files
- --audit-log-maxbackup defines the maximum number of audit log files to retain
- --audit-log-maxsize defines the maximum size in megabytes of the audit log file before it gets rotated
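For example, a control plane that keeps up to ten 100 MB log files for 30 days might pass (illustrative values, not recommendations):
--audit-log-path=/var/log/kubernetes/audit/audit.log
--audit-log-maxage=30
--audit-log-maxbackup=10
--audit-log-maxsize=100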
If your cluster's control plane runs the kube-apiserver as a Pod, remember to mount the hostPath to the location of the policy file and log file, so that audit records are persisted. For example:
--audit-policy-file=/etc/kubernetes/audit-policy.yaml \
--audit-log-path=/var/log/kubernetes/audit/audit.log
then mount the volumes:
...
volumeMounts:
- mountPath: /etc/kubernetes/audit-policy.yaml
  name: audit
  readOnly: true
- mountPath: /var/log/kubernetes/audit/
  name: audit-log
  readOnly: false
and finally configure the hostPath:
...
volumes:
- name: audit
  hostPath:
    path: /etc/kubernetes/audit-policy.yaml
    type: File
- name: audit-log
  hostPath:
    path: /var/log/kubernetes/audit/
    type: DirectoryOrCreate
Webhook backend
The webhook audit backend sends audit events to a remote web API, which is assumed to be a form of the Kubernetes API, including means of authentication. You can configure a webhook audit backend using the following kube-apiserver flags:
- --audit-webhook-config-file specifies the path to a file with a webhook configuration. The webhook configuration is effectively a specialized kubeconfig.
- --audit-webhook-initial-backoff specifies the amount of time to wait after the first failed request before retrying. Subsequent requests are retried with exponential backoff.
The webhook config file uses the kubeconfig format to specify the remote address of the service and credentials used to connect to it.
Event batching
Both log and webhook backends support batching. Using webhook as an example, here's the list of available flags. To get the same flag for the log backend, replace webhook with log in the flag name. By default, batching is enabled in webhook and disabled in log. Similarly, by default throttling is enabled in webhook and disabled in log.
- --audit-webhook-mode defines the buffering strategy. One of the following:
  - batch - buffer events and asynchronously process them in batches. This is the default.
  - blocking - block API server responses on processing each individual event.
  - blocking-strict - same as blocking, but when there is a failure during audit logging at the RequestReceived stage, the whole request to the kube-apiserver fails.
The following flags are used only in the batch mode:
- --audit-webhook-batch-buffer-size defines the number of events to buffer before batching. If the rate of incoming events overflows the buffer, events are dropped.
- --audit-webhook-batch-max-size defines the maximum number of events in one batch.
- --audit-webhook-batch-max-wait defines the maximum amount of time to wait before unconditionally batching events in the queue.
- --audit-webhook-batch-throttle-qps defines the maximum average number of batches generated per second.
- --audit-webhook-batch-throttle-burst defines the maximum number of batches generated at the same moment if the allowed QPS was underutilized previously.
Parameter tuning
Parameters should be set to accommodate the load on the API server. For example, if kube-apiserver receives 100 requests each second, and each request is audited only on the ResponseStarted and ResponseComplete stages, you should account for ≅200 audit events being generated each second. Assuming that there are up to 100 events in a batch, you should set the throttling level to at least 2 queries per second. Assuming that the backend can take up to 5 seconds to write events, you should set the buffer size to hold up to 5 seconds of events; that is: 10 batches, or 1000 events.
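Expressed as flags, that worked example corresponds to something like (a sketch based on the numbers above):
--audit-webhook-mode=batch
--audit-webhook-batch-max-size=100
--audit-webhook-batch-throttle-qps=2
--audit-webhook-batch-buffer-size=1000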
In most cases however, the default parameters should be sufficient and you don't have to worry about setting them manually. You can look at the following Prometheus metrics exposed by kube-apiserver and in the logs to monitor the state of the auditing subsystem.
- apiserver_audit_event_total metric contains the total number of audit events exported.
- apiserver_audit_error_total metric contains the total number of events dropped due to an error during exporting.
Log entry truncation
Both log and webhook backends support limiting the size of events that are logged. As an example, the following is the list of flags available for the log backend:
- audit-log-truncate-enabled whether event and batch truncating is enabled.
- audit-log-truncate-max-batch-size maximum size in bytes of the batch sent to the underlying backend.
- audit-log-truncate-max-event-size maximum size in bytes of the audit event sent to the underlying backend.
By default, truncate is disabled in both webhook and log; a cluster administrator should set audit-log-truncate-enabled or audit-webhook-truncate-enabled to enable the feature.
What's next
- Learn about Mutating webhook auditing annotations.
- Learn more about the Event and Policy resource types by reading the Audit configuration reference.
4.10.3 - Debug a StatefulSet
This task shows you how to debug a StatefulSet.
Before you begin
- You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster.
- You should have a StatefulSet running that you want to investigate.
Debugging a StatefulSet
In order to list all the pods which belong to a StatefulSet, which have a label app=myapp set on them, you can use the following:
kubectl get pods -l app=myapp
If you find that any Pods listed are in Unknown or Terminating state for an extended period of time, refer to the Deleting StatefulSet Pods task for instructions on how to deal with them.
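It can also be useful to look at the StatefulSet object itself and its recent events (a sketch; myapp stands in for your StatefulSet's name):
kubectl describe statefulset myapp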
You can debug individual Pods in a StatefulSet using the Debugging Pods guide.
What's next
Learn more about debugging an init-container.
4.10.4 - Debug Init Containers
This page shows how to investigate problems related to the execution of Init Containers. The example command lines below refer to the Pod as <pod-name> and the Init Containers as <init-container-1> and <init-container-2>.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
To check the version, enter kubectl version.
- You should be familiar with the basics of Init Containers.
- You should have Configured an Init Container.
Checking the status of Init Containers
Display the status of your pod:
kubectl get pod <pod-name>
For example, a status of Init:1/2 indicates that one of two Init Containers has completed successfully:
NAME READY STATUS RESTARTS AGE
<pod-name> 0/1 Init:1/2 0 7s
See Understanding Pod status for more examples of status values and their meanings.
Getting details about Init Containers
View more detailed information about Init Container execution:
kubectl describe pod <pod-name>
For example, a Pod with two Init Containers might show the following:
Init Containers:
<init-container-1>:
Container ID: ...
...
State: Terminated
Reason: Completed
Exit Code: 0
Started: ...
Finished: ...
Ready: True
Restart Count: 0
...
<init-container-2>:
Container ID: ...
...
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: ...
Finished: ...
Ready: False
Restart Count: 3
...
You can also access the Init Container statuses programmatically by reading the status.initContainerStatuses field on the Pod Spec:
kubectl get pod nginx --template '{{.status.initContainerStatuses}}'
This command will return the same information as above in raw JSON.
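If you only want the state of one Init Container, a JSONPath expression can narrow the output (a sketch; nginx is the same example Pod name, and the index 0 selects the first Init Container):
kubectl get pod nginx -o jsonpath='{.status.initContainerStatuses[0].state}'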
Accessing logs from Init Containers
Pass the Init Container name along with the Pod name to access its logs.
kubectl logs <pod-name> -c <init-container-2>
Init Containers that run a shell script print commands as they're executed. For example, you can do this in Bash by running set -x at the beginning of the script.
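If an Init Container has been restarted, you can also fetch the logs of its previous run with the --previous flag (a sketch using the same placeholder names):
kubectl logs <pod-name> -c <init-container-2> --previous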
Understanding Pod status
A Pod status beginning with Init: summarizes the status of Init Container execution. The table below describes some example status values that you might see while debugging Init Containers.
Status | Meaning
---|---
Init:N/M | The Pod has M Init Containers, and N have completed so far.
Init:Error | An Init Container has failed to execute.
Init:CrashLoopBackOff | An Init Container has failed repeatedly.
Pending | The Pod has not yet begun executing Init Containers.
PodInitializing or Running | The Pod has already finished executing Init Containers.
4.10.5 - Debug Pods and ReplicationControllers
This page shows how to debug Pods and ReplicationControllers.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
To check the version, enter kubectl version.
- You should be familiar with the basics of Pods and with Pods' lifecycles.
Debugging Pods
The first step in debugging a pod is taking a look at it. Check the current state of the pod and recent events with the following command:
kubectl describe pods ${POD_NAME}
Look at the state of the containers in the pod. Are they all Running? Have there been recent restarts?
Continue debugging depending on the state of the pods.
My pod stays pending
If a pod is stuck in Pending it means that it cannot be scheduled onto a node. Generally this is because there are insufficient resources of one type or another that prevent scheduling. Look at the output of the kubectl describe ... command above. There should be messages from the scheduler about why it cannot schedule your pod. Reasons include:
Insufficient resources
You may have exhausted the supply of CPU or Memory in your cluster. In this case you can try several things:
- Add more nodes to the cluster.
- Terminate unneeded pods to make room for pending pods.
- Check that the pod is not larger than your nodes. For example, if all nodes have a capacity of cpu: 1, then a pod with a request of cpu: 1.1 will never be scheduled. You can check node capacities with the kubectl get nodes -o <format> command. Here are some example command lines that extract the necessary information:
  kubectl get nodes -o yaml | egrep '\sname:|cpu:|memory:'
  kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, cap: .status.capacity}'
The resource quota feature can be configured to limit the total amount of resources that can be consumed. If used in conjunction with namespaces, it can prevent one team from hogging all the resources.
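You can also check how much of a node's capacity is already committed by looking at the Allocated resources section of the node description (a sketch; the node name is a placeholder):
kubectl describe node <node-name>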
Using hostPort
When you bind a pod to a hostPort there are a limited number of places that the pod can be scheduled. In most cases, hostPort is unnecessary; try using a service object to expose your pod. If you do require hostPort then you can only schedule as many pods as there are nodes in your container cluster.
My pod stays waiting
If a pod is stuck in the Waiting state, then it has been scheduled to a worker node, but it can't run on that machine. Again, the information from kubectl describe ... should be informative. The most common cause of Waiting pods is a failure to pull the image. There are three things to check:
- Make sure that you have the name of the image correct.
- Have you pushed the image to the repository?
- Try to manually pull the image to see if it can be pulled. For example, if you use Docker on your PC, run docker pull <image>.
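Image pull failures also show up as events, so another way to check is to filter the events for the affected Pod (a sketch using a standard field selector):
kubectl get events --field-selector involvedObject.name=<pod-name>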
My pod is crashing or otherwise unhealthy
Once your pod has been scheduled, the methods described in Debug Running Pods are available for debugging.
Debugging ReplicationControllers
ReplicationControllers are fairly straightforward. They can either create pods or they can't. If they can't create pods, then please refer to the instructions above to debug your pods.
You can also use kubectl describe rc ${CONTROLLER_NAME} to inspect events related to the replication controller.
4.10.6 - Debug Running Pods
This page explains how to debug Pods running (or crashing) on a Node.
Before you begin
- Your Pod should already be scheduled and running. If your Pod is not yet running, start with Troubleshoot Applications.
- For some of the advanced debugging steps you need to know on which Node the Pod is running and have shell access to run commands on that Node. You don't need that access to run the standard debug steps that use kubectl.
Examining pod logs
First, look at the logs of the affected container:
kubectl logs ${POD_NAME} ${CONTAINER_NAME}
If your container has previously crashed, you can access the previous container's crash log with:
kubectl logs --previous ${POD_NAME} ${CONTAINER_NAME}
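When a container restarts frequently, it can also help to limit and timestamp the output (a sketch using standard kubectl logs flags):
kubectl logs ${POD_NAME} ${CONTAINER_NAME} --timestamps --tail=50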
Debugging with container exec
If the container image includes debugging utilities, as is the case with images built from Linux and Windows OS base images, you can run commands inside a specific container with kubectl exec:
kubectl exec ${POD_NAME} -c ${CONTAINER_NAME} -- ${CMD} ${ARG1} ${ARG2} ... ${ARGN}
-c ${CONTAINER_NAME} is optional. You can omit it for Pods that only contain a single container.
As an example, to look at the logs from a running Cassandra pod, you might run
kubectl exec cassandra -- cat /var/log/cassandra/system.log
You can run a shell that's connected to your terminal using the -i and -t arguments to kubectl exec, for example:
kubectl exec -it cassandra -- sh
For more details, see Get a Shell to a Running Container.
Debugging with an ephemeral debug container
Kubernetes v1.23 [beta]
Ephemeral containers are useful for interactive troubleshooting when kubectl exec is insufficient because a container has crashed or a container image doesn't include debugging utilities, such as with distroless images.
Example debugging using ephemeral containers
You can use the kubectl debug command to add ephemeral containers to a running Pod. First, create a pod for the example:
kubectl run ephemeral-demo --image=k8s.gcr.io/pause:3.1 --restart=Never
The examples in this section use the pause container image because it does not contain debugging utilities, but this method works with all container images.
If you attempt to use kubectl exec to create a shell you will see an error because there is no shell in this container image.
kubectl exec -it ephemeral-demo -- sh
OCI runtime exec failed: exec failed: container_linux.go:346: starting container process caused "exec: \"sh\": executable file not found in $PATH": unknown
You can instead add a debugging container using kubectl debug. If you specify the -i/--interactive argument, kubectl will automatically attach to the console of the Ephemeral Container.
kubectl debug -it ephemeral-demo --image=busybox:1.28 --target=ephemeral-demo
Defaulting debug container name to debugger-8xzrl.
If you don't see a command prompt, try pressing enter.
/ #
This command adds a new busybox container and attaches to it. The --target parameter targets the process namespace of another container. It's necessary here because kubectl run does not enable process namespace sharing in the pod it creates.
The --target parameter must be supported by the Container Runtime. When not supported, the Ephemeral Container may not be started, or it may be started with an isolated process namespace so that ps does not reveal processes in other containers.
You can view the state of the newly created ephemeral container using kubectl describe:
kubectl describe pod ephemeral-demo
...
Ephemeral Containers:
debugger-8xzrl:
Container ID: docker://b888f9adfd15bd5739fefaa39e1df4dd3c617b9902082b1cfdc29c4028ffb2eb
Image: busybox
Image ID: docker-pullable://busybox@sha256:1828edd60c5efd34b2bf5dd3282ec0cc04d47b2ff9caa0b6d4f07a21d1c08084
Port: <none>
Host Port: <none>
State: Running
Started: Wed, 12 Feb 2020 14:25:42 +0100
Ready: False
Restart Count: 0
Environment: <none>
Mounts: <none>
...
Use kubectl delete to remove the Pod when you're finished:
kubectl delete pod ephemeral-demo
Debugging using a copy of the Pod
Sometimes Pod configuration options make it difficult to troubleshoot in certain situations. For example, you can't run kubectl exec to troubleshoot your container if your container image does not include a shell or if your application crashes on startup. In these situations you can use kubectl debug to create a copy of the Pod with configuration values changed to aid debugging.
Copying a Pod while adding a new container
Adding a new container can be useful when your application is running but not behaving as you expect and you'd like to add additional troubleshooting utilities to the Pod.
For example, maybe your application's container images are built on busybox but you need debugging utilities not included in busybox. You can simulate this scenario using kubectl run:
kubectl run myapp --image=busybox:1.28 --restart=Never -- sleep 1d
Run this command to create a copy of myapp named myapp-debug that adds a new Ubuntu container for debugging:
kubectl debug myapp -it --image=ubuntu --share-processes --copy-to=myapp-debug
Defaulting debug container name to debugger-w7xmf.
If you don't see a command prompt, try pressing enter.
root@myapp-debug:/#
- kubectl debug automatically generates a container name if you don't choose one using the --container flag.
- The -i flag causes kubectl debug to attach to the new container by default. You can prevent this by specifying --attach=false. If your session becomes disconnected you can reattach using kubectl attach.
- The --share-processes flag allows the containers in this Pod to see processes from the other containers in the Pod. For more information about how this works, see Share Process Namespace between Containers in a Pod.
Don't forget to clean up the debugging Pod when you're finished with it:
kubectl delete pod myapp myapp-debug
Copying a Pod while changing its command
Sometimes it's useful to change the command for a container, for example to add a debugging flag or because the application is crashing.
To simulate a crashing application, use kubectl run to create a container that immediately exits:
kubectl run --image=busybox:1.28 myapp -- false
You can see using kubectl describe pod myapp that this container is crashing:
Containers:
myapp:
Image: busybox
...
Args:
false
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
You can use kubectl debug to create a copy of this Pod with the command changed to an interactive shell:
kubectl debug myapp -it --copy-to=myapp-debug --container=myapp -- sh
If you don't see a command prompt, try pressing enter.
/ #
Now you have an interactive shell that you can use to perform tasks like checking filesystem paths or running the container command manually.
- To change the command of a specific container you must specify its name using --container or kubectl debug will instead create a new container to run the command you specified.
- The -i flag causes kubectl debug to attach to the container by default. You can prevent this by specifying --attach=false. If your session becomes disconnected you can reattach using kubectl attach.
Don't forget to clean up the debugging Pod when you're finished with it:
kubectl delete pod myapp myapp-debug
Copying a Pod while changing container images
In some situations you may want to change a misbehaving Pod from its normal production container images to an image containing a debugging build or additional utilities.
As an example, create a Pod using kubectl run:
kubectl run myapp --image=busybox:1.28 --restart=Never -- sleep 1d
Now use kubectl debug to make a copy and change its container image to ubuntu:
kubectl debug myapp --copy-to=myapp-debug --set-image=*=ubuntu
The syntax of --set-image uses the same container_name=image syntax as kubectl set image. *=ubuntu means change the image of all containers to ubuntu.
Don't forget to clean up the debugging Pod when you're finished with it:
kubectl delete pod myapp myapp-debug
Debugging via a shell on the node
If none of these approaches work, you can find the Node on which the Pod is running and create a privileged Pod running in the host namespaces. To create an interactive shell on a node using kubectl debug, run:
kubectl debug node/mynode -it --image=ubuntu
Creating debugging pod node-debugger-mynode-pdx84 with container debugger on node mynode.
If you don't see a command prompt, try pressing enter.
root@ek8s:/#
When creating a debugging session on a node, keep in mind that:
- kubectl debug automatically generates the name of the new Pod based on the name of the Node.
- The container runs in the host IPC, Network, and PID namespaces.
- The root filesystem of the Node will be mounted at /host.
Don't forget to clean up the debugging Pod when you're finished with it:
kubectl delete pod node-debugger-mynode-pdx84
4.10.7 - Debug Services
An issue that comes up rather frequently for new installations of Kubernetes is that a Service is not working properly. You've run your Pods through a Deployment (or other workload controller) and created a Service, but you get no response when you try to access it. This document will hopefully help you to figure out what's going wrong.
Running commands in a Pod
For many steps here you will want to see what a Pod running in the cluster sees. The simplest way to do this is to run an interactive busybox Pod:
kubectl run -it --rm --restart=Never busybox --image=gcr.io/google-containers/busybox sh
If you already have a running Pod that you prefer to use, you can run a command in it using:
kubectl exec <POD-NAME> -c <CONTAINER-NAME> -- <COMMAND>
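For example, to test whether a Service answers from inside the cluster, you might run something like the following (a sketch; substitute your own Pod, Service, and port, and note it assumes the image provides wget, as busybox does):
kubectl exec <POD-NAME> -- wget -qO- <SERVICE-NAME>:<PORT>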
Setup
For the purposes of this walk-through, let's run some Pods. Since you're probably debugging your own Service you can substitute your own details, or you can follow along and get a second data point.
kubectl create deployment hostnames --image=k8s.gcr.io/serve_hostname
deployment.apps/hostnames created
kubectl commands will print the type and name of the resource created or mutated, which can then be used in subsequent commands.
Let's scale the deployment to 3 replicas.
kubectl scale deployment hostnames --replicas=3
deployment.apps/hostnames scaled
Note that this is the same as if you had started the Deployment with the following YAML:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: hostnames
  name: hostnames
spec:
  selector:
    matchLabels:
      app: hostnames
  replicas: 3
  template:
    metadata:
      labels:
        app: hostnames
    spec:
      containers:
      - name: hostnames
        image: k8s.gcr.io/serve_hostname
The label "app" is automatically set by kubectl create deployment to the name of the Deployment.
You can confirm your Pods are running:
kubectl get pods -l app=hostnames
NAME READY STATUS RESTARTS AGE
hostnames-632524106-bbpiw 1/1 Running 0 2m
hostnames-632524106-ly40y 1/1 Running 0 2m
hostnames-632524106-tlaok 1/1 Running 0 2m
You can also confirm that your Pods are serving. You can get the list of Pod IP addresses and test them directly.
kubectl get pods -l app=hostnames \
-o go-template='{{range .items}}{{.status.podIP}}{{"\n"}}{{end}}'
10.244.0.5
10.244.0.6
10.244.0.7
The example container used for this walk-through serves its own hostname via HTTP on port 9376, but if you are debugging your own app, you'll want to use whatever port number your Pods are listening on.
From within a pod:
for ep in 10.244.0.5:9376 10.244.0.6:9376 10.244.0.7:9376; do
wget -qO- $ep
done
This should produce something like:
hostnames-632524106-bbpiw
hostnames-632524106-ly40y
hostnames-632524106-tlaok
If you are not getting the responses you expect at this point, your Pods might not be healthy or might not be listening on the port you think they are. You might find kubectl logs to be useful for seeing what is happening, or perhaps you need to kubectl exec directly into your Pods and debug from there.
Assuming everything has gone to plan so far, you can start to investigate why your Service doesn't work.
Does the Service exist?
The astute reader will have noticed that you did not actually create a Service yet - that is intentional. This is a step that sometimes gets forgotten, and is the first thing to check.
What would happen if you tried to access a non-existent Service? If you have another Pod that consumes this Service by name you would get something like:
wget -O- hostnames
Resolving hostnames (hostnames)... failed: Name or service not known.
wget: unable to resolve host address 'hostnames'
The first thing to check is whether that Service actually exists:
kubectl get svc hostnames
No resources found.
Error from server (NotFound): services "hostnames" not found
Let's create the Service. As before, this is for the walk-through - you can use your own Service's details here.
kubectl expose deployment hostnames --port=80 --target-port=9376
service/hostnames exposed
And read it back:
kubectl get svc hostnames
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
hostnames ClusterIP 10.0.1.175 <none> 80/TCP 5s
Now you know that the Service exists.
As before, this is the same as if you had started the Service with YAML:
apiVersion: v1
kind: Service
metadata:
  labels:
    app: hostnames
  name: hostnames
spec:
  selector:
    app: hostnames
  ports:
  - name: default
    protocol: TCP
    port: 80
    targetPort: 9376
In order to highlight the full range of configuration, the Service you created here uses a different port number than the Pods. For many real-world Services, these values might be the same.
Any Network Policy Ingress rules affecting the target Pods?
If you have deployed any Network Policy Ingress rules which may affect incoming traffic to hostnames-* Pods, these need to be reviewed. Please refer to Network Policies for more details.
Does the Service work by DNS name?
One of the most common ways that clients consume a Service is through a DNS name.
From a Pod in the same Namespace:
nslookup hostnames
Address 1: 10.0.0.10 kube-dns.kube-system.svc.cluster.local
Name: hostnames
Address 1: 10.0.1.175 hostnames.default.svc.cluster.local
If this fails, perhaps your Pod and Service are in different Namespaces, try a namespace-qualified name (again, from within a Pod):
nslookup hostnames.default
Address 1: 10.0.0.10 kube-dns.kube-system.svc.cluster.local
Name: hostnames.default
Address 1: 10.0.1.175 hostnames.default.svc.cluster.local
If this works, you'll need to adjust your app to use a cross-namespace name, or run your app and Service in the same Namespace. If this still fails, try a fully-qualified name:
nslookup hostnames.default.svc.cluster.local
Address 1: 10.0.0.10 kube-dns.kube-system.svc.cluster.local
Name: hostnames.default.svc.cluster.local
Address 1: 10.0.1.175 hostnames.default.svc.cluster.local
Note the suffix here: "default.svc.cluster.local". The "default" is the Namespace you're operating in. The "svc" denotes that this is a Service. The "cluster.local" is your cluster domain, which could be different in your own cluster.
You can also try this from a Node in the cluster:
nslookup hostnames.default.svc.cluster.local 10.0.0.10
Server: 10.0.0.10
Address: 10.0.0.10#53
Name: hostnames.default.svc.cluster.local
Address: 10.0.1.175
If you are able to do a fully-qualified name lookup but not a relative one, you
need to check that your /etc/resolv.conf
file in your Pod is correct. From
within a Pod:
cat /etc/resolv.conf
You should see something like:
nameserver 10.0.0.10
search default.svc.cluster.local svc.cluster.local cluster.local example.com
options ndots:5
The nameserver
line must indicate your cluster's DNS Service. This is
passed into kubelet
with the --cluster-dns
flag.
The search
line must include an appropriate suffix for you to find the
Service name. In this case it is looking for Services in the local
Namespace ("default.svc.cluster.local"), Services in all Namespaces
("svc.cluster.local"), and lastly for names in the cluster ("cluster.local").
Depending on your own install you might have additional records after that (up
to 6 total). The cluster suffix is passed into kubelet
with the
--cluster-domain
flag. Throughout this document, the cluster suffix is
assumed to be "cluster.local". Your own clusters might be configured
differently, in which case you should change that in all of the previous
commands.
The options
line must set ndots
high enough that your DNS client library
considers search paths at all. Kubernetes sets this to 5 by default, which is
high enough to cover all of the DNS names it generates.
Does any Service work by DNS name?
If the above still fails, DNS lookups are not working for your Service. You can take a step back and see what else is not working. The Kubernetes master Service should always work. From within a Pod:
nslookup kubernetes.default
Server: 10.0.0.10
Address 1: 10.0.0.10 kube-dns.kube-system.svc.cluster.local
Name: kubernetes.default
Address 1: 10.0.0.1 kubernetes.default.svc.cluster.local
If this fails, please see the kube-proxy section of this document, or even go back to the top of this document and start over, but instead of debugging your own Service, debug the DNS Service.
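As a first step in debugging the DNS Service itself, you can check that the DNS Pods are running (the k8s-app=kube-dns label is conventionally used by both kube-dns and CoreDNS deployments on most clusters):
kubectl get pods --namespace=kube-system -l k8s-app=kube-dns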
Does the Service work by IP?
Assuming you have confirmed that DNS works, the next thing to test is whether your
Service works by its IP address. From a Pod in your cluster, access the
Service's IP (from kubectl get
above).
for i in $(seq 1 3); do
wget -qO- 10.0.1.175:80
done
This should produce something like:
hostnames-632524106-bbpiw
hostnames-632524106-ly40y
hostnames-632524106-tlaok
If your Service is working, you should get correct responses. If not, there are a number of things that could be going wrong. Read on.
Is the Service defined correctly?
It might sound silly, but you should really double and triple check that your Service is correct and matches your Pod's port. Read back your Service and verify it:
kubectl get service hostnames -o json
{
"kind": "Service",
"apiVersion": "v1",
"metadata": {
"name": "hostnames",
"namespace": "default",
"uid": "428c8b6c-24bc-11e5-936d-42010af0a9bc",
"resourceVersion": "347189",
"creationTimestamp": "2015-07-07T15:24:29Z",
"labels": {
"app": "hostnames"
}
},
"spec": {
"ports": [
{
"name": "default",
"protocol": "TCP",
"port": 80,
"targetPort": 9376,
"nodePort": 0
}
],
"selector": {
"app": "hostnames"
},
"clusterIP": "10.0.1.175",
"type": "ClusterIP",
"sessionAffinity": "None"
},
"status": {
"loadBalancer": {}
}
}
- Is the Service port you are trying to access listed in spec.ports[]?
- Is the targetPort correct for your Pods (some Pods use a different port than the Service)?
- If you meant to use a numeric port, is it a number (9376) or a string "9376"?
- If you meant to use a named port, do your Pods expose a port with the same name?
- Is the port's protocol correct for your Pods?
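To pull just these fields out of the live object for comparison, one option is a jsonpath query, for example:
kubectl get service hostnames -o jsonpath='{.spec.ports[*].port} {.spec.ports[*].targetPort} {.spec.ports[*].protocol}'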
Does the Service have any Endpoints?
If you got this far, you have confirmed that your Service is correctly defined and is resolved by DNS. Now let's check that the Pods you ran are actually being selected by the Service.
Earlier you saw that the Pods were running. You can re-check that:
kubectl get pods -l app=hostnames
NAME READY STATUS RESTARTS AGE
hostnames-632524106-bbpiw 1/1 Running 0 1h
hostnames-632524106-ly40y 1/1 Running 0 1h
hostnames-632524106-tlaok 1/1 Running 0 1h
The -l app=hostnames
argument is a label selector configured on the Service.
The "AGE" column says that these Pods are about an hour old, which implies that they are running fine and not crashing.
The "RESTARTS" column says that these pods are not crashing frequently or being restarted. Frequent restarts could lead to intermittent connectivity issues. If the restart count is high, read more about how to debug pods.
Inside the Kubernetes system is a control loop which evaluates the selector of every Service and saves the results into a corresponding Endpoints object.
kubectl get endpoints hostnames
NAME ENDPOINTS
hostnames 10.244.0.5:9376,10.244.0.6:9376,10.244.0.7:9376
This confirms that the endpoints controller has found the correct Pods for
your Service. If the ENDPOINTS
column is <none>
, you should check that
the spec.selector
field of your Service actually selects for
metadata.labels
values on your Pods. A common mistake is to have a typo or
other error, such as the Service selecting for app=hostnames
, but the
Deployment specifying run=hostnames
, as in versions prior to 1.18, where the kubectl run command could also be used to create a Deployment.
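A quick way to compare the two sides is to print the Service's selector and the Pods' labels next to each other, for example:
kubectl get service hostnames -o jsonpath='{.spec.selector}'
kubectl get pods -l app=hostnames --show-labels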
Are the Pods working?
At this point, you know that your Service exists and has selected your Pods. At the beginning of this walk-through, you verified the Pods themselves. Let's check again that the Pods are actually working - you can bypass the Service mechanism and go straight to the Pods, as listed by the Endpoints above.
From within a Pod:
for ep in 10.244.0.5:9376 10.244.0.6:9376 10.244.0.7:9376; do
wget -qO- $ep
done
This should produce something like:
hostnames-632524106-bbpiw
hostnames-632524106-ly40y
hostnames-632524106-tlaok
You expect each Pod in the Endpoints list to return its own hostname. If this is not what happens (or whatever the correct behavior is for your own Pods), you should investigate what's happening there.
Is the kube-proxy working?
If you get here, your Service is running, has Endpoints, and your Pods are actually serving. At this point, the whole Service proxy mechanism is suspect. Let's confirm it, piece by piece.
The default implementation of Services, and the one used on most clusters, is kube-proxy. This is a program that runs on every node and configures one of a small set of mechanisms for providing the Service abstraction. If your cluster does not use kube-proxy, the following sections will not apply, and you will have to investigate whatever implementation of Services you are using.
Is kube-proxy running?
Confirm that kube-proxy
is running on your Nodes. Running directly on a
Node, you should get something like the below:
ps auxw | grep kube-proxy
root 4194 0.4 0.1 101864 17696 ? Sl Jul04 25:43 /usr/local/bin/kube-proxy --master=https://kubernetes-master --kubeconfig=/var/lib/kube-proxy/kubeconfig --v=2
Next, confirm that it is not failing something obvious, like contacting the
master. To do this, you'll have to look at the logs. Accessing the logs
depends on your Node OS. On some OSes it is a file, such as
/var/log/kube-proxy.log, while other OSes use journalctl
to access logs. You
should see something like:
I1027 22:14:53.995134 5063 server.go:200] Running in resource-only container "/kube-proxy"
I1027 22:14:53.998163 5063 server.go:247] Using iptables Proxier.
I1027 22:14:53.999055 5063 server.go:255] Tearing down userspace rules. Errors here are acceptable.
I1027 22:14:54.038140 5063 proxier.go:352] Setting endpoints for "kube-system/kube-dns:dns-tcp" to [10.244.1.3:53]
I1027 22:14:54.038164 5063 proxier.go:352] Setting endpoints for "kube-system/kube-dns:dns" to [10.244.1.3:53]
I1027 22:14:54.038209 5063 proxier.go:352] Setting endpoints for "default/kubernetes:https" to [10.240.0.2:443]
I1027 22:14:54.038238 5063 proxier.go:429] Not syncing iptables until Services and Endpoints have been received from master
I1027 22:14:54.040048 5063 proxier.go:294] Adding new service "default/kubernetes:https" at 10.0.0.1:443/TCP
I1027 22:14:54.040154 5063 proxier.go:294] Adding new service "kube-system/kube-dns:dns" at 10.0.0.10:53/UDP
I1027 22:14:54.040223 5063 proxier.go:294] Adding new service "kube-system/kube-dns:dns-tcp" at 10.0.0.10:53/TCP
If you see error messages about not being able to contact the master, you should double-check your Node configuration and installation steps.
One of the possible reasons that kube-proxy
cannot run correctly is that the
required conntrack
binary cannot be found. This may happen on some Linux
systems, depending on how you are installing the cluster, for example, you are
installing Kubernetes from scratch. If this is the case, you need to manually
install the conntrack
package (e.g. sudo apt install conntrack
on Ubuntu)
and then retry.
Kube-proxy can run in one of a few modes. In the log listed above, the
line Using iptables Proxier
indicates that kube-proxy is running in
"iptables" mode. The most common other mode is "ipvs". The older "userspace"
mode has largely been replaced by these.
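On many installations you can also ask kube-proxy directly which mode it is using: it serves a /proxyMode endpoint on its metrics port (commonly 10249). From the Node (a sketch; the port depends on your configuration):
curl localhost:10249/proxyMode
This should print the mode, for example iptables.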
Iptables mode
In "iptables" mode, you should see something like the following on a Node:
iptables-save | grep hostnames
-A KUBE-SEP-57KPRZ3JQVENLNBR -s 10.244.3.6/32 -m comment --comment "default/hostnames:" -j MARK --set-xmark 0x00004000/0x00004000
-A KUBE-SEP-57KPRZ3JQVENLNBR -p tcp -m comment --comment "default/hostnames:" -m tcp -j DNAT --to-destination 10.244.3.6:9376
-A KUBE-SEP-WNBA2IHDGP2BOBGZ -s 10.244.1.7/32 -m comment --comment "default/hostnames:" -j MARK --set-xmark 0x00004000/0x00004000
-A KUBE-SEP-WNBA2IHDGP2BOBGZ -p tcp -m comment --comment "default/hostnames:" -m tcp -j DNAT --to-destination 10.244.1.7:9376
-A KUBE-SEP-X3P2623AGDH6CDF3 -s 10.244.2.3/32 -m comment --comment "default/hostnames:" -j MARK --set-xmark 0x00004000/0x00004000
-A KUBE-SEP-X3P2623AGDH6CDF3 -p tcp -m comment --comment "default/hostnames:" -m tcp -j DNAT --to-destination 10.244.2.3:9376
-A KUBE-SERVICES -d 10.0.1.175/32 -p tcp -m comment --comment "default/hostnames: cluster IP" -m tcp --dport 80 -j KUBE-SVC-NWV5X2332I4OT4T3
-A KUBE-SVC-NWV5X2332I4OT4T3 -m comment --comment "default/hostnames:" -m statistic --mode random --probability 0.33332999982 -j KUBE-SEP-WNBA2IHDGP2BOBGZ
-A KUBE-SVC-NWV5X2332I4OT4T3 -m comment --comment "default/hostnames:" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-X3P2623AGDH6CDF3
-A KUBE-SVC-NWV5X2332I4OT4T3 -m comment --comment "default/hostnames:" -j KUBE-SEP-57KPRZ3JQVENLNBR
For each port of each Service, there should be 1 rule in KUBE-SERVICES
and
one KUBE-SVC-<hash>
chain. For each Pod endpoint, there should be a small
number of rules in that KUBE-SVC-<hash>
and one KUBE-SEP-<hash>
chain with
a small number of rules in it. The exact rules will vary based on your exact
config (including node-ports and load-balancers).
IPVS mode
In "ipvs" mode, you should see something like the following on a Node:
ipvsadm -ln
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
...
TCP 10.0.1.175:80 rr
-> 10.244.0.5:9376 Masq 1 0 0
-> 10.244.0.6:9376 Masq 1 0 0
-> 10.244.0.7:9376 Masq 1 0 0
...
For each port of each Service, plus any NodePorts, external IPs, and
load-balancer IPs, kube-proxy will create a virtual server. For each Pod
endpoint, it will create corresponding real servers. In this example, the service
hostnames (10.0.1.175:80) has 3 endpoints (10.244.0.5:9376, 10.244.0.6:9376, 10.244.0.7:9376).
Userspace mode
In rare cases, you may be using "userspace" mode. From your Node:
iptables-save | grep hostnames
-A KUBE-PORTALS-CONTAINER -d 10.0.1.175/32 -p tcp -m comment --comment "default/hostnames:default" -m tcp --dport 80 -j REDIRECT --to-ports 48577
-A KUBE-PORTALS-HOST -d 10.0.1.175/32 -p tcp -m comment --comment "default/hostnames:default" -m tcp --dport 80 -j DNAT --to-destination 10.240.115.247:48577
There should be 2 rules for each port of your Service (only one in this example) - a "KUBE-PORTALS-CONTAINER" and a "KUBE-PORTALS-HOST".
Almost nobody should be using the "userspace" mode any more, so you won't spend more time on it here.
Is kube-proxy proxying?
Assuming you do see one of the above cases, try again to access your Service by IP from one of your Nodes:
curl 10.0.1.175:80
hostnames-632524106-bbpiw
If this fails and you are using the userspace proxy, you can try accessing the proxy directly. If you are using the iptables proxy, skip this section.
Look back at the iptables-save
output above, and extract the
port number that kube-proxy
is using for your Service. In the above
examples it is "48577". Now connect to that:
curl localhost:48577
hostnames-632524106-tlaok
If this still fails, look at the kube-proxy
logs for specific lines like:
Setting endpoints for default/hostnames:default to [10.244.0.5:9376 10.244.0.6:9376 10.244.0.7:9376]
If you don't see those, try restarting kube-proxy
with the -v
flag set to 4, and
then look at the logs again.
Edge case: A Pod fails to reach itself via the Service IP
This might sound unlikely, but it does happen and it is supposed to work.
This can happen when the network is not properly configured for "hairpin"
traffic, usually when kube-proxy
is running in iptables
mode and Pods
are connected with a bridge network. The Kubelet
exposes a hairpin-mode
flag that allows endpoints of a Service to loadbalance
back to themselves if they try to access their own Service VIP. The
hairpin-mode
flag must either be set to hairpin-veth
or
promiscuous-bridge
.
The common steps to troubleshoot this are as follows:
- Confirm hairpin-mode is set to hairpin-veth or promiscuous-bridge. You should see something like the below. hairpin-mode is set to promiscuous-bridge in the following example.
ps auxw | grep kubelet
root 3392 1.1 0.8 186804 65208 ? Sl 00:51 11:11 /usr/local/bin/kubelet --enable-debugging-handlers=true --config=/etc/kubernetes/manifests --allow-privileged=True --v=4 --cluster-dns=10.0.0.10 --cluster-domain=cluster.local --configure-cbr0=true --cgroup-root=/ --system-cgroups=/system --hairpin-mode=promiscuous-bridge --runtime-cgroups=/docker-daemon --kubelet-cgroups=/kubelet --babysit-daemons=true --max-pods=110 --serialize-image-pulls=false --outofdisk-transition-frequency=0
- Confirm the effective hairpin-mode. To do this, you'll have to look at the kubelet log. Accessing the logs depends on your Node OS. On some OSes it is a file, such as /var/log/kubelet.log, while other OSes use journalctl to access logs. Note that the effective hairpin mode may not match the --hairpin-mode flag due to compatibility. Check whether there are any log lines with the keyword hairpin in kubelet.log. There should be log lines indicating the effective hairpin mode, like the example below.
I0629 00:51:43.648698 3252 kubelet.go:380] Hairpin mode set to "promiscuous-bridge"
- If the effective hairpin mode is hairpin-veth, ensure the Kubelet has permission to operate in /sys on the node. If everything works properly, you should see something like:
for intf in /sys/devices/virtual/net/cbr0/brif/*; do cat $intf/hairpin_mode; done
1
1
1
1
- If the effective hairpin mode is promiscuous-bridge, ensure the Kubelet has permission to manipulate the Linux bridge on the node. If the cbr0 bridge is used and configured properly, you should see:
ifconfig cbr0 | grep PROMISC
UP BROADCAST RUNNING PROMISC MULTICAST MTU:1460 Metric:1
- Seek help if none of the above works.
Seek help
If you get this far, something very strange is happening. Your Service is
running, has Endpoints, and your Pods are actually serving. You have DNS
working, and kube-proxy
does not seem to be misbehaving. And yet your
Service is not working. Please let us know what is going on, so we can help
investigate!
Contact us on Slack or Forum or GitHub.
What's next
Visit troubleshooting document for more information.
4.10.8 - Debugging Kubernetes nodes with crictl
Kubernetes v1.11 [stable]
crictl
is a command-line interface for CRI-compatible container runtimes.
You can use it to inspect and debug container runtimes and applications on a
Kubernetes node. crictl
and its source are hosted in the
cri-tools repository.
Before you begin
crictl
requires a Linux operating system with a CRI runtime.
Installing crictl
You can download a compressed archive crictl
from the cri-tools
release page, for several
different architectures. Download the version that corresponds to your version
of Kubernetes. Extract it and move it to a location on your system path, such as
/usr/local/bin/
.
General usage
The crictl
command has several subcommands and runtime flags. Use
crictl help
or crictl <subcommand> help
for more details.
You can set the endpoint for crictl
by doing one of the following:
- Set the --runtime-endpoint and --image-endpoint flags.
- Set the CONTAINER_RUNTIME_ENDPOINT and IMAGE_SERVICE_ENDPOINT environment variables.
- Set the endpoint in the configuration file /etc/crictl.yaml. To specify a different file, use the --config=PATH_TO_FILE flag when you run crictl.
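For example, assuming containerd is your runtime and its socket is at the usual path, either of the following would work:
crictl --runtime-endpoint unix:///var/run/containerd/containerd.sock ps
export CONTAINER_RUNTIME_ENDPOINT=unix:///var/run/containerd/containerd.sock
crictl ps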
Note: If you do not set an endpoint, crictl attempts to connect to a list of known endpoints, which might result in an impact to performance.
You can also specify timeout values when connecting to the server and enable or
disable debugging, by specifying timeout
or debug
values in the configuration
file or using the --timeout
and --debug
command-line flags.
To view or edit the current configuration, view or edit the contents of
/etc/crictl.yaml
. For example, the configuration when using the containerd
container runtime would be similar to this:
runtime-endpoint: unix:///var/run/containerd/containerd.sock
image-endpoint: unix:///var/run/containerd/containerd.sock
timeout: 10
debug: true
To learn more about crictl
, refer to the crictl
documentation.
Example crictl commands
The following examples show some crictl
commands and example output.
Note: If you use crictl to create pod sandboxes or containers on a running Kubernetes cluster, the Kubelet will eventually delete them. crictl is not a general-purpose workflow tool, but a tool that is useful for debugging.
List pods
List all pods:
crictl pods
The output is similar to this:
POD ID CREATED STATE NAME NAMESPACE ATTEMPT
926f1b5a1d33a About a minute ago Ready sh-84d7dcf559-4r2gq default 0
4dccb216c4adb About a minute ago Ready nginx-65899c769f-wv2gp default 0
a86316e96fa89 17 hours ago Ready kube-proxy-gblk4 kube-system 0
919630b8f81f1 17 hours ago Ready nvidia-device-plugin-zgbbv kube-system 0
List pods by name:
crictl pods --name nginx-65899c769f-wv2gp
The output is similar to this:
POD ID CREATED STATE NAME NAMESPACE ATTEMPT
4dccb216c4adb 2 minutes ago Ready nginx-65899c769f-wv2gp default 0
List pods by label:
crictl pods --label run=nginx
The output is similar to this:
POD ID CREATED STATE NAME NAMESPACE ATTEMPT
4dccb216c4adb 2 minutes ago Ready nginx-65899c769f-wv2gp default 0
List images
List all images:
crictl images
The output is similar to this:
IMAGE TAG IMAGE ID SIZE
busybox latest 8c811b4aec35f 1.15MB
k8s-gcrio.azureedge.net/hyperkube-amd64 v1.10.3 e179bbfe5d238 665MB
k8s-gcrio.azureedge.net/pause-amd64 3.1 da86e6ba6ca19 742kB
nginx latest cd5239a0906a6 109MB
List images by repository:
crictl images nginx
The output is similar to this:
IMAGE TAG IMAGE ID SIZE
nginx latest cd5239a0906a6 109MB
Only list image IDs:
crictl images -q
The output is similar to this:
sha256:8c811b4aec35f259572d0f79207bc0678df4c736eeec50bc9fec37ed936a472a
sha256:e179bbfe5d238de6069f3b03fccbecc3fb4f2019af741bfff1233c4d7b2970c5
sha256:da86e6ba6ca197bf6bc5e9d900febd906b133eaa4750e6bed647b0fbe50ed43e
sha256:cd5239a0906a6ccf0562354852fae04bc5b52d72a2aff9a871ddb6bd57553569
List containers
List all containers:
crictl ps -a
The output is similar to this:
CONTAINER ID IMAGE CREATED STATE NAME ATTEMPT
1f73f2d81bf98 busybox@sha256:141c253bc4c3fd0a201d32dc1f493bcf3fff003b6df416dea4f41046e0f37d47 7 minutes ago Running sh 1
9c5951df22c78 busybox@sha256:141c253bc4c3fd0a201d32dc1f493bcf3fff003b6df416dea4f41046e0f37d47 8 minutes ago Exited sh 0
87d3992f84f74 nginx@sha256:d0a8828cccb73397acb0073bf34f4d7d8aa315263f1e7806bf8c55d8ac139d5f 8 minutes ago Running nginx 0
1941fb4da154f k8s-gcrio.azureedge.net/hyperkube-amd64@sha256:00d814b1f7763f4ab5be80c58e98140dfc69df107f253d7fdd714b30a714260a 18 hours ago Running kube-proxy 0
List running containers:
crictl ps
The output is similar to this:
CONTAINER ID IMAGE CREATED STATE NAME ATTEMPT
1f73f2d81bf98 busybox@sha256:141c253bc4c3fd0a201d32dc1f493bcf3fff003b6df416dea4f41046e0f37d47 6 minutes ago Running sh 1
87d3992f84f74 nginx@sha256:d0a8828cccb73397acb0073bf34f4d7d8aa315263f1e7806bf8c55d8ac139d5f 7 minutes ago Running nginx 0
1941fb4da154f k8s-gcrio.azureedge.net/hyperkube-amd64@sha256:00d814b1f7763f4ab5be80c58e98140dfc69df107f253d7fdd714b30a714260a 17 hours ago Running kube-proxy 0
Execute a command in a running container
crictl exec -i -t 1f73f2d81bf98 ls
The output is similar to this:
bin dev etc home proc root sys tmp usr var
Get a container's logs
Get all container logs:
crictl logs 87d3992f84f74
The output is similar to this:
10.240.0.96 - - [06/Jun/2018:02:45:49 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.47.0" "-"
10.240.0.96 - - [06/Jun/2018:02:45:50 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.47.0" "-"
10.240.0.96 - - [06/Jun/2018:02:45:51 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.47.0" "-"
Get only the latest N
lines of logs:
crictl logs --tail=1 87d3992f84f74
The output is similar to this:
10.240.0.96 - - [06/Jun/2018:02:45:51 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.47.0" "-"
Run a pod sandbox
Using crictl
to run a pod sandbox is useful for debugging container runtimes.
On a running Kubernetes cluster, the sandbox will eventually be stopped and
deleted by the Kubelet.
-
Create a JSON file like the following:
{ "metadata": { "name": "nginx-sandbox", "namespace": "default", "attempt": 1, "uid": "hdishd83djaidwnduwk28bcsb" }, "logDirectory": "/tmp", "linux": { } }
-
Use the crictl runp command to apply the JSON and run the sandbox:
crictl runp pod-config.json
The ID of the sandbox is returned.
Create a container
Using crictl
to create a container is useful for debugging container runtimes.
On a running Kubernetes cluster, the sandbox will eventually be stopped and
deleted by the Kubelet.
-
Pull a busybox image
crictl pull busybox
Image is up to date for busybox@sha256:141c253bc4c3fd0a201d32dc1f493bcf3fff003b6df416dea4f41046e0f37d47
-
Create configs for the pod and the container:
Pod config:
{
  "metadata": {
    "name": "nginx-sandbox",
    "namespace": "default",
    "attempt": 1,
    "uid": "hdishd83djaidwnduwk28bcsb"
  },
  "log_directory": "/tmp",
  "linux": {}
}
Container config:
{
  "metadata": {
    "name": "busybox"
  },
  "image": {
    "image": "busybox"
  },
  "command": [
    "top"
  ],
  "log_path": "busybox.log",
  "linux": {}
}
-
Create the container, passing the ID of the previously-created pod, the container config file, and the pod config file. The ID of the container is returned.
crictl create f84dd361f8dc51518ed291fbadd6db537b0496536c1d2d6c05ff943ce8c9a54f container-config.json pod-config.json
-
List all containers and verify that the newly-created container has its state set to Created.
crictl ps -a
The output is similar to this:
CONTAINER ID IMAGE CREATED STATE NAME ATTEMPT
3e025dd50a72d busybox 32 seconds ago Created busybox 0
Start a container
To start a container, pass its ID to crictl start
:
crictl start 3e025dd50a72d956c4f14881fbb5b1080c9275674e95fb67f965f6478a957d60
The output is similar to this:
3e025dd50a72d956c4f14881fbb5b1080c9275674e95fb67f965f6478a957d60
Check the container has its state set to Running
.
crictl ps
The output is similar to this:
CONTAINER ID IMAGE CREATED STATE NAME ATTEMPT
3e025dd50a72d busybox About a minute ago Running busybox 0
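crictl also has stop and rm subcommands for containers, and stopp and rmp for pod sandboxes. A sketch of cleaning up after debugging (the IDs here are illustrative, taken from earlier example output on this page; substitute your own):
crictl stop 3e025dd50a72d
crictl rm 3e025dd50a72d
crictl stopp 926f1b5a1d33a
crictl rmp 926f1b5a1d33a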
4.10.9 - Determine the Reason for Pod Failure
This page shows how to write and read a Container termination message.
Termination messages provide a way for containers to write information about fatal events to a location where it can be easily retrieved and surfaced by tools like dashboards and monitoring software. In most cases, information that you put in a termination message should also be written to the general Kubernetes logs.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
To check the version, enter kubectl version.
.
Writing and reading a termination message
In this exercise, you create a Pod that runs one container. The configuration file specifies a command that runs when the container starts.
apiVersion: v1
kind: Pod
metadata:
name: termination-demo
spec:
containers:
- name: termination-demo-container
image: debian
command: ["/bin/sh"]
args: ["-c", "sleep 10 && echo Sleep expired > /dev/termination-log"]
-
Create a Pod based on the YAML configuration file:
kubectl apply -f https://k8s.io/examples/debug/termination.yaml
In the YAML file, in the
command
andargs
fields, you can see that the container sleeps for 10 seconds and then writes "Sleep expired" to the/dev/termination-log
file. After the container writes the "Sleep expired" message, it terminates. -
Display information about the Pod:
kubectl get pod termination-demo
Repeat the preceding command until the Pod is no longer running.
-
Display detailed information about the Pod:
kubectl get pod termination-demo --output=yaml
The output includes the "Sleep expired" message:
apiVersion: v1
kind: Pod
...
    lastState:
      terminated:
        containerID: ...
        exitCode: 0
        finishedAt: ...
        message: |
          Sleep expired
...
-
Use a Go template to filter the output so that it includes only the termination message:
kubectl get pod termination-demo -o go-template="{{range .status.containerStatuses}}{{.lastState.terminated.message}}{{end}}"
If you are running a multi-container pod, you can use a Go template to include the container's name. By doing so, you can discover which of the containers is failing:
kubectl get pod multi-container-pod -o go-template='{{range .status.containerStatuses}}{{printf "%s:\n%s\n\n" .name .lastState.terminated.message}}{{end}}'
Customizing the termination message
Kubernetes retrieves termination messages from the termination message file
specified in the terminationMessagePath
field of a Container, which has a default
value of /dev/termination-log
. By customizing this field, you can tell Kubernetes
to use a different file. Kubernetes uses the contents from the specified file to
populate the Container's status message on both success and failure.
The termination message is intended to be a brief final status, such as an assertion failure message.
The kubelet truncates messages that are longer than 4096 bytes. The total message length across all
containers will be limited to 12KiB. The default termination message path is /dev/termination-log
.
Note: You cannot set the termination message path after a Pod is launched.
In the following example, the container writes termination messages to
/tmp/my-log
for Kubernetes to retrieve:
apiVersion: v1
kind: Pod
metadata:
name: msg-path-demo
spec:
containers:
- name: msg-path-demo-container
image: debian
terminationMessagePath: "/tmp/my-log"
Moreover, users can set the terminationMessagePolicy
field of a Container for
further customization. This field defaults to "File
" which means the termination
messages are retrieved only from the termination message file. By setting the
terminationMessagePolicy
to "FallbackToLogsOnError
", you can tell Kubernetes
to use the last chunk of container log output if the termination message file
is empty and the container exited with an error. The log output is limited to
2048 bytes or 80 lines, whichever is smaller.
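For example, a minimal Pod sketch that opts into this fallback behavior (the names fallback-demo and fallback-demo-container are hypothetical, and the command simply exits with an error without writing a termination message file):
apiVersion: v1
kind: Pod
metadata:
  name: fallback-demo
spec:
  containers:
  - name: fallback-demo-container
    image: debian
    terminationMessagePolicy: FallbackToLogsOnError
    command: ["/bin/sh"]
    args: ["-c", "echo failing; exit 1"]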
What's next
- See the
terminationMessagePath
field in Container. - Learn about retrieving logs.
- Learn about Go templates.
4.10.10 - Developing and debugging services locally
Kubernetes applications usually consist of multiple, separate services, each running in its own container. Developing and debugging these services on a remote Kubernetes cluster can be cumbersome, requiring you to get a shell on a running container in order to run debugging tools.
telepresence
is a tool to ease the process of developing and debugging services locally while proxying the service to a remote Kubernetes cluster. Using telepresence
allows you to use custom tools, such as a debugger and IDE, for a local service and provides the service full access to ConfigMap, secrets, and the services running on the remote cluster.
This document describes using telepresence
to develop and debug services running on a remote cluster locally.
Before you begin
- Kubernetes cluster is installed
kubectl
is configured to communicate with the cluster- Telepresence is installed
Connecting your local machine to a remote Kubernetes cluster
After installing telepresence
, run telepresence connect
to launch its daemon and connect your local workstation to the cluster.
$ telepresence connect
Launching Telepresence Daemon
...
Connected to context default (https://<cluster public IP>)
You can curl services using the Kubernetes syntax e.g. curl -ik https://kubernetes.default
Developing or debugging an existing service
When developing an application on Kubernetes, you typically program or debug a single service. The service might require access to other services for testing and debugging. One option is to use the continuous deployment pipeline, but even the fastest deployment pipeline introduces a delay in the program or debug cycle.
Use the telepresence intercept $SERVICE_NAME --port $LOCAL_PORT:REMOTE_PORT
command to create an "intercept" for rerouting remote service traffic.
Where:
- $SERVICE_NAME is the name of your local service
- $LOCAL_PORT is the port that your service is running on on your local workstation
- $REMOTE_PORT is the port your service listens on in the cluster
Running this command tells Telepresence to send remote traffic to your local service instead of the service in the remote Kubernetes cluster. Make edits to your service source code locally, save, and see the corresponding changes when accessing your remote application take effect immediately. You can also run your local service using a debugger or any other local development tool.
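For example, assuming a Service named hostnames that listens on port 9376 in the cluster and a local copy of the service running on port 8080 (both values are illustrative):
telepresence intercept hostnames --port 8080:9376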
How does Telepresence work?
Telepresence installs a traffic-agent sidecar next to your existing application's container running in the remote cluster. It then captures all traffic requests going into the Pod, and instead of forwarding this to the application in the remote cluster, it routes all traffic (when you create a global intercept) or a subset of the traffic (when you create a personal intercept) to your local development environment.
What's next
If you're interested in a hands-on tutorial, check out this tutorial that walks through locally developing the Guestbook application on Google Kubernetes Engine.
For further reading, visit the Telepresence website.
4.10.11 - Get a Shell to a Running Container
This page shows how to use kubectl exec
to get a shell to a
running container.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
Getting a shell to a container
In this exercise, you create a Pod that has one container. The container runs the nginx image. Here is the configuration file for the Pod:
apiVersion: v1
kind: Pod
metadata:
name: shell-demo
spec:
volumes:
- name: shared-data
emptyDir: {}
containers:
- name: nginx
image: nginx
volumeMounts:
- name: shared-data
mountPath: /usr/share/nginx/html
hostNetwork: true
dnsPolicy: Default
Create the Pod:
kubectl apply -f https://k8s.io/examples/application/shell-demo.yaml
Verify that the container is running:
kubectl get pod shell-demo
Get a shell to the running container:
kubectl exec --stdin --tty shell-demo -- /bin/bash
Note: The double dash (--) separates the arguments you want to pass to the command from the kubectl arguments.
In your shell, list the root directory:
# Run this inside the container
ls /
In your shell, experiment with other commands. Here are some examples:
# You can run these example commands inside the container
ls /
cat /proc/mounts
cat /proc/1/maps
apt-get update
apt-get install -y tcpdump
tcpdump
apt-get install -y lsof
lsof
apt-get install -y procps
ps aux
ps aux | grep nginx
Writing the root page for nginx
Look again at the configuration file for your Pod. The Pod
has an emptyDir
volume, and the container mounts the volume
at /usr/share/nginx/html
.
In your shell, create an index.html
file in the /usr/share/nginx/html
directory:
# Run this inside the container
echo 'Hello shell demo' > /usr/share/nginx/html/index.html
In your shell, send a GET request to the nginx server:
# Run this in the shell inside your container
apt-get update
apt-get install curl
curl http://localhost/
The output shows the text that you wrote to the index.html
file:
Hello shell demo
When you are finished with your shell, enter exit
.
exit # To quit the shell in the container
Running individual commands in a container
In an ordinary command window, not your shell, list the environment variables in the running container:
kubectl exec shell-demo -- env
Experiment with running other commands. Here are some examples:
kubectl exec shell-demo -- ps aux
kubectl exec shell-demo -- ls /
kubectl exec shell-demo -- cat /proc/1/mounts
Opening a shell when a Pod has more than one container
If a Pod has more than one container, use --container
or -c
to
specify a container in the kubectl exec
command. For example,
suppose you have a Pod named my-pod, and the Pod has two containers
named main-app and helper-app. The following command would open a
shell to the main-app container.
kubectl exec -i -t my-pod --container main-app -- /bin/bash
Note: The short options -i and -t are the same as the long options --stdin and --tty.
What's next
- Read about kubectl exec
4.10.12 - Monitor Node Health
Node Problem Detector is a daemon for monitoring and reporting about a node's health.
You can run Node Problem Detector as a DaemonSet
or as a standalone daemon.
Node Problem Detector collects information about node problems from various daemons
and reports these conditions to the API server as NodeCondition
and Event.
To learn how to install and use Node Problem Detector, see Node Problem Detector project documentation.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
Limitations
-
Node Problem Detector only supports file-based kernel logs. Log tools such as journald are not supported.
Node Problem Detector uses the kernel log format for reporting kernel issues. To learn how to extend the kernel log format, see Add support for another log format.
Enabling Node Problem Detector
Some cloud providers enable Node Problem Detector as an Addon.
You can also enable Node Problem Detector with kubectl
or by creating an Addon pod.
Using kubectl to enable Node Problem Detector
kubectl
provides the most flexible management of Node Problem Detector.
You can overwrite the default configuration to fit it into your environment or
to detect customized node problems. For example:
-
Create a Node Problem Detector configuration similar to node-problem-detector.yaml:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-problem-detector-v0.1
  namespace: kube-system
  labels:
    k8s-app: node-problem-detector
    version: v0.1
    kubernetes.io/cluster-service: "true"
spec:
  selector:
    matchLabels:
      k8s-app: node-problem-detector
      version: v0.1
      kubernetes.io/cluster-service: "true"
  template:
    metadata:
      labels:
        k8s-app: node-problem-detector
        version: v0.1
        kubernetes.io/cluster-service: "true"
    spec:
      hostNetwork: true
      containers:
      - name: node-problem-detector
        image: k8s.gcr.io/node-problem-detector:v0.1
        securityContext:
          privileged: true
        resources:
          limits:
            cpu: "200m"
            memory: "100Mi"
          requests:
            cpu: "20m"
            memory: "20Mi"
        volumeMounts:
        - name: log
          mountPath: /log
          readOnly: true
      volumes:
      - name: log
        hostPath:
          path: /var/log/
Note: You should verify that the system log directory is right for your operating system distribution. -
Start node problem detector with
kubectl
:kubectl apply -f https://k8s.io/examples/debug/node-problem-detector.yaml
Using an Addon pod to enable Node Problem Detector
If you are using a custom cluster bootstrap solution and don't need to overwrite the default configuration, you can leverage the Addon pod to further automate the deployment.
Create node-problem-detector.yaml
, and save the configuration in the Addon pod's
directory /etc/kubernetes/addons/node-problem-detector
on a control plane node.
Overwrite the configuration
The default configuration is embedded when building the Docker image of Node Problem Detector.
However, you can use a ConfigMap
to overwrite the configuration:
-
Change the configuration files in
config/
-
Create the
ConfigMap
node-problem-detector-config
:kubectl create configmap node-problem-detector-config --from-file=config/
-
Change the node-problem-detector.yaml to use the ConfigMap:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-problem-detector-v0.1
  namespace: kube-system
  labels:
    k8s-app: node-problem-detector
    version: v0.1
    kubernetes.io/cluster-service: "true"
spec:
  selector:
    matchLabels:
      k8s-app: node-problem-detector
      version: v0.1
      kubernetes.io/cluster-service: "true"
  template:
    metadata:
      labels:
        k8s-app: node-problem-detector
        version: v0.1
        kubernetes.io/cluster-service: "true"
    spec:
      hostNetwork: true
      containers:
      - name: node-problem-detector
        image: k8s.gcr.io/node-problem-detector:v0.1
        securityContext:
          privileged: true
        resources:
          limits:
            cpu: "200m"
            memory: "100Mi"
          requests:
            cpu: "20m"
            memory: "20Mi"
        volumeMounts:
        - name: log
          mountPath: /log
          readOnly: true
        - name: config # Overwrite the config/ directory with ConfigMap volume
          mountPath: /config
          readOnly: true
      volumes:
      - name: log
        hostPath:
          path: /var/log/
      - name: config # Define ConfigMap volume
        configMap:
          name: node-problem-detector-config
-
Recreate the Node Problem Detector with the new configuration file:
# If you have a node-problem-detector running, delete before recreating
kubectl delete -f https://k8s.io/examples/debug/node-problem-detector.yaml
kubectl apply -f https://k8s.io/examples/debug/node-problem-detector-configmap.yaml
Note: This approach only applies to a Node Problem Detector started with kubectl.
Overwriting a configuration is not supported if a Node Problem Detector runs as a cluster Addon.
The Addon manager does not support ConfigMap
.
Kernel Monitor
Kernel Monitor is a system log monitor daemon supported in the Node Problem Detector. Kernel monitor watches the kernel log and detects known kernel issues following predefined rules.
The Kernel Monitor matches kernel issues according to a predefined rule list in
config/kernel-monitor.json
. The rule list is extensible. You can expand the rule list by overwriting the
configuration.
Add new NodeConditions
To support a new NodeCondition
, create a condition definition within the conditions
field in
config/kernel-monitor.json
, for example:
{
"type": "NodeConditionType",
"reason": "CamelCaseDefaultNodeConditionReason",
"message": "arbitrary default node condition message"
}
Detect new problems
To detect new problems, you can extend the rules
field in config/kernel-monitor.json
with a new rule definition:
{
"type": "temporary/permanent",
"condition": "NodeConditionOfPermanentIssue",
"reason": "CamelCaseShortReason",
"message": "regexp matching the issue in the kernel log"
}
Configure path for the kernel log device
Check your kernel log path location in your operating system (OS) distribution.
The Linux kernel log device is usually presented as /dev/kmsg
. However, the log path location varies by OS distribution.
The log
field in config/kernel-monitor.json
represents the log path inside the container.
You can configure the log
field to match the device path as seen by the Node Problem Detector.
Add support for another log format
Kernel monitor uses the
Translator
plugin to translate the internal data structure of the kernel log.
You can implement a new translator for a new log format.
Recommendations and restrictions
It is recommended to run the Node Problem Detector in your cluster to monitor node health. When running the Node Problem Detector, you can expect extra resource overhead on each node. Usually this is fine, because:
- The kernel log grows relatively slowly.
- A resource limit is set for the Node Problem Detector.
- Even under high load, the resource usage is acceptable. For more information, see the Node Problem Detector benchmark result.
4.10.13 - Resource metrics pipeline
For Kubernetes, the Metrics API offers a basic set of metrics to support automatic scaling and similar use cases. This API makes information available about resource usage for node and pod, including metrics for CPU and memory. If you deploy the Metrics API into your cluster, clients of the Kubernetes API can then query for this information, and you can use Kubernetes' access control mechanisms to manage permissions to do so.
The HorizontalPodAutoscaler (HPA) and VerticalPodAutoscaler (VPA) use data from the metrics API to adjust workload replicas and resources to meet customer demand.
You can also view the resource metrics using the
kubectl top
command.
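For example (these commands require the Metrics API, such as metrics-server, to be available in the cluster):
kubectl top node
kubectl top pod --namespace=kube-system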
Figure 1 illustrates the architecture of the resource metrics pipeline.
Figure 1. Resource Metrics Pipeline
The architecture components, from right to left in the figure, consist of the following:
-
cAdvisor: Daemon for collecting, aggregating and exposing container metrics included in Kubelet.
-
kubelet: Node agent for managing container resources. Resource metrics are accessible using the
/metrics/resource
and/stats
kubelet API endpoints. -
Summary API: API provided by the kubelet for discovering and retrieving per-node summarized stats available through the
/stats
endpoint. -
metrics-server: Cluster addon component that collects and aggregates resource metrics pulled from each kubelet. The API server serves Metrics API for use by HPA, VPA, and by the
kubectl top
command. Metrics Server is a reference implementation of the Metrics API. -
Metrics API: Kubernetes API supporting access to CPU and memory used for workload autoscaling. To make this work in your cluster, you need an API extension server that provides the Metrics API.
Note: cAdvisor supports reading metrics from cgroups, which works with typical container runtimes on Linux. If you use a container runtime that uses another resource isolation mechanism, for example virtualization, then that container runtime must support CRI Container Metrics in order for metrics to be available to the kubelet.
Metrics API
The metrics-server implements the Metrics API. This API allows you to access CPU and memory usage for the nodes and pods in your cluster. Its primary role is to feed resource usage metrics to K8s autoscaler components.
Here is an example of the Metrics API request for a minikube
node piped through jq
for easier
reading:
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes/minikube" | jq '.'
Here is the same API call using curl
:
curl http://localhost:8080/apis/metrics.k8s.io/v1beta1/nodes/minikube
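The curl form assumes an API endpoint reachable on localhost:8080; if your cluster does not expose one, you can open a local proxy first (a sketch, and the same applies to the later curl examples):
kubectl proxy --port=8080 &
curl http://localhost:8080/apis/metrics.k8s.io/v1beta1/nodes/minikube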
Sample response:
{
"kind": "NodeMetrics",
"apiVersion": "metrics.k8s.io/v1beta1",
"metadata": {
"name": "minikube",
"selfLink": "/apis/metrics.k8s.io/v1beta1/nodes/minikube",
"creationTimestamp": "2022-01-27T18:48:43Z"
},
"timestamp": "2022-01-27T18:48:33Z",
"window": "30s",
"usage": {
"cpu": "487558164n",
"memory": "732212Ki"
}
}
Here is an example of the Metrics API request for a kube-scheduler-minikube
pod contained in the
kube-system
namespace and piped through jq
for easier reading:
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/kube-system/pods/kube-scheduler-minikube" | jq '.'
Here is the same API call using curl
:
curl http://localhost:8080/apis/metrics.k8s.io/v1beta1/namespaces/kube-system/pods/kube-scheduler-minikube
Sample response:
{
"kind": "PodMetrics",
"apiVersion": "metrics.k8s.io/v1beta1",
"metadata": {
"name": "kube-scheduler-minikube",
"namespace": "kube-system",
"selfLink": "/apis/metrics.k8s.io/v1beta1/namespaces/kube-system/pods/kube-scheduler-minikube",
"creationTimestamp": "2022-01-27T19:25:00Z"
},
"timestamp": "2022-01-27T19:24:31Z",
"window": "30s",
"containers": [
{
"name": "kube-scheduler",
"usage": {
"cpu": "9559630n",
"memory": "22244Ki"
}
}
]
}
The Metrics API is defined in the k8s.io/metrics
repository. You must enable the API aggregation layer
and register an APIService
for the metrics.k8s.io
API.
To learn more about the Metrics API, see resource metrics API design, the metrics-server repository and the resource metrics API.
Measuring resource usage
CPU
CPU is reported as the average core usage measured in cpu units. One cpu, in Kubernetes, is equivalent to 1 vCPU/Core for cloud providers, and 1 hyper-thread on bare-metal Intel processors.
This value is derived by taking a rate over a cumulative CPU counter provided by the kernel (in both Linux and Windows kernels). The time window used to calculate CPU is shown under the window field in the Metrics API.
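For example, if the cumulative counter read 10.0 CPU-seconds at the start of a 30-second window and 12.5 CPU-seconds at the end, the reported usage is (12.5 - 10.0) / 30 ≈ 0.083 cpu, which kubectl top would display as roughly 83m.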
To learn more about how Kubernetes allocates and measures CPU resources, see meaning of CPU.
Memory
Memory is reported as the working set, measured in bytes, at the instant the metric was collected.
In an ideal world, the "working set" is the amount of memory in-use that cannot be freed under memory pressure. However, calculation of the working set varies by host OS, and generally makes heavy use of heuristics to produce an estimate.
The Kubernetes model for a container's working set expects that the container runtime counts anonymous memory associated with the container in question. The working set metric typically also includes some cached (file-backed) memory, because the host OS cannot always reclaim pages.
To learn more about how Kubernetes allocates and measures memory resources, see meaning of memory.
Metrics Server
The metrics-server fetches resource metrics from the kubelets and exposes them in the Kubernetes
API server through the Metrics API for use by the HPA and VPA. You can also view these metrics
using the kubectl top
command.
The metrics-server uses the Kubernetes API to track nodes and pods in your cluster. The metrics-server queries each node over HTTP to fetch metrics. The metrics-server also builds an internal view of pod metadata, and keeps a cache of pod health. That cached pod health information is available via the extension API that the metrics-server makes available.
For example with an HPA query, the metrics-server needs to identify which pods fulfill the label selectors in the deployment.
The metrics-server calls the kubelet API to collect metrics from each node. Depending on the metrics-server version it uses:
- Metrics resource endpoint
/metrics/resource
in version v0.6.0+ or - Summary API endpoint
/stats/summary
in older versions
To learn more about the metrics-server, see the metrics-server repository.
You can also check out the following:
- metrics-server design
- metrics-server FAQ
- metrics-server known issues
- metrics-server releases
- Horizontal Pod Autoscaling
Summary API source
The kubelet gathers stats at the node, volume, pod and container level, and emits this information in the Summary API for consumers to read.
Here is an example of a Summary API request for a minikube
node:
kubectl get --raw "/api/v1/nodes/minikube/proxy/stats/summary"
Here is the same API call using curl
:
curl http://localhost:8080/api/v1/nodes/minikube/proxy/stats/summary
Note: The /stats/summary endpoint will be replaced by the /metrics/resource endpoint beginning with metrics-server 0.6.x.
4.10.14 - Tools for Monitoring Resources
To scale an application and provide a reliable service, you need to understand how the application behaves when it is deployed. You can examine application performance in a Kubernetes cluster by examining the containers, pods, services, and the characteristics of the overall cluster. Kubernetes provides detailed information about an application's resource usage at each of these levels. This information allows you to evaluate your application's performance and where bottlenecks can be removed to improve overall performance.
In Kubernetes, application monitoring does not depend on a single monitoring solution. On new clusters, you can use resource metrics or full metrics pipelines to collect monitoring statistics.
Resource metrics pipeline
The resource metrics pipeline provides a limited set of metrics related to
cluster components such as the
Horizontal Pod Autoscaler
controller, as well as the kubectl top
utility.
These metrics are collected by the lightweight, short-term, in-memory
metrics-server and
are exposed via the metrics.k8s.io
API.
metrics-server discovers all nodes on the cluster and
queries each node's
kubelet for CPU and
memory usage. The kubelet acts as a bridge between the Kubernetes master and
the nodes, managing the pods and containers running on a machine. The kubelet
translates each pod into its constituent containers and fetches individual
container usage statistics from the container runtime through the container
runtime interface. The kubelet fetches this information from the integrated
cAdvisor for the legacy Docker integration. It then exposes the aggregated pod
resource usage statistics through the metrics-server Resource Metrics API.
This API is served at /metrics/resource/v1beta1
on the kubelet's authenticated and
read-only ports.
Full metrics pipeline
A full metrics pipeline gives you access to richer metrics. Kubernetes can
respond to these metrics by automatically scaling or adapting the cluster
based on its current state, using mechanisms such as the Horizontal Pod
Autoscaler. The monitoring pipeline fetches metrics from the kubelet and
then exposes them to Kubernetes via an adapter by implementing either the
custom.metrics.k8s.io
or external.metrics.k8s.io
API.
Prometheus, a CNCF project, can natively monitor Kubernetes, nodes, and Prometheus itself. Full metrics pipeline projects that are not part of the CNCF are outside the scope of Kubernetes documentation.
4.10.15 - Troubleshoot Applications
This guide is to help users debug applications that are deployed into Kubernetes and not behaving correctly. This is not a guide for people who want to debug their cluster. For that you should check out this guide.
Diagnosing the problem
The first step in troubleshooting is triage. What is the problem? Is it your Pods, your Replication Controller or your Service?
Debugging Pods
The first step in debugging a Pod is taking a look at it. Check the current state of the Pod and recent events with the following command:
kubectl describe pods ${POD_NAME}
Look at the state of the containers in the pod. Are they all Running
? Have there been recent restarts?
Continue debugging depending on the state of the pods.
My pod stays pending
If a Pod is stuck in Pending
it means that it cannot be scheduled onto a node. Generally this is because
there are insufficient resources of one type or another that prevent scheduling. Look at the output of the
kubectl describe ...
command above. There should be messages from the scheduler about why it cannot schedule
your pod. Reasons include:
-
You don't have enough resources: You may have exhausted the supply of CPU or Memory in your cluster; in this case you need to delete Pods, adjust resource requests, or add new nodes to your cluster. See the Compute Resources document for more information.
-
You are using
hostPort
: When you bind a Pod to ahostPort
there are a limited number of places that pod can be scheduled. In most cases,hostPort
is unnecessary, try using a Service object to expose your Pod. If you do requirehostPort
then you can only schedule as many Pods as there are nodes in your Kubernetes cluster.
My pod stays waiting
If a Pod is stuck in the Waiting
state, then it has been scheduled to a worker node, but it can't run on that machine.
Again, the information from kubectl describe ...
should be informative. The most common cause of Waiting
pods is a failure to pull the image. There are three things to check:
- Make sure that you have the name of the image correct.
- Have you pushed the image to the registry?
- Try to manually pull the image to see if the image can be pulled. For example,
if you use Docker on your PC, run
docker pull <image>
.
My pod is crashing or otherwise unhealthy
Once your pod has been scheduled, the methods described in Debug Running Pods are available for debugging.
My pod is running but not doing what I told it to do
If your pod is not behaving as you expected, it may be that there was an error in your
pod description (e.g. mypod.yaml
file on your local machine), and that the error
was silently ignored when you created the pod. Often a section of the pod description
is nested incorrectly, or a key name is typed incorrectly, and so the key is ignored.
For example, if you misspelled command
as commnd
then the pod will be created but
will not use the command line you intended it to use.
The first thing to do is to delete your pod and try creating it again with the --validate
option.
For example, run kubectl apply --validate -f mypod.yaml
.
If you misspelled command
as commnd
then you will see an error like this:
I0805 10:43:25.129850 46757 schema.go:126] unknown field: commnd
I0805 10:43:25.129973 46757 schema.go:129] this may be a false alarm, see https://github.com/kubernetes/kubernetes/issues/6842
pods/mypod
The next thing to check is whether the pod on the apiserver
matches the pod you meant to create (e.g. in a yaml file on your local machine).
For example, run kubectl get pods/mypod -o yaml > mypod-on-apiserver.yaml
and then
manually compare the original pod description, mypod.yaml
with the one you got
back from apiserver, mypod-on-apiserver.yaml
. There will typically be some
lines on the "apiserver" version that are not on the original version. This is
expected. However, if there are lines on the original that are not on the apiserver
version, then this may indicate a problem with your pod spec.
Debugging Replication Controllers
Replication controllers are fairly straightforward. They can either create Pods or they can't. If they can't create pods, then please refer to the instructions above to debug your pods.
You can also use kubectl describe rc ${CONTROLLER_NAME}
to introspect events related to the replication
controller.
Debugging Services
Services provide load balancing across a set of pods. There are several common problems that can make Services not work properly. The following instructions should help debug Service problems.
First, verify that there are endpoints for the service. For every Service object, the apiserver makes an endpoints
resource available.
You can view this resource with:
kubectl get endpoints ${SERVICE_NAME}
Make sure that the endpoints match up with the number of pods that you expect to be members of your service. For example, if your Service is for an nginx container with 3 replicas, you would expect to see three different IP addresses in the Service's endpoints.
My service is missing endpoints
If you are missing endpoints, try listing pods using the labels that Service uses. Imagine that you have a Service where the labels are:
...
spec:
  selector:
    name: nginx
    type: frontend
You can use:
kubectl get pods --selector=name=nginx,type=frontend
to list pods that match this selector. Verify that the list matches the Pods that you expect to provide your Service.
Verify that the pod's containerPort matches up with the Service's targetPort.
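For example, a Service whose targetPort is 8080 only forwards traffic if the pods it selects actually expose port 8080. A sketch reusing the selector from above (the names and ports are assumptions):
apiVersion: v1
kind: Service
metadata:
  name: nginx-service
spec:
  selector:
    name: nginx
    type: frontend
  ports:
  - port: 80          # port clients connect to on the Service
    targetPort: 8080  # must match the containerPort below
---
# excerpt of the matching Pod (or pod template) spec:
spec:
  containers:
  - name: nginx
    image: nginx
    ports:
    - containerPort: 8080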
Network traffic is not forwarded
Please see debugging service for more information.
What's next
If none of the above solves your problem, follow the instructions in
Debugging Service document
to make sure that your Service
is running, has Endpoints
, and your Pods
are
actually serving; you have DNS working, iptables rules installed, and kube-proxy
does not seem to be misbehaving.
You may also visit the troubleshooting document for more information.
4.10.16 - Troubleshoot Clusters
This doc is about cluster troubleshooting; we assume you have already ruled out your application as the root cause of the problem you are experiencing. See the application troubleshooting guide for tips on application debugging. You may also visit the troubleshooting document for more information.
Listing your cluster
The first thing to debug in your cluster is whether your nodes are all registered correctly.
Run
kubectl get nodes
and verify that all of the nodes you expect to see are present and that they are all in the Ready
state.
To get detailed information about the overall health of your cluster, you can run:
kubectl cluster-info dump
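If the dump is too large to read in a terminal, you can write it to a set of files instead. For example:
kubectl cluster-info dump --output-directory=/tmp/cluster-state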
Looking at logs
For now, digging deeper into the cluster requires logging into the relevant machines. Here are the locations
of the relevant log files. (Note that on systemd-based systems, you may need to use
journalctl instead.)
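For example, on a systemd-based node you would query the journal for the kubelet unit:
journalctl -u kubelet
journalctl -u kube-proxy   # only if kube-proxy runs as a systemd unit; it often runs as a pod instead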
Master
- /var/log/kube-apiserver.log - API Server, responsible for serving the API
- /var/log/kube-scheduler.log - Scheduler, responsible for making scheduling decisions
- /var/log/kube-controller-manager.log - Controller that manages replication controllers
Worker Nodes
- /var/log/kubelet.log - Kubelet, responsible for running containers on the node
- /var/log/kube-proxy.log - Kube Proxy, responsible for service load balancing
A general overview of cluster failure modes
This is an incomplete list of things that could go wrong, and how to adjust your cluster setup to mitigate the problems.
Root causes:
- VM(s) shutdown
- Network partition within cluster, or between cluster and users
- Crashes in Kubernetes software
- Data loss or unavailability of persistent storage (e.g. GCE PD or AWS EBS volume)
- Operator error, for example misconfigured Kubernetes software or application software
Specific scenarios:
- Apiserver VM shutdown or apiserver crashing
  - Results
    - unable to stop, update, or start new pods, services, replication controllers
    - existing pods and services should continue to work normally, unless they depend on the Kubernetes API
- Apiserver backing storage lost
  - Results
    - apiserver should fail to come up
    - kubelets will not be able to reach it but will continue to run the same pods and provide the same service proxying
    - manual recovery or recreation of apiserver state necessary before apiserver is restarted
- Supporting services (node controller, replication controller manager, scheduler, etc) VM shutdown or crashes
  - currently those are colocated with the apiserver, and their unavailability has similar consequences as apiserver
  - in future, these will be replicated as well and may not be co-located
  - they do not have their own persistent state
- Individual node (VM or physical machine) shuts down
  - Results
    - pods on that Node stop running
- Network partition
  - Results
    - partition A thinks the nodes in partition B are down; partition B thinks the apiserver is down. (Assuming the master VM ends up in partition A.)
- Kubelet software fault
  - Results
    - crashing kubelet cannot start new pods on the node
    - kubelet might delete the pods or not
    - node marked unhealthy
    - replication controllers start new pods elsewhere
- Cluster operator error
  - Results
    - loss of pods, services, etc
    - loss of apiserver backing store
    - users unable to read API
    - etc.
Mitigations:
- Action: Use IaaS provider's automatic VM restarting feature for IaaS VMs
  - Mitigates: Apiserver VM shutdown or apiserver crashing
  - Mitigates: Supporting services VM shutdown or crashes
- Action: Use IaaS provider's reliable storage (e.g. GCE PD or AWS EBS volume) for VMs with apiserver+etcd
  - Mitigates: Apiserver backing storage lost
- Action: Use high-availability configuration
  - Mitigates: Control plane node shutdown or control plane components (scheduler, API server, controller-manager) crashing
    - Will tolerate one or more simultaneous node or component failures
  - Mitigates: API server backing storage (i.e., etcd's data directory) lost
    - Assumes HA (highly-available) etcd configuration
- Action: Snapshot apiserver PDs/EBS-volumes periodically
  - Mitigates: Apiserver backing storage lost
  - Mitigates: Some cases of operator error
  - Mitigates: Some cases of Kubernetes software fault
- Action: Use replication controllers and services in front of pods (see the sketch after this list)
  - Mitigates: Node shutdown
  - Mitigates: Kubelet software fault
- Action: Design applications (containers) to tolerate unexpected restarts
  - Mitigates: Node shutdown
  - Mitigates: Kubelet software fault
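As a sketch of the last two mitigations: a Deployment (the modern successor to a bare replication controller) with several replicas behind a Service keeps an application reachable across a node or kubelet failure. All names, ports, and the image below are assumptions:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3                  # survives the loss of a single node
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: registry.example.com/my-app:v1  # hypothetical image
        ports:
        - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  selector:
    app: my-app
  ports:
  - port: 80
    targetPort: 8080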
4.10.17 - Troubleshooting
Sometimes things go wrong. This guide is aimed at making them right. It has two sections:
- Troubleshooting your application - Useful for users who are deploying code into Kubernetes and wondering why it is not working.
- Troubleshooting your cluster - Useful for cluster administrators and people whose Kubernetes cluster is unhappy.
You should also check the known issues for the release you're using.
Getting help
If your problem isn't answered by any of the guides above, there are a variety of ways for you to get help from the Kubernetes community.
Questions
The documentation on this site has been structured to provide answers to a wide
range of questions. Concepts explain the Kubernetes
architecture and how each component works, while Setup provides
practical instructions for getting started. Tasks show how to
accomplish commonly used tasks, and Tutorials are more
comprehensive walkthroughs of real-world, industry-specific, or end-to-end
development scenarios. The Reference section provides
detailed documentation on the Kubernetes API
and command-line interfaces (CLIs), such as kubectl
.
Help! My question isn't covered! I need help now!
Stack Overflow
Someone else from the community may have already asked a similar question or may be able to help with your problem. The Kubernetes team will also monitor posts tagged Kubernetes. If there aren't any existing questions that help, please ensure that your question is on-topic on Stack Overflow and that you read through the guidance on how to ask a new question, before asking a new one!
Slack
Many people from the Kubernetes community hang out on Kubernetes Slack in the #kubernetes-users
channel.
Slack requires registration; you can request an invitation,
and registration is open to everyone. Feel free to come and ask any and all questions.
Once registered, access the Kubernetes organisation in Slack
via your web browser or via Slack's own dedicated app.
Once you are registered, browse the growing list of channels for various subjects of
interest. For example, people new to Kubernetes may also want to join the
#kubernetes-novice
channel. As another example, developers should join the
#kubernetes-dev
channel.
There are also many country specific / local language channels. Feel free to join these channels for localized support and info:
Country | Channels |
---|---|
China | #cn-users , #cn-events |
Finland | #fi-users |
France | #fr-users , #fr-events |
Germany | #de-users , #de-events |
India | #in-users , #in-events |
Italy | #it-users , #it-events |
Japan | #jp-users , #jp-events |
Korea | #kr-users |
Netherlands | #nl-users |
Norway | #norw-users |
Poland | #pl-users |
Russia | #ru-users |
Spain | #es-users |
Sweden | #se-users |
Turkey | #tr-users , #tr-events |
Forum
You're welcome to join the official Kubernetes Forum: discuss.kubernetes.io.
Bugs and feature requests
If you have what looks like a bug, or you would like to make a feature request, please use the GitHub issue tracking system.
Before you file an issue, please search existing issues to see if your issue is already covered.
If filing a bug, please include detailed information about how to reproduce the problem, such as:
- Kubernetes version:
kubectl version
- Cloud provider, OS distro, network configuration, and container runtime version
- Steps to reproduce the problem
4.11 - Extend Kubernetes
4.11.1 - Configure the Aggregation Layer
Configuring the aggregation layer allows the Kubernetes apiserver to be extended with additional APIs, which are not part of the core Kubernetes APIs.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
To check the version, enter kubectl version.
Authentication Flow
Unlike Custom Resource Definitions (CRDs), the Aggregation API involves another server - your Extension apiserver - in addition to the standard Kubernetes apiserver. The Kubernetes apiserver will need to communicate with your extension apiserver, and your extension apiserver will need to communicate with the Kubernetes apiserver. In order for this communication to be secured, the Kubernetes apiserver uses x509 certificates to authenticate itself to the extension apiserver.
This section describes how the authentication and authorization flows work, and how to configure them.
The high-level flow is as follows:
- Kubernetes apiserver: authenticate the requesting user and authorize their rights to the requested API path.
- Kubernetes apiserver: proxy the request to the extension apiserver
- Extension apiserver: authenticate the request from the Kubernetes apiserver
- Extension apiserver: authorize the request from the original user
- Extension apiserver: execute
The rest of this section describes these steps in detail.
The flow can be seen in the following diagram.
The source for the above swimlanes can be found in the source of this document.
Kubernetes Apiserver Authentication and Authorization
A request to an API path that is served by an extension apiserver begins the same way as all API requests: communication to the Kubernetes apiserver. This path already has been registered with the Kubernetes apiserver by the extension apiserver.
The user communicates with the Kubernetes apiserver, requesting access to the path. The Kubernetes apiserver uses standard authentication and authorization configured with the Kubernetes apiserver to authenticate the user and authorize access to the specific path.
For an overview of authenticating to a Kubernetes cluster, see "Authenticating to a Cluster". For an overview of authorization of access to Kubernetes cluster resources, see "Authorization Overview".
Everything to this point has been standard Kubernetes API requests, authentication and authorization.
The Kubernetes apiserver now is prepared to send the request to the extension apiserver.
Kubernetes Apiserver Proxies the Request
The Kubernetes apiserver now will send, or proxy, the request to the extension apiserver that registered to handle the request. In order to do so, it needs to know several things:
- How should the Kubernetes apiserver authenticate to the extension apiserver, informing the extension apiserver that the request, which comes over the network, is coming from a valid Kubernetes apiserver?
- How should the Kubernetes apiserver inform the extension apiserver of the username and group for which the original request was authenticated?
In order to provide for these two, you must configure the Kubernetes apiserver using several flags.
Kubernetes Apiserver Client Authentication
The Kubernetes apiserver connects to the extension apiserver over TLS, authenticating itself using a client certificate. You must provide the following to the Kubernetes apiserver upon startup, using the provided flags:
- private key file via
--proxy-client-key-file
- signed client certificate file via
--proxy-client-cert-file
- certificate of the CA that signed the client certificate file via
--requestheader-client-ca-file
- valid Common Name values (CNs) in the signed client certificate via
--requestheader-allowed-names
The Kubernetes apiserver will use the files indicated by --proxy-client-*-file
to authenticate to the extension apiserver. In order for the request to be considered valid by a compliant extension apiserver, the following conditions must be met:
- The connection must be made using a client certificate that is signed by the CA whose certificate is in
--requestheader-client-ca-file
. - The connection must be made using a client certificate whose CN is one of those listed in
--requestheader-allowed-names
.
--requestheader-allowed-names=""
. This will indicate to an extension apiserver that any CN is acceptable.
When started with these options, the Kubernetes apiserver will:
- Use them to authenticate to the extension apiserver.
- Create a configmap in the kube-system namespace called extension-apiserver-authentication, in which it will place the CA certificate and the allowed CNs. These in turn can be retrieved by extension apiservers to validate requests.
Note that the same client certificate is used by the Kubernetes apiserver to authenticate against all extension apiservers. It does not create a client certificate per extension apiserver, but rather a single one to authenticate as the Kubernetes apiserver. This same one is reused for all extension apiserver requests.
Original Request Username and Group
When the Kubernetes apiserver proxies the request to the extension apiserver, it informs the extension apiserver of the username and group with which the original request successfully authenticated. It provides these in http headers of its proxied request. You must inform the Kubernetes apiserver of the names of the headers to be used.
- the header in which to store the username via
--requestheader-username-headers
- the header in which to store the group via
--requestheader-group-headers
- the prefix to append to all extra headers via
--requestheader-extra-headers-prefix
These header names are also placed in the extension-apiserver-authentication
configmap, so they can be retrieved and used by extension apiservers.
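You can inspect that configmap directly to see the CA certificate, allowed names, and header names that your extension apiserver will consume:
kubectl get configmap extension-apiserver-authentication -n kube-system -o yaml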
Extension Apiserver Authenticates the Request
The extension apiserver, upon receiving a proxied request from the Kubernetes apiserver, must validate that the request actually did come from a valid authenticating proxy, which role the Kubernetes apiserver is fulfilling. The extension apiserver validates it via:
- Retrieve the following from the configmap in
kube-system
, as described above:- Client CA certificate
- List of allowed names (CNs)
- Header names for username, group and extra info
- Check that the TLS connection was authenticated using a client certificate which:
- Was signed by the CA whose certificate matches the retrieved CA certificate.
- Has a CN in the list of allowed CNs, unless the list is blank, in which case all CNs are allowed.
- Extract the username and group from the appropriate headers
If the above passes, then the request is a valid proxied request from a legitimate authenticating proxy, in this case the Kubernetes apiserver.
Note that it is the responsibility of the extension apiserver implementation to provide the above. Many do it by default, leveraging the k8s.io/apiserver/
package. Others may provide options to override it using command-line options.
In order to have permission to retrieve the configmap, an extension apiserver requires the appropriate role. There is a default role named extension-apiserver-authentication-reader
in the kube-system
namespace which can be assigned.
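A minimal sketch of such an assignment, assuming the extension apiserver runs under a service account named my-extension-apiserver in the namespace my-service-namespace (both names are assumptions):
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: my-extension-apiserver-authentication-reader  # hypothetical binding name
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: extension-apiserver-authentication-reader
subjects:
- kind: ServiceAccount
  name: my-extension-apiserver      # assumed service account name
  namespace: my-service-namespace   # assumed namespace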
Extension Apiserver Authorizes the Request
The extension apiserver now can validate that the user/group retrieved from the headers are authorized to execute the given request. It does so by sending a standard SubjectAccessReview request to the Kubernetes apiserver.
In order for the extension apiserver to be authorized itself to submit the SubjectAccessReview
request to the Kubernetes apiserver, it needs the correct permissions. Kubernetes includes a default ClusterRole
named system:auth-delegator
that has the appropriate permissions. It can be granted to the extension apiserver's service account.
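A corresponding sketch, reusing the assumed service account from the RoleBinding example above:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: my-extension-apiserver-auth-delegator  # hypothetical binding name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:auth-delegator
subjects:
- kind: ServiceAccount
  name: my-extension-apiserver      # assumed service account name
  namespace: my-service-namespace   # assumed namespace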
Extension Apiserver Executes
If the SubjectAccessReview
passes, the extension apiserver executes the request.
Enable Kubernetes Apiserver flags
Enable the aggregation layer via the following kube-apiserver
flags. They may have already been taken care of by your provider.
--requestheader-client-ca-file=<path to aggregator CA cert>
--requestheader-allowed-names=front-proxy-client
--requestheader-extra-headers-prefix=X-Remote-Extra-
--requestheader-group-headers=X-Remote-Group
--requestheader-username-headers=X-Remote-User
--proxy-client-cert-file=<path to aggregator proxy cert>
--proxy-client-key-file=<path to aggregator proxy key>
CA Reusage and Conflicts
The Kubernetes apiserver has two client CA options:
--client-ca-file
--requestheader-client-ca-file
Each of these functions independently and can conflict with each other, if not used correctly.
- --client-ca-file: When a request arrives to the Kubernetes apiserver, if this option is enabled, the Kubernetes apiserver checks the certificate of the request. If it is signed by one of the CA certificates in the file referenced by --client-ca-file, then the request is treated as a legitimate request, and the user is the value of the common name CN=, while the group is the organization O=. See the documentation on TLS authentication.
- --requestheader-client-ca-file: When a request arrives to the Kubernetes apiserver, if this option is enabled, the Kubernetes apiserver checks the certificate of the request. If it is signed by one of the CA certificates in the file referenced by --requestheader-client-ca-file, then the request is treated as a potentially legitimate request. The Kubernetes apiserver then checks if the common name CN= is one of the names in the list provided by --requestheader-allowed-names. If the name is allowed, the request is approved; if it is not, the request is not.
If both --client-ca-file
and --requestheader-client-ca-file
are provided, then the request first checks the --requestheader-client-ca-file
CA and then the --client-ca-file
. Normally, different CAs, either root CAs or intermediate CAs, are used for each of these options; regular client requests match against --client-ca-file
, while aggregation requests match against --requestheader-client-ca-file
. However, if both use the same CA, then client requests that normally would pass via --client-ca-file
will fail, because the CA will match the CA in --requestheader-client-ca-file
, but the common name CN=
will not match one of the acceptable common names in --requestheader-allowed-names
. This can cause your kubelets and other control plane components, as well as end-users, to be unable to authenticate to the Kubernetes apiserver.
For this reason, use different CA certs for the --client-ca-file
option - to authorize control plane components and end-users - and the --requestheader-client-ca-file
option - to authorize aggregation apiserver requests.
If you are not running kube-proxy on a host running the API server, then you must make sure that the system is enabled with the following kube-apiserver
flag:
--enable-aggregator-routing=true
Register APIService objects
You can dynamically configure which client requests are proxied to the extension apiserver. The following is an example registration:
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
name: <name of the registration object>
spec:
group: <API group name this extension apiserver hosts>
version: <API version this extension apiserver hosts>
groupPriorityMinimum: <priority of this APIService for this group, see API documentation>
versionPriority: <prioritizes ordering of this version within a group, see API documentation>
service:
namespace: <namespace of the extension apiserver service>
name: <name of the extension apiserver service>
caBundle: <pem encoded ca cert that signs the server cert used by the webhook>
The name of an APIService object must be a valid path segment name.
Contacting the extension apiserver
Once the Kubernetes apiserver has determined a request should be sent to an extension apiserver, it needs to know how to contact it.
The service
stanza is a reference to the service for an extension apiserver.
The service namespace and name are required. The port is optional and defaults to 443.
Here is an example of an extension apiserver that is configured to be called on port "1234",
and to verify the TLS connection against the ServerName
my-service-name.my-service-namespace.svc
using a custom CA bundle.
apiVersion: apiregistration.k8s.io/v1
kind: APIService
...
spec:
...
service:
namespace: my-service-namespace
name: my-service-name
port: 1234
caBundle: "Ci0tLS0tQk...<base64-encoded PEM bundle>...tLS0K"
...
What's next
- Setup an extension api-server to work with the aggregation layer.
- For a high level overview, see Extending the Kubernetes API with the aggregation layer.
- Learn how to Extend the Kubernetes API Using Custom Resource Definitions.
4.11.2 - Use Custom Resources
4.11.2.1 - Extend the Kubernetes API with CustomResourceDefinitions
This page shows how to install a custom resource into the Kubernetes API by creating a CustomResourceDefinition.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
Your Kubernetes server must be at or later than version 1.16. To check the version, enter kubectl version.
If you are using an older version of Kubernetes that is still supported, switch to
the documentation for that version to see advice that is relevant for your cluster.
Create a CustomResourceDefinition
When you create a new CustomResourceDefinition (CRD), the Kubernetes API Server
creates a new RESTful resource path for each version you specify. The CRD can be
either namespaced or cluster-scoped, as specified in the CRD's scope
field. As
with existing built-in objects, deleting a namespace deletes all custom objects
in that namespace. CustomResourceDefinitions themselves are non-namespaced and
are available to all namespaces.
For example, if you save the following CustomResourceDefinition to resourcedefinition.yaml
:
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
# name must match the spec fields below, and be in the form: <plural>.<group>
name: crontabs.stable.example.com
spec:
# group name to use for REST API: /apis/<group>/<version>
group: stable.example.com
# list of versions supported by this CustomResourceDefinition
versions:
- name: v1
# Each version can be enabled/disabled by Served flag.
served: true
# One and only one version must be marked as the storage version.
storage: true
schema:
openAPIV3Schema:
type: object
properties:
spec:
type: object
properties:
cronSpec:
type: string
image:
type: string
replicas:
type: integer
# either Namespaced or Cluster
scope: Namespaced
names:
# plural name to be used in the URL: /apis/<group>/<version>/<plural>
plural: crontabs
# singular name to be used as an alias on the CLI and for display
singular: crontab
# kind is normally the CamelCased singular type. Your resource manifests use this.
kind: CronTab
# shortNames allow shorter string to match your resource on the CLI
shortNames:
- ct
and create it:
kubectl apply -f resourcedefinition.yaml
Then a new namespaced RESTful API endpoint is created at:
/apis/stable.example.com/v1/namespaces/*/crontabs/...
This endpoint URL can then be used to create and manage custom objects.
The kind
of these objects will be CronTab
from the spec of the
CustomResourceDefinition object you created above.
It might take a few seconds for the endpoint to be created.
You can watch the Established
condition of your CustomResourceDefinition
to be true or watch the discovery information of the API server for your
resource to show up.
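For example, kubectl wait can block until the condition is reported:
kubectl wait --for condition=established --timeout=60s crd/crontabs.stable.example.com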
Create custom objects
After the CustomResourceDefinition object has been created, you can create
custom objects. Custom objects can contain custom fields. These fields can
contain arbitrary JSON.
In the following example, the cronSpec
and image
custom fields are set in a
custom object of kind CronTab
. The kind CronTab
comes from the spec of the
CustomResourceDefinition object you created above.
If you save the following YAML to my-crontab.yaml
:
apiVersion: "stable.example.com/v1"
kind: CronTab
metadata:
name: my-new-cron-object
spec:
cronSpec: "* * * * */5"
image: my-awesome-cron-image
and create it:
kubectl apply -f my-crontab.yaml
You can then manage your CronTab objects using kubectl. For example:
kubectl get crontab
Should print a list like this:
NAME AGE
my-new-cron-object 6s
Resource names are not case-sensitive when using kubectl, and you can use either the singular or plural forms defined in the CRD, as well as any short names.
You can also view the raw YAML data:
kubectl get ct -o yaml
You should see that it contains the custom cronSpec
and image
fields
from the YAML you used to create it:
apiVersion: v1
items:
- apiVersion: stable.example.com/v1
kind: CronTab
metadata:
annotations:
kubectl.kubernetes.io/last-applied-configuration: |
{"apiVersion":"stable.example.com/v1","kind":"CronTab","metadata":{"annotations":{},"name":"my-new-cron-object","namespace":"default"},"spec":{"cronSpec":"* * * * */5","image":"my-awesome-cron-image"}}
creationTimestamp: "2021-06-20T07:35:27Z"
generation: 1
name: my-new-cron-object
namespace: default
resourceVersion: "1326"
uid: 9aab1d66-628e-41bb-a422-57b8b3b1f5a9
spec:
cronSpec: '* * * * */5'
image: my-awesome-cron-image
kind: List
metadata:
resourceVersion: ""
selfLink: ""
Delete a CustomResourceDefinition
When you delete a CustomResourceDefinition, the server will uninstall the RESTful API endpoint and delete all custom objects stored in it.
kubectl delete -f resourcedefinition.yaml
kubectl get crontabs
Error from server (NotFound): Unable to list {"stable.example.com" "v1" "crontabs"}: the server could not find the requested resource (get crontabs.stable.example.com)
If you later recreate the same CustomResourceDefinition, it will start out empty.
Specifying a structural schema
CustomResources store structured data in custom fields (alongside the built-in
fields apiVersion
, kind
and metadata
, which the API server validates
implicitly). With OpenAPI v3.0 validation a schema can be
specified, which is validated during creation and updates, compare below for
details and limits of such a schema.
With apiextensions.k8s.io/v1
the definition of a structural schema is
mandatory for CustomResourceDefinitions. In the beta version of
CustomResourceDefinition, the structural schema was optional.
A structural schema is an OpenAPI v3.0 validation schema which:
1. specifies a non-empty type (via type in OpenAPI) for the root, for each specified field of an object node (via properties or additionalProperties in OpenAPI) and for each item in an array node (via items in OpenAPI), with the exception of:
   - a node with x-kubernetes-int-or-string: true
   - a node with x-kubernetes-preserve-unknown-fields: true
2. for each field in an object and each item in an array which is specified within any of allOf, anyOf, oneOf or not, the schema also specifies the field/item outside of those logical junctors (compare example 1 and 2).
3. does not set description, type, default, additionalProperties, nullable within an allOf, anyOf, oneOf or not, with the exception of the two patterns for x-kubernetes-int-or-string: true (see below).
4. if metadata is specified, then only restrictions on metadata.name and metadata.generateName are allowed.
Non-structural example 1:
allOf:
- properties:
foo:
...
conflicts with rule 2. The following would be correct:
properties:
foo:
...
allOf:
- properties:
foo:
...
Non-structural example 2:
allOf:
- items:
properties:
foo:
...
conflicts with rule 2. The following would be correct:
items:
properties:
foo:
...
allOf:
- items:
properties:
foo:
...
Non-structural example 3:
properties:
foo:
pattern: "abc"
metadata:
type: object
properties:
name:
type: string
pattern: "^a"
finalizers:
type: array
items:
type: string
pattern: "my-finalizer"
anyOf:
- properties:
bar:
type: integer
minimum: 42
required: ["bar"]
description: "foo bar object"
is not a structural schema because of the following violations:
- the type at the root is missing (rule 1).
- the type of foo is missing (rule 1).
- bar inside of anyOf is not specified outside (rule 2).
- bar's type is within anyOf (rule 3).
- the description is set within anyOf (rule 3).
- metadata.finalizers might not be restricted (rule 4).
In contrast, the following, corresponding schema is structural:
type: object
description: "foo bar object"
properties:
foo:
type: string
pattern: "abc"
bar:
type: integer
metadata:
type: object
properties:
name:
type: string
pattern: "^a"
anyOf:
- properties:
bar:
minimum: 42
required: ["bar"]
Violations of the structural schema rules are reported in the NonStructural
condition in the CustomResourceDefinition.
Field pruning
CustomResourceDefinitions store validated resource data in the cluster's persistence store, etcd. As with native Kubernetes resources such as ConfigMap, if you specify a field that the API server does not recognize, the unknown field is pruned (removed) before being persisted.
CRDs converted from apiextensions.k8s.io/v1beta1
to apiextensions.k8s.io/v1
might lack structural schemas, and spec.preserveUnknownFields
might be true
.
For legacy CustomResourceDefinition objects created as
apiextensions.k8s.io/v1beta1
with spec.preserveUnknownFields
set to
true
, the following is also true:
- Pruning is not enabled.
- You can store arbitrary data.
For compatibility with apiextensions.k8s.io/v1
, update your custom
resource definitions to:
- Use a structural OpenAPI schema.
- Set spec.preserveUnknownFields to false.
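An excerpt of such an updated CRD might look like this (a sketch; the metadata and names stanzas are elided):
spec:
  preserveUnknownFields: false
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            # opt back in per sub-tree if you still need to store arbitrary data:
            x-kubernetes-preserve-unknown-fields: true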
If you save the following YAML to my-crontab.yaml
:
apiVersion: "stable.example.com/v1"
kind: CronTab
metadata:
name: my-new-cron-object
spec:
cronSpec: "* * * * */5"
image: my-awesome-cron-image
someRandomField: 42
and create it:
kubectl create --validate=false -f my-crontab.yaml -o yaml
your output is similar to:
apiVersion: stable.example.com/v1
kind: CronTab
metadata:
creationTimestamp: 2017-05-31T12:56:35Z
generation: 1
name: my-new-cron-object
namespace: default
resourceVersion: "285"
uid: 9423255b-4600-11e7-af6a-28d2447dc82b
spec:
cronSpec: '* * * * */5'
image: my-awesome-cron-image
Notice that the field someRandomField
was pruned.
This example turned off client-side validation to demonstrate the API server's behavior, by adding the --validate=false
command line option.
Because the OpenAPI validation schemas are also published
to clients, kubectl
also checks for unknown fields and rejects those objects well before they would be sent to the API server.
Controlling pruning
By default, all unspecified fields for a custom resource, across all versions, are pruned. It is possible though to opt out of that for specific sub-trees of fields by adding x-kubernetes-preserve-unknown-fields: true
in the structural OpenAPI v3 validation schema.
For example:
type: object
properties:
json:
x-kubernetes-preserve-unknown-fields: true
The field json
can store any JSON value, without anything being pruned.
You can also partially specify the permitted JSON; for example:
type: object
properties:
json:
x-kubernetes-preserve-unknown-fields: true
type: object
description: this is arbitrary JSON
With this, only object
type values are allowed.
Pruning is enabled again for each specified property (or additionalProperties
):
type: object
properties:
json:
x-kubernetes-preserve-unknown-fields: true
type: object
properties:
spec:
type: object
properties:
foo:
type: string
bar:
type: string
With this, the value:
json:
spec:
foo: abc
bar: def
something: x
status:
something: x
is pruned to:
json:
spec:
foo: abc
bar: def
status:
something: x
This means that the something
field in the specified spec
object is pruned, but everything outside is not.
IntOrString
Nodes in a schema with x-kubernetes-int-or-string: true
are excluded from rule 1, such that the following is structural:
type: object
properties:
foo:
x-kubernetes-int-or-string: true
Also, those nodes are partially excluded from rule 3 in the sense that the following two patterns are allowed (exactly those, without variations in order or additional fields):
x-kubernetes-int-or-string: true
anyOf:
- type: integer
- type: string
...
and
x-kubernetes-int-or-string: true
allOf:
- anyOf:
- type: integer
- type: string
- ... # zero or more
...
With one of those specifications, both an integer and a string validate.
In Validation Schema Publishing,
x-kubernetes-int-or-string: true
is unfolded to one of the two patterns shown above.
RawExtension
RawExtensions (as in runtime.RawExtension
defined in
k8s.io/apimachinery)
hold complete Kubernetes objects, i.e. with apiVersion
and kind
fields.
It is possible to specify those embedded objects (both completely without constraints or partially specified) by setting x-kubernetes-embedded-resource: true
. For example:
type: object
properties:
foo:
x-kubernetes-embedded-resource: true
x-kubernetes-preserve-unknown-fields: true
Here, the field foo
holds a complete object, e.g.:
foo:
apiVersion: v1
kind: Pod
spec:
...
Because x-kubernetes-preserve-unknown-fields: true
is specified alongside, nothing is pruned. The use of x-kubernetes-preserve-unknown-fields: true
is optional though.
With x-kubernetes-embedded-resource: true
, the apiVersion
, kind
and metadata
are implicitly specified and validated.
Serving multiple versions of a CRD
See Custom resource definition versioning for more information about serving multiple versions of your CustomResourceDefinition and migrating your objects from one version to another.
Advanced topics
Finalizers
Finalizers allow controllers to implement asynchronous pre-delete hooks. Custom objects support finalizers similar to built-in objects.
You can add a finalizer to a custom object like this:
apiVersion: "stable.example.com/v1"
kind: CronTab
metadata:
finalizers:
- stable.example.com/finalizer
Identifiers of custom finalizers consist of a domain name, a forward slash and the name of the finalizer. Any controller can add a finalizer to any object's list of finalizers.
The first delete request on an object with finalizers sets a value for the
metadata.deletionTimestamp
field but does not delete it. Once this value is set,
entries in the finalizers
list can only be removed. While any finalizers remain it is also
impossible to force the deletion of an object.
When the metadata.deletionTimestamp
field is set, controllers watching the object execute any
finalizers they handle and remove the finalizer from the list after they are done. It is the
responsibility of each controller to remove its finalizer from the list.
The value of metadata.deletionGracePeriodSeconds
controls the interval between polling updates.
Once the list of finalizers is empty, meaning all finalizers have been executed, the resource is deleted by Kubernetes.
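For illustration, this is how you might remove a finalizer by hand (a controller would normally do the equivalent through the API); the /0 index assumes the finalizer to drop is the first entry in the list:
kubectl patch crontab my-new-cron-object --type=json \
  -p '[{"op": "remove", "path": "/metadata/finalizers/0"}]'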
Validation
Custom resources are validated via OpenAPI v3 schemas, by x-kubernetes-validations when the Validation Rules feature is enabled, and you can add additional validation using admission webhooks.
Additionally, the following restrictions are applied to the schema:
- These fields cannot be set: definitions, dependencies, deprecated, discriminator, id, patternProperties, readOnly, writeOnly, xml, $ref.
- The field uniqueItems cannot be set to true.
- The field additionalProperties cannot be set to false.
- The field additionalProperties is mutually exclusive with properties.
The x-kubernetes-validations
extension can be used to validate custom resources using Common
Expression Language (CEL) expressions when the Validation
rules feature is enabled and the CustomResourceDefinition schema is a
structural schema.
The default
field can be set when the Defaulting feature is enabled,
which is the case with apiextensions.k8s.io/v1
CustomResourceDefinitions.
Defaulting has been GA since 1.17 (beta since 1.16 with the CustomResourceDefaulting
feature gate
enabled, which is the case automatically for many clusters for beta features).
Refer to the structural schemas section for other restrictions and CustomResourceDefinition features.
The schema is defined in the CustomResourceDefinition. In the following example, the CustomResourceDefinition applies the following validations on the custom object:
- spec.cronSpec must be a string and must be of the form described by the regular expression.
- spec.replicas must be an integer and must have a minimum value of 1 and a maximum value of 10.
Save the CustomResourceDefinition to resourcedefinition.yaml
:
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
name: crontabs.stable.example.com
spec:
group: stable.example.com
versions:
- name: v1
served: true
storage: true
schema:
# openAPIV3Schema is the schema for validating custom objects.
openAPIV3Schema:
type: object
properties:
spec:
type: object
properties:
cronSpec:
type: string
pattern: '^(\d+|\*)(/\d+)?(\s+(\d+|\*)(/\d+)?){4}$'
image:
type: string
replicas:
type: integer
minimum: 1
maximum: 10
scope: Namespaced
names:
plural: crontabs
singular: crontab
kind: CronTab
shortNames:
- ct
and create it:
kubectl apply -f resourcedefinition.yaml
A request to create a custom object of kind CronTab is rejected if there are invalid values in its fields. In the following example, the custom object contains fields with invalid values:
- spec.cronSpec does not match the regular expression.
- spec.replicas is greater than 10.
If you save the following YAML to my-crontab.yaml
:
apiVersion: "stable.example.com/v1"
kind: CronTab
metadata:
name: my-new-cron-object
spec:
cronSpec: "* * * *"
image: my-awesome-cron-image
replicas: 15
and attempt to create it:
kubectl apply -f my-crontab.yaml
then you get an error:
The CronTab "my-new-cron-object" is invalid: []: Invalid value: map[string]interface {}{"apiVersion":"stable.example.com/v1", "kind":"CronTab", "metadata":map[string]interface {}{"name":"my-new-cron-object", "namespace":"default", "deletionTimestamp":interface {}(nil), "deletionGracePeriodSeconds":(*int64)(nil), "creationTimestamp":"2017-09-05T05:20:07Z", "uid":"e14d79e7-91f9-11e7-a598-f0761cb232d1", "clusterName":""}, "spec":map[string]interface {}{"cronSpec":"* * * *", "image":"my-awesome-cron-image", "replicas":15}}:
validation failure list:
spec.cronSpec in body should match '^(\d+|\*)(/\d+)?(\s+(\d+|\*)(/\d+)?){4}$'
spec.replicas in body should be less than or equal to 10
If the fields contain valid values, the object creation request is accepted.
Save the following YAML to my-crontab.yaml
:
apiVersion: "stable.example.com/v1"
kind: CronTab
metadata:
name: my-new-cron-object
spec:
cronSpec: "* * * * */5"
image: my-awesome-cron-image
replicas: 5
And create it:
kubectl apply -f my-crontab.yaml
crontab "my-new-cron-object" created
Validation rules
Kubernetes v1.23 [alpha]
Validation rules have been in alpha since 1.23 and validate custom resources when the
CustomResourceValidationExpressions
feature
gate is enabled.
This feature is only available if the schema is a
structural schema.
Validation rules use the Common Expression Language (CEL)
to validate custom resource values. Validation rules are included in
CustomResourceDefinition schemas using the x-kubernetes-validations
extension.
The Rule is scoped to the location of the x-kubernetes-validations
extension in the schema.
The self
variable in the CEL expression is bound to the scoped value.
For example:
...
openAPIV3Schema:
type: object
properties:
spec:
type: object
x-kubernetes-validations:
- rule: "self.minReplicas <= self.replicas"
message: "replicas should be greater than or equal to minReplicas."
- rule: "self.replicas <= self.maxReplicas"
message: "replicas should be smaller than or equal to maxReplicas."
properties:
...
minReplicas:
type: integer
replicas:
type: integer
maxReplicas:
type: integer
required:
- minReplicas
- replicas
- maxReplicas
will reject a request to create this custom resource:
apiVersion: "stable.example.com/v1"
kind: CronTab
metadata:
name: my-new-cron-object
spec:
minReplicas: 0
replicas: 20
maxReplicas: 10
with the response:
The CronTab "my-new-cron-object" is invalid:
* spec: Invalid value: map[string]interface {}{"maxReplicas":10, "minReplicas":0, "replicas":20}: replicas should be smaller than or equal to maxReplicas.
x-kubernetes-validations
could have multiple rules.
The rule
under x-kubernetes-validations
represents the expression which will be evaluated by CEL.
The message
represents the message displayed when validation fails. If message is unset, the above response would be:
The CronTab "my-new-cron-object" is invalid:
* spec: Invalid value: map[string]interface {}{"maxReplicas":10, "minReplicas":0, "replicas":20}: failed rule: self.replicas <= self.maxReplicas
Validation rules are compiled when CRDs are created or updated. A CRD create or update request fails if compilation of its validation rules fails. The compilation process includes type checking.
Possible compilation failures include:
- no_matching_overload: this function has no overload for the types of the arguments.
  For example, a rule like self == true against a field of integer type will get the error:
  Invalid value: apiextensions.ValidationRule{Rule:"self == true", Message:""}: compilation failed: ERROR: <input>:1:6: found no matching overload for '_==_' applied to '(int, bool)'
- no_such_field: does not contain the desired field.
  For example, a rule like self.nonExistingField > 0 against a non-existing field will return the error:
  Invalid value: apiextensions.ValidationRule{Rule:"self.nonExistingField > 0", Message:""}: compilation failed: ERROR: <input>:1:5: undefined field 'nonExistingField'
- invalid argument: invalid argument to macros.
  For example, a rule like has(self) will return the error:
  Invalid value: apiextensions.ValidationRule{Rule:"has(self)", Message:""}: compilation failed: ERROR: <input>:1:4: invalid argument to has() macro
Validation Rules Examples:
Rule | Purpose |
---|---|
self.minReplicas <= self.replicas && self.replicas <= self.maxReplicas |
Validate that the three fields defining replicas are ordered appropriately |
'Available' in self.stateCounts |
Validate that an entry with the 'Available' key exists in a map |
(size(self.list1) == 0) != (size(self.list2) == 0) |
Validate that one of two lists is non-empty, but not both |
!('MY_KEY' in self.map1) || self['MY_KEY'].matches('^[a-zA-Z]*$') |
Validate the value of a map for a specific key, if it is in the map |
self.envars.filter(e, e.name == 'MY_ENV').all(e, e.value.matches('^[a-zA-Z]*$')) |
Validate the 'value' field of a listMap entry where key field 'name' is 'MY_ENV' |
has(self.expired) && self.created + self.ttl < self.expired |
Validate that 'expired' date is after a 'create' date plus a 'ttl' duration |
self.health.startsWith('ok') |
Validate a 'health' string field has the prefix 'ok' |
self.widgets.exists(w, w.key == 'x' && w.foo < 10) |
Validate that the 'foo' property of a listMap item with a key 'x' is less than 10 |
type(self) == string ? self == '100%' : self == 1000 |
Validate an int-or-string field for both the int and string cases |
self.metadata.name.startsWith(self.prefix) |
Validate that an object's name has the prefix of another field value |
self.set1.all(e, !(e in self.set2)) |
Validate that two listSets are disjoint |
size(self.names) == size(self.details) && self.names.all(n, n in self.details) |
Validate the 'details' map is keyed by the items in the 'names' listSet |
Xref: Supported evaluation on CEL
- If the Rule is scoped to the root of a resource, it may make field selection into any fields declared in the OpenAPIv3 schema of the CRD as well as apiVersion, kind, metadata.name and metadata.generateName. This includes selection of fields in both the spec and status in the same expression:
...
openAPIV3Schema:
  type: object
  x-kubernetes-validations:
  - rule: "self.status.availableReplicas >= self.spec.minReplicas"
  properties:
    spec:
      type: object
      properties:
        minReplicas:
          type: integer
        ...
    status:
      type: object
      properties:
        availableReplicas:
          type: integer
- If the Rule is scoped to an object with properties, the accessible properties of the object are field selectable via self.field and field presence can be checked via has(self.field). Null valued fields are treated as absent fields in CEL expressions.
...
openAPIV3Schema:
  type: object
  properties:
    spec:
      type: object
      x-kubernetes-validations:
      - rule: "has(self.foo)"
      properties:
        ...
        foo:
          type: integer
- If the Rule is scoped to an object with additionalProperties (i.e. a map), the values of the map are accessible via self[mapKey], map containment can be checked via mapKey in self, and all entries of the map are accessible via CEL macros and functions such as self.all(...).
...
openAPIV3Schema:
  type: object
  properties:
    spec:
      type: object
      x-kubernetes-validations:
      - rule: "self['xyz'].foo > 0"
      additionalProperties:
        ...
        type: object
        properties:
          foo:
            type: integer
- If the Rule is scoped to an array, the elements of the array are accessible via self[i] and also by macros and functions.
...
openAPIV3Schema:
  type: object
  properties:
    ...
    foo:
      type: array
      x-kubernetes-validations:
      - rule: "size(self) == 1"
      items:
        type: string
- If the Rule is scoped to a scalar, self is bound to the scalar value.
...
openAPIV3Schema:
  type: object
  properties:
    spec:
      type: object
      properties:
        ...
        foo:
          type: integer
          x-kubernetes-validations:
          - rule: "self > 0"
Examples:
type of the field rule scoped to | Rule example |
---|---|
root object | self.status.actual <= self.spec.maxDesired |
map of objects | self.components['Widget'].priority < 10 |
list of integers | self.values.all(value, value >= 0 && value < 100) |
string | self.startsWith('kube') |
The apiVersion
, kind
, metadata.name
and metadata.generateName
are always accessible from the root of the
object and from any x-kubernetes-embedded-resource annotated objects. No other metadata properties are accessible.
Unknown data preserved in custom resources via x-kubernetes-preserve-unknown-fields
is not accessible in CEL
expressions. This includes:
- Unknown field values that are preserved by object schemas with x-kubernetes-preserve-unknown-fields.
- Object properties where the property schema is of an "unknown type". An "unknown type" is recursively defined as:
- A schema with no type and x-kubernetes-preserve-unknown-fields set to true
- An array where the items schema is of an "unknown type"
- An object where the additionalProperties schema is of an "unknown type"
Only property names of the form [a-zA-Z_.-/][a-zA-Z0-9_.-/]*
are accessible.
Accessible property names are escaped according to the following rules when accessed in the expression:
escape sequence | property name equivalent |
---|---|
__underscores__ |
__ |
__dot__ |
. |
__dash__ |
- |
__slash__ |
/ |
__{keyword}__ |
CEL RESERVED keyword |
Note: CEL RESERVED keyword needs to match the exact property name to be escaped (e.g. int in the word sprint would not be escaped).
Examples on escaping:
property name | rule with escaped property name |
---|---|
namespace | self.__namespace__ > 0 |
x-prop | self.x__dash__prop > 0 |
redact__d | self.redact__underscores__d > 0 |
string | self.startsWith('kube') |
Equality on arrays with x-kubernetes-list-type
of set
or map
ignores element order, i.e. [1, 2] == [2, 1].
Concatenation on arrays with x-kubernetes-list-type uses the semantics of the list type:
set
:X + Y
performs a union where the array positions of all elements inX
are preserved and non-intersecting elements inY
are appended, retaining their partial order.map
:X + Y
performs a merge where the array positions of all keys inX
are preserved but the values are overwritten by values inY
when the key sets ofX
andY
intersect. Elements inY
with non-intersecting keys are appended, retaining their partial order.
Here is the type mapping between OpenAPIv3 declarations and CEL types:
OpenAPIv3 type | CEL type |
---|---|
'object' with Properties | object / "message type" |
'object' with AdditionalProperties | map |
'object' with x-kubernetes-embedded-type | object / "message type", 'apiVersion', 'kind', 'metadata.name' and 'metadata.generateName' are implicitly included in schema |
'object' with x-kubernetes-preserve-unknown-fields | object / "message type", unknown fields are NOT accessible in CEL expression |
x-kubernetes-int-or-string | dynamic object that is either an int or a string, type(value) can be used to check the type |
'array' | list |
'array' with x-kubernetes-list-type=map | list with map based Equality & unique key guarantees |
'array' with x-kubernetes-list-type=set | list with set based Equality & unique entry guarantees |
'boolean' | boolean |
'number' (all formats) | double |
'integer' (all formats) | int (64) |
'null' | null_type |
'string' | string |
'string' with format=byte (base64 encoded) | bytes |
'string' with format=date | timestamp (google.protobuf.Timestamp) |
'string' with format=datetime | timestamp (google.protobuf.Timestamp) |
'string' with format=duration | duration (google.protobuf.Duration) |
xref: CEL types, OpenAPI types, Kubernetes Structural Schemas.
Defaulting
Defaulting requires apiextensions.k8s.io/v1.
Defaulting allows you to specify default values in the OpenAPI v3 validation schema:
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
name: crontabs.stable.example.com
spec:
group: stable.example.com
versions:
- name: v1
served: true
storage: true
schema:
# openAPIV3Schema is the schema for validating custom objects.
openAPIV3Schema:
type: object
properties:
spec:
type: object
properties:
cronSpec:
type: string
pattern: '^(\d+|\*)(/\d+)?(\s+(\d+|\*)(/\d+)?){4}$'
default: "5 0 * * *"
image:
type: string
replicas:
type: integer
minimum: 1
maximum: 10
default: 1
scope: Namespaced
names:
plural: crontabs
singular: crontab
kind: CronTab
shortNames:
- ct
With this both cronSpec
and replicas
are defaulted:
apiVersion: "stable.example.com/v1"
kind: CronTab
metadata:
name: my-new-cron-object
spec:
image: my-awesome-cron-image
leads to
apiVersion: "stable.example.com/v1"
kind: CronTab
metadata:
name: my-new-cron-object
spec:
cronSpec: "5 0 * * *"
image: my-awesome-cron-image
replicas: 1
Defaulting happens on the object
- in the request to the API server using the request version defaults,
- when reading from etcd using the storage version defaults,
- after mutating admission plugins with non-empty patches using the admission webhook object version defaults.
Defaults applied when reading data from etcd are not automatically written back to etcd. An update request via the API is required to persist those defaults back into etcd.
Default values must be pruned (with the exception of defaults for metadata
fields) and must validate against a provided schema.
Default values for metadata
fields of x-kubernetes-embedded-resource: true
nodes (or parts of a default value covering metadata
) are not pruned during CustomResourceDefinition creation, but through the pruning step during handling of requests.
Defaulting and Nullable
New in 1.20: null values for fields that either don't specify the nullable flag, or give it a false
value, will be pruned before defaulting happens. If a default is present, it will be applied. When nullable is true
, null values will be conserved and won't be defaulted.
For example, given the OpenAPI schema below:
type: object
properties:
spec:
type: object
properties:
foo:
type: string
nullable: false
default: "default"
bar:
type: string
nullable: true
baz:
type: string
creating an object with null values for foo
and bar
and baz
spec:
foo: null
bar: null
baz: null
leads to
spec:
foo: "default"
bar: null
with foo
pruned and defaulted because the field is non-nullable, bar
maintaining the null value due to nullable: true
, and baz
pruned because the field is non-nullable and has no default.
Publish Validation Schema in OpenAPI v2
CustomResourceDefinition OpenAPI v3 validation schemas which are structural and enable pruning are published as part of the OpenAPI v2 spec from Kubernetes API server.
The kubectl command-line tool consumes the published schema to perform client-side validation (kubectl create and kubectl apply) and schema explanation (kubectl explain) on custom resources. The published schema can be consumed for other purposes as well, like client generation or documentation.
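For example, once the schema is published, kubectl explain can describe the custom fields:
kubectl explain crontabs.spec
kubectl explain crontabs.spec.cronSpec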
The OpenAPI v3 validation schema is converted to OpenAPI v2 schema, and
show up in definitions
and paths
fields in the OpenAPI v2 spec.
The following modifications are applied during the conversion to keep backwards compatibility with kubectl versions prior to 1.13. These modifications prevent kubectl from being over-strict and rejecting valid OpenAPI schemas that it doesn't understand. The conversion won't modify the validation schema defined in the CRD, and therefore won't affect validation in the API server.
- The following fields are removed as they aren't supported by OpenAPI v2 (in future versions OpenAPI v3 will be used without these restrictions):
  - The fields allOf, anyOf, oneOf and not are removed
- If nullable: true is set, we drop type, nullable, items and properties because OpenAPI v2 is not able to express nullable. This is necessary to avoid kubectl rejecting otherwise valid objects.
Additional printer columns
The kubectl tool relies on server-side output formatting. Your cluster's API server decides which
columns are shown by the kubectl get
command. You can customize these columns for a
CustomResourceDefinition. The following example adds the Spec
, Replicas
, and Age
columns.
Save the CustomResourceDefinition to resourcedefinition.yaml
:
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
name: crontabs.stable.example.com
spec:
group: stable.example.com
scope: Namespaced
names:
plural: crontabs
singular: crontab
kind: CronTab
shortNames:
- ct
versions:
- name: v1
served: true
storage: true
schema:
openAPIV3Schema:
type: object
properties:
spec:
type: object
properties:
cronSpec:
type: string
image:
type: string
replicas:
type: integer
additionalPrinterColumns:
- name: Spec
type: string
description: The cron spec defining the interval a CronJob is run
jsonPath: .spec.cronSpec
- name: Replicas
type: integer
description: The number of jobs launched by the CronJob
jsonPath: .spec.replicas
- name: Age
type: date
jsonPath: .metadata.creationTimestamp
Create the CustomResourceDefinition:
kubectl apply -f resourcedefinition.yaml
Create an instance using the my-crontab.yaml
from the previous section.
Invoke the server-side printing:
kubectl get crontab my-new-cron-object
Notice the NAME
, SPEC
, REPLICAS
, and AGE
columns in the output:
NAME SPEC REPLICAS AGE
my-new-cron-object * * * * * 1 7s
NAME
column is implicit and does not need to be defined in the CustomResourceDefinition.
Priority
Each column includes a priority
field. Currently, the priority
differentiates between columns shown in standard view or wide view (using the -o wide
flag).
- Columns with priority
0
are shown in standard view. - Columns with priority greater than
0
are shown only in wide view.
Type
A column's type
field can be any of the following (compare OpenAPI v3 data types):
integer
– non-floating-point numbersnumber
– floating point numbersstring
– stringsboolean
–true
orfalse
date
– rendered differentially as time since this timestamp.
If the value inside a CustomResource does not match the type specified for the column, the value is omitted. Use CustomResource validation to ensure that the value types are correct.
Format
A column's format
field can be any of the following:
int32
int64
float
double
byte
date
date-time
password
The column's format
controls the style used when kubectl
prints the value.
Subresources
Custom resources support /status
and /scale
subresources.
The status and scale subresources can be optionally enabled by defining them in the CustomResourceDefinition.
Status subresource
When the status subresource is enabled, the /status subresource for the custom resource is exposed.
- The status and the spec stanzas are represented by the .status and .spec JSONPaths respectively inside of a custom resource.
- PUT requests to the /status subresource take a custom resource object and ignore changes to anything except the status stanza.
- PUT requests to the /status subresource only validate the status stanza of the custom resource.
- PUT/POST/PATCH requests to the custom resource ignore changes to the status stanza.
- The .metadata.generation value is incremented for all changes, except for changes to .metadata or .status.
- Only the following constructs are allowed at the root of the CRD OpenAPI validation schema:
description
example
exclusiveMaximum
exclusiveMinimum
externalDocs
format
items
maximum
maxItems
maxLength
minimum
minItems
minLength
multipleOf
pattern
properties
required
title
type
uniqueItems
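For illustration, here is a minimal sketch of updating only the status stanza through the /status endpoint, assuming kubectl proxy is running on port 8001 and a CronTab schema that defines status.replicas (both are assumptions, not part of the example above):
kubectl proxy --port=8001 &
curl --request PATCH --header "Content-Type: application/merge-patch+json" \
  http://localhost:8001/apis/stable.example.com/v1/namespaces/default/crontabs/my-new-cron-object/status \
  --data '{"status":{"replicas":3}}'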
Scale subresource
When the scale subresource is enabled, the /scale subresource for the custom resource is exposed.
The autoscaling/v1.Scale object is sent as the payload for /scale.
To enable the scale subresource, the following fields are defined in the CustomResourceDefinition.
- specReplicasPath defines the JSONPath inside of a custom resource that corresponds to scale.spec.replicas.
  - It is a required value.
  - Only JSONPaths under .spec and with the dot notation are allowed.
  - If there is no value under the specReplicasPath in the custom resource, the /scale subresource will return an error on GET.
- statusReplicasPath defines the JSONPath inside of a custom resource that corresponds to scale.status.replicas.
  - It is a required value.
  - Only JSONPaths under .status and with the dot notation are allowed.
  - If there is no value under the statusReplicasPath in the custom resource, the status replica value in the /scale subresource will default to 0.
- labelSelectorPath defines the JSONPath inside of a custom resource that corresponds to Scale.Status.Selector.
  - It is an optional value.
  - It must be set to work with HPA.
  - Only JSONPaths under .status or .spec and with the dot notation are allowed.
  - If there is no value under the labelSelectorPath in the custom resource, the status selector value in the /scale subresource will default to the empty string.
  - The field pointed to by this JSON path must be a string field (not a complex selector struct) which contains a serialized label selector in string form.
In the following example, both status and scale subresources are enabled.
Save the CustomResourceDefinition to resourcedefinition.yaml
:
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
name: crontabs.stable.example.com
spec:
group: stable.example.com
versions:
- name: v1
served: true
storage: true
schema:
openAPIV3Schema:
type: object
properties:
spec:
type: object
properties:
cronSpec:
type: string
image:
type: string
replicas:
type: integer
status:
type: object
properties:
replicas:
type: integer
labelSelector:
type: string
# subresources describes the subresources for custom resources.
subresources:
# status enables the status subresource.
status: {}
# scale enables the scale subresource.
scale:
# specReplicasPath defines the JSONPath inside of a custom resource that corresponds to Scale.Spec.Replicas.
specReplicasPath: .spec.replicas
# statusReplicasPath defines the JSONPath inside of a custom resource that corresponds to Scale.Status.Replicas.
statusReplicasPath: .status.replicas
# labelSelectorPath defines the JSONPath inside of a custom resource that corresponds to Scale.Status.Selector.
labelSelectorPath: .status.labelSelector
scope: Namespaced
names:
plural: crontabs
singular: crontab
kind: CronTab
shortNames:
- ct
And create it:
kubectl apply -f resourcedefinition.yaml
After the CustomResourceDefinition object has been created, you can create custom objects.
If you save the following YAML to my-crontab.yaml
:
apiVersion: "stable.example.com/v1"
kind: CronTab
metadata:
name: my-new-cron-object
spec:
cronSpec: "* * * * */5"
image: my-awesome-cron-image
replicas: 3
and create it:
kubectl apply -f my-crontab.yaml
Then new namespaced RESTful API endpoints are created at:
/apis/stable.example.com/v1/namespaces/*/crontabs/status
and
/apis/stable.example.com/v1/namespaces/*/crontabs/scale
A custom resource can be scaled using the kubectl scale command.
For example, the following command sets .spec.replicas of the custom resource created above to 5:
kubectl scale --replicas=5 crontabs/my-new-cron-object
crontabs "my-new-cron-object" scaled
kubectl get crontabs my-new-cron-object -o jsonpath='{.spec.replicas}'
5
You can use a PodDisruptionBudget to protect custom resources that have the scale subresource enabled.
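You can also inspect the /scale subresource directly. A minimal sketch, assuming kubectl proxy is running on port 8001 and the my-new-cron-object example above exists in the default namespace:
kubectl proxy --port=8001 &
curl http://localhost:8001/apis/stable.example.com/v1/namespaces/default/crontabs/my-new-cron-object/scale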
Categories
Categories is a list of grouped resources the custom resource belongs to (e.g. all).
You can use kubectl get <category-name> to list the resources belonging to the category.
The following example adds all to the list of categories in the CustomResourceDefinition and illustrates how to output the custom resource using kubectl get all.
Save the following CustomResourceDefinition to resourcedefinition.yaml
:
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
name: crontabs.stable.example.com
spec:
group: stable.example.com
versions:
- name: v1
served: true
storage: true
schema:
openAPIV3Schema:
type: object
properties:
spec:
type: object
properties:
cronSpec:
type: string
image:
type: string
replicas:
type: integer
scope: Namespaced
names:
plural: crontabs
singular: crontab
kind: CronTab
shortNames:
- ct
# categories is a list of grouped resources the custom resource belongs to.
categories:
- all
and create it:
kubectl apply -f resourcedefinition.yaml
After the CustomResourceDefinition object has been created, you can create custom objects.
Save the following YAML to my-crontab.yaml
:
apiVersion: "stable.example.com/v1"
kind: CronTab
metadata:
name: my-new-cron-object
spec:
cronSpec: "* * * * */5"
image: my-awesome-cron-image
and create it:
kubectl apply -f my-crontab.yaml
You can specify the category when using kubectl get:
kubectl get all
and it will include the custom resources of kind CronTab:
NAME AGE
crontabs/my-new-cron-object 3s
What's next
- Read about custom resources.
- Serve multiple versions of a CustomResourceDefinition.
4.11.2.2 - Versions in CustomResourceDefinitions
This page explains how to add versioning information to CustomResourceDefinitions, to indicate the stability level of your CustomResourceDefinitions or advance your API to a new version with conversion between API representations. It also describes how to upgrade an object from one version to another.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
You should have an initial understanding of custom resources.
Your Kubernetes server must be at or later than version v1.16. To check the version, enter kubectl version.
Overview
The CustomResourceDefinition API provides a workflow for introducing and upgrading to new versions of a CustomResourceDefinition.
When a CustomResourceDefinition is created, the first version is set in the
CustomResourceDefinition spec.versions
list to an appropriate stability level
and a version number. For example v1beta1
would indicate that the first
version is not yet stable. All custom resource objects will initially be stored
at this version.
Once the CustomResourceDefinition is created, clients may begin using the
v1beta1
API.
Later it might be necessary to add a new version, such as v1.
Adding a new version:
- Pick a conversion strategy. Since custom resource objects need to be able to be served at both versions, they will sometimes be served at a different version than their storage version. In order for this to be possible, the custom resource objects must sometimes be converted between the version they are stored at and the version they are served at. If the conversion involves schema changes and requires custom logic, a conversion webhook should be used. If there are no schema changes, the default None conversion strategy may be used and only the apiVersion field will be modified when serving different versions.
- If using conversion webhooks, create and deploy the conversion webhook. See the Webhook conversion for more details.
- Update the CustomResourceDefinition to include the new version in the spec.versions list with served:true. Also, set the spec.conversion field to the selected conversion strategy. If using a conversion webhook, configure the spec.conversion.webhookClientConfig field to call the webhook.
Once the new version is added, clients may incrementally migrate to the new version. It is perfectly safe for some clients to use the old version while others use the new version.
Migrate stored objects to the new version:
- See the upgrade existing objects to a new stored version section.
It is safe for clients to use both the old and new version before, during and after upgrading the objects to a new stored version.
Removing an old version:
- Ensure all clients are fully migrated to the new version. The kube-apiserver logs can be reviewed to help identify any clients that are still accessing via the old version.
- Set served to false for the old version in the spec.versions list. If any clients are still unexpectedly using the old version they may begin reporting errors attempting to access the custom resource objects at the old version. If this occurs, switch back to using served:true on the old version, migrate the remaining clients to the new version and repeat this step.
- Ensure the upgrade of existing objects to the new stored version step has been completed.
  - Verify that storage is set to true for the new version in the spec.versions list in the CustomResourceDefinition.
  - Verify that the old version is no longer listed in the CustomResourceDefinition status.storedVersions.
- Remove the old version from the CustomResourceDefinition spec.versions list.
- Drop conversion support for the old version in conversion webhooks.
Specify multiple versions
The CustomResourceDefinition API versions
field can be used to support multiple versions of custom resources that you
have developed. Versions can have different schemas, and conversion webhooks can convert custom resources between versions.
Webhook conversions should follow the Kubernetes API conventions wherever applicable. Specifically, see the API change documentation for a set of useful gotchas and suggestions.
In apiextensions.k8s.io/v1beta1, there was a version field instead of versions. The version field is deprecated and optional, but if it is not empty, it must match the first item in the versions field.
This example shows a CustomResourceDefinition with two versions. For the first example, the assumption is all versions share the same schema with no conversion between them. The comments in the YAML provide more context.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
# name must match the spec fields below, and be in the form: <plural>.<group>
name: crontabs.example.com
spec:
# group name to use for REST API: /apis/<group>/<version>
group: example.com
# list of versions supported by this CustomResourceDefinition
versions:
- name: v1beta1
# Each version can be enabled/disabled by Served flag.
served: true
# One and only one version must be marked as the storage version.
storage: true
# A schema is required
schema:
openAPIV3Schema:
type: object
properties:
host:
type: string
port:
type: string
- name: v1
served: true
storage: false
schema:
openAPIV3Schema:
type: object
properties:
host:
type: string
port:
type: string
# The conversion section is introduced in Kubernetes 1.13+ with a default value of
# None conversion (strategy sub-field set to None).
conversion:
# None conversion assumes the same schema for all versions and only sets the apiVersion
# field of custom resources to the proper value
strategy: None
# either Namespaced or Cluster
scope: Namespaced
names:
# plural name to be used in the URL: /apis/<group>/<version>/<plural>
plural: crontabs
# singular name to be used as an alias on the CLI and for display
singular: crontab
# kind is normally the CamelCased singular type. Your resource manifests use this.
kind: CronTab
# shortNames allow shorter string to match your resource on the CLI
shortNames:
- ct
# Deprecated in v1.16 in favor of apiextensions.k8s.io/v1
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
# name must match the spec fields below, and be in the form: <plural>.<group>
name: crontabs.example.com
spec:
# group name to use for REST API: /apis/<group>/<version>
group: example.com
# list of versions supported by this CustomResourceDefinition
versions:
- name: v1beta1
# Each version can be enabled/disabled by Served flag.
served: true
# One and only one version must be marked as the storage version.
storage: true
- name: v1
served: true
storage: false
validation:
openAPIV3Schema:
type: object
properties:
host:
type: string
port:
type: string
# The conversion section is introduced in Kubernetes 1.13+ with a default value of
# None conversion (strategy sub-field set to None).
conversion:
# None conversion assumes the same schema for all versions and only sets the apiVersion
# field of custom resources to the proper value
strategy: None
# either Namespaced or Cluster
scope: Namespaced
names:
# plural name to be used in the URL: /apis/<group>/<version>/<plural>
plural: crontabs
# singular name to be used as an alias on the CLI and for display
singular: crontab
# kind is normally the PascalCased singular type. Your resource manifests use this.
kind: CronTab
# shortNames allow shorter string to match your resource on the CLI
shortNames:
- ct
You can save the CustomResourceDefinition in a YAML file, then use
kubectl apply
to create it.
kubectl apply -f my-versioned-crontab.yaml
After creation, the API server starts to serve each enabled version at an HTTP
REST endpoint. In the above example, the API versions are available at
/apis/example.com/v1beta1
and /apis/example.com/v1
.
Version priority
Regardless of the order in which versions are defined in a CustomResourceDefinition, the version with the highest priority is used by kubectl as the default version to access objects. The priority is determined by parsing the name field to determine the version number, the stability (GA, Beta, or Alpha), and the sequence within that stability level.
The algorithm used for sorting the versions is designed to sort versions in the
same way that the Kubernetes project sorts Kubernetes versions. Versions start with a
v
followed by a number, an optional beta
or alpha
designation, and
optional additional numeric versioning information. Broadly, a version string might look
like v2
or v2beta1
. Versions are sorted using the following algorithm:
- Entries that follow Kubernetes version patterns are sorted before those that do not.
- For entries that follow Kubernetes version patterns, the numeric portions of the version string are sorted largest to smallest.
- If the strings beta or alpha follow the first numeric portion, they are sorted in that order, after the equivalent string without the beta or alpha suffix (which is presumed to be the GA version).
- If another number follows the beta or alpha, those numbers are also sorted from largest to smallest.
- Strings that don't fit the above format are sorted alphabetically and the numeric portions are not treated specially. Notice that in the example below, foo1 is sorted above foo10. This is different from the sorting of the numeric portion of entries that do follow the Kubernetes version patterns.
This might make sense if you look at the following sorted version list:
- v10
- v2
- v1
- v11beta2
- v10beta3
- v3beta1
- v12alpha1
- v11alpha2
- foo1
- foo10
For the example in Specify multiple versions, the version sort order is v1, followed by v1beta1. This causes the kubectl command to use v1 as the default version unless the provided object specifies the version.
Version deprecation
Kubernetes v1.19 [stable]
Starting in v1.19, a CustomResourceDefinition can indicate a particular version of the resource it defines is deprecated. When API requests to a deprecated version of that resource are made, a warning message is returned in the API response as a header. The warning message for each deprecated version of the resource can be customized if desired.
A customized warning message should indicate the deprecated API group, version, and kind, and should indicate what API group, version, and kind should be used instead, if applicable.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: crontabs.example.com
spec:
group: example.com
names:
plural: crontabs
singular: crontab
kind: CronTab
scope: Namespaced
versions:
- name: v1alpha1
served: true
# This indicates the v1alpha1 version of the custom resource is deprecated.
# API requests to this version receive a warning header in the server response.
deprecated: true
# This overrides the default warning returned to API clients making v1alpha1 API requests.
deprecationWarning: "example.com/v1alpha1 CronTab is deprecated; see http://example.com/v1alpha1-v1 for instructions to migrate to example.com/v1 CronTab"
schema: ...
- name: v1beta1
served: true
# This indicates the v1beta1 version of the custom resource is deprecated.
# API requests to this version receive a warning header in the server response.
# A default warning message is returned for this version.
deprecated: true
schema: ...
- name: v1
served: true
storage: true
schema: ...
# Deprecated in v1.16 in favor of apiextensions.k8s.io/v1
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
name: crontabs.example.com
spec:
group: example.com
names:
plural: crontabs
singular: crontab
kind: CronTab
scope: Namespaced
validation: ...
versions:
- name: v1alpha1
served: true
# This indicates the v1alpha1 version of the custom resource is deprecated.
# API requests to this version receive a warning header in the server response.
deprecated: true
# This overrides the default warning returned to API clients making v1alpha1 API requests.
deprecationWarning: "example.com/v1alpha1 CronTab is deprecated; see http://example.com/v1alpha1-v1 for instructions to migrate to example.com/v1 CronTab"
- name: v1beta1
served: true
# This indicates the v1beta1 version of the custom resource is deprecated.
# API requests to this version receive a warning header in the server response.
# A default warning message is returned for this version.
deprecated: true
- name: v1
served: true
storage: true
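Clients see the warning when they request a deprecated version explicitly. A sketch of what this looks like with kubectl, using the fully-qualified resource form; the exact output wording may vary by kubectl version:
kubectl get crontabs.v1alpha1.example.com
# Warning: example.com/v1alpha1 CronTab is deprecated; see http://example.com/v1alpha1-v1 for instructions to migrate to example.com/v1 CronTab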
Webhook conversion
Kubernetes v1.16 [stable]
The CustomResourceWebhookConversion feature must be enabled, which is the case automatically for many clusters for beta features. Please refer to the feature gate documentation for more information.
The above example has a None conversion between versions which only sets the apiVersion
field
on conversion and does not change the rest of the object. The API server also supports webhook
conversions that call an external service in case a conversion is required. For example when:
- a custom resource is requested in a different version than the stored version.
- a watch is created in one version but the changed object is stored in another version.
- a custom resource PUT request is in a different version than the storage version.
To cover all of these cases and to optimize conversion by the API server, the conversion requests may contain multiple objects in order to minimize the external calls. The webhook should perform these conversions independently.
Write a conversion webhook server
Please refer to the implementation of the custom resource conversion webhook
server
that is validated in a Kubernetes e2e test. The webhook handles the
ConversionReview
requests sent by the API servers, and sends back conversion
results wrapped in ConversionResponse
. Note that the request
contains a list of custom resources that need to be converted independently without
changing the order of objects.
The example server is organized in a way to be reused for other conversions. Most of the common code is located in the framework file, leaving only one function to be implemented for different conversions.
The example conversion webhook server leaves the ClientAuth field empty, which defaults to NoClientCert. This means that the webhook server does not authenticate the identity of the clients, supposedly API servers. If you need mutual TLS or other ways to authenticate the clients, see how to authenticate API servers.
Permissible mutations
A conversion webhook must not mutate anything inside of metadata of the converted object other than labels and annotations.
Attempted changes to name, UID and namespace are rejected and fail the request which caused the conversion. All other changes are ignored.
Deploy the conversion webhook service
Documentation for deploying the conversion webhook is the same as for the
admission webhook example service.
The assumption for the next sections is that the conversion webhook server is deployed to a service named example-conversion-webhook-server in the default namespace and serves traffic on the path /crdconvert.
Configure CustomResourceDefinition to use conversion webhooks
The None conversion example can be extended to use the conversion webhook by modifying the conversion section of the spec:
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
# name must match the spec fields below, and be in the form: <plural>.<group>
name: crontabs.example.com
spec:
# group name to use for REST API: /apis/<group>/<version>
group: example.com
# list of versions supported by this CustomResourceDefinition
versions:
- name: v1beta1
# Each version can be enabled/disabled by Served flag.
served: true
# One and only one version must be marked as the storage version.
storage: true
# Each version can define its own schema when there is no
# top-level schema defined.
schema:
openAPIV3Schema:
type: object
properties:
hostPort:
type: string
- name: v1
served: true
storage: false
schema:
openAPIV3Schema:
type: object
properties:
host:
type: string
port:
type: string
conversion:
# a Webhook strategy instructs the API server to call an external webhook for any conversion between custom resources.
strategy: Webhook
# webhook is required when strategy is `Webhook` and it configures the webhook endpoint to be called by API server.
webhook:
# conversionReviewVersions indicates what ConversionReview versions are understood/preferred by the webhook.
# The first version in the list understood by the API server is sent to the webhook.
# The webhook must respond with a ConversionReview object in the same version it received.
conversionReviewVersions: ["v1","v1beta1"]
clientConfig:
service:
namespace: default
name: example-conversion-webhook-server
path: /crdconvert
caBundle: "Ci0tLS0tQk...<base64-encoded PEM bundle>...tLS0K"
# either Namespaced or Cluster
scope: Namespaced
names:
# plural name to be used in the URL: /apis/<group>/<version>/<plural>
plural: crontabs
# singular name to be used as an alias on the CLI and for display
singular: crontab
# kind is normally the CamelCased singular type. Your resource manifests use this.
kind: CronTab
# shortNames allow shorter string to match your resource on the CLI
shortNames:
- ct
# Deprecated in v1.16 in favor of apiextensions.k8s.io/v1
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
# name must match the spec fields below, and be in the form: <plural>.<group>
name: crontabs.example.com
spec:
# group name to use for REST API: /apis/<group>/<version>
group: example.com
# prunes object fields that are not specified in OpenAPI schemas below.
preserveUnknownFields: false
# list of versions supported by this CustomResourceDefinition
versions:
- name: v1beta1
# Each version can be enabled/disabled by Served flag.
served: true
# One and only one version must be marked as the storage version.
storage: true
# Each version can define its own schema when there is no
# top-level schema defined.
schema:
openAPIV3Schema:
type: object
properties:
hostPort:
type: string
- name: v1
served: true
storage: false
schema:
openAPIV3Schema:
type: object
properties:
host:
type: string
port:
type: string
conversion:
# a Webhook strategy instructs the API server to call an external webhook for any conversion between custom resources.
strategy: Webhook
# webhookClientConfig is required when strategy is `Webhook` and it configures the webhook endpoint to be called by API server.
webhookClientConfig:
service:
namespace: default
name: example-conversion-webhook-server
path: /crdconvert
caBundle: "Ci0tLS0tQk...<base64-encoded PEM bundle>...tLS0K"
# either Namespaced or Cluster
scope: Namespaced
names:
# plural name to be used in the URL: /apis/<group>/<version>/<plural>
plural: crontabs
# singular name to be used as an alias on the CLI and for display
singular: crontab
# kind is normally the CamelCased singular type. Your resource manifests use this.
kind: CronTab
# shortNames allow shorter string to match your resource on the CLI
shortNames:
- ct
You can save the CustomResourceDefinition in a YAML file, then use
kubectl apply
to apply it.
kubectl apply -f my-versioned-crontab-with-conversion.yaml
Make sure the conversion service is up and running before applying new changes.
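One way to confirm the conversion service is serving traffic is to check that its Endpoints are populated; the service name and namespace are the ones assumed earlier on this page:
kubectl get endpoints example-conversion-webhook-server --namespace default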
Contacting the webhook
Once the API server has determined a request should be sent to a conversion webhook,
it needs to know how to contact the webhook. This is specified in the webhookClientConfig
stanza of the webhook configuration.
Conversion webhooks can either be called via a URL or a service reference, and can optionally include a custom CA bundle to use to verify the TLS connection.
URL
url gives the location of the webhook, in standard URL form (scheme://host:port/path).
The host should not refer to a service running in the cluster; use a service reference by specifying the service field instead. The host might be resolved via external DNS in some apiservers (i.e., kube-apiserver cannot resolve in-cluster DNS as that would be a layering violation). host may also be an IP address.
Please note that using localhost or 127.0.0.1 as a host is risky unless you take great care to run this webhook on all hosts which run an apiserver which might need to make calls to this webhook. Such installations are likely to be non-portable or not readily run in a new cluster.
The scheme must be "https"; the URL must begin with "https://".
Attempting to use a user or basic auth (for example "user:password@") is not allowed. Fragments ("#...") and query parameters ("?...") are also not allowed.
Here is an example of a conversion webhook configured to call a URL (and expects the TLS certificate to be verified using system trust roots, so does not specify a caBundle):
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
...
spec:
...
conversion:
strategy: Webhook
webhook:
clientConfig:
url: "https://my-webhook.example.com:9443/my-webhook-path"
...
# Deprecated in v1.16 in favor of apiextensions.k8s.io/v1
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
...
spec:
...
conversion:
strategy: Webhook
webhookClientConfig:
url: "https://my-webhook.example.com:9443/my-webhook-path"
...
Service Reference
The service stanza inside webhookClientConfig is a reference to the service for a conversion webhook.
If the webhook is running within the cluster, then you should use service instead of url.
The service namespace and name are required. The port is optional and defaults to 443.
The path is optional and defaults to "/".
Here is an example of a webhook that is configured to call a service on port "1234"
at the subpath "/my-path", and to verify the TLS connection against the ServerName
my-service-name.my-service-namespace.svc
using a custom CA bundle.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
...
spec:
...
conversion:
strategy: Webhook
webhook:
clientConfig:
service:
namespace: my-service-namespace
name: my-service-name
path: /my-path
port: 1234
caBundle: "Ci0tLS0tQk...<base64-encoded PEM bundle>...tLS0K"
...
# Deprecated in v1.16 in favor of apiextensions.k8s.io/v1
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
...
spec:
...
conversion:
strategy: Webhook
webhookClientConfig:
service:
namespace: my-service-namespace
name: my-service-name
path: /my-path
port: 1234
caBundle: "Ci0tLS0tQk...<base64-encoded PEM bundle>...tLS0K"
...
Webhook request and response
Request
Webhooks are sent a POST request, with Content-Type: application/json, with a ConversionReview API object in the apiextensions.k8s.io API group serialized to JSON as the body.
Webhooks can specify what versions of ConversionReview objects they accept with the conversionReviewVersions field in their CustomResourceDefinition:
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
...
spec:
...
conversion:
strategy: Webhook
webhook:
conversionReviewVersions: ["v1", "v1beta1"]
...
conversionReviewVersions is a required field when creating apiextensions.k8s.io/v1 custom resource definitions. Webhooks are required to support at least one ConversionReview version understood by the current and previous API server.
# Deprecated in v1.16 in favor of apiextensions.k8s.io/v1
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
...
spec:
...
conversion:
strategy: Webhook
conversionReviewVersions: ["v1", "v1beta1"]
...
If no conversionReviewVersions are specified, the default when creating apiextensions.k8s.io/v1beta1 custom resource definitions is v1beta1.
API servers send the first ConversionReview version in the conversionReviewVersions list they support. If none of the versions in the list are supported by the API server, the custom resource definition will not be allowed to be created. If an API server encounters a conversion webhook configuration that was previously created and does not support any of the ConversionReview versions the API server knows how to send, attempts to call to the webhook will fail.
This example shows the data contained in a ConversionReview object for a request to convert CronTab objects to example.com/v1:
{
"apiVersion": "apiextensions.k8s.io/v1",
"kind": "ConversionReview",
"request": {
# Random uid uniquely identifying this conversion call
"uid": "705ab4f5-6393-11e8-b7cc-42010a800002",
# The API group and version the objects should be converted to
"desiredAPIVersion": "example.com/v1",
# The list of objects to convert.
# May contain one or more objects, in one or more versions.
"objects": [
{
"kind": "CronTab",
"apiVersion": "example.com/v1beta1",
"metadata": {
"creationTimestamp": "2019-09-04T14:03:02Z",
"name": "local-crontab",
"namespace": "default",
"resourceVersion": "143",
"uid": "3415a7fc-162b-4300-b5da-fd6083580d66"
},
"hostPort": "localhost:1234"
},
{
"kind": "CronTab",
"apiVersion": "example.com/v1beta1",
"metadata": {
"creationTimestamp": "2019-09-03T13:02:01Z",
"name": "remote-crontab",
"resourceVersion": "12893",
"uid": "359a83ec-b575-460d-b553-d859cedde8a0"
},
"hostPort": "example.com:2345"
}
]
}
}
{
# Deprecated in v1.16 in favor of apiextensions.k8s.io/v1
"apiVersion": "apiextensions.k8s.io/v1beta1",
"kind": "ConversionReview",
"request": {
# Random uid uniquely identifying this conversion call
"uid": "705ab4f5-6393-11e8-b7cc-42010a800002",
# The API group and version the objects should be converted to
"desiredAPIVersion": "example.com/v1",
# The list of objects to convert.
# May contain one or more objects, in one or more versions.
"objects": [
{
"kind": "CronTab",
"apiVersion": "example.com/v1beta1",
"metadata": {
"creationTimestamp": "2019-09-04T14:03:02Z",
"name": "local-crontab",
"namespace": "default",
"resourceVersion": "143",
"uid": "3415a7fc-162b-4300-b5da-fd6083580d66"
},
"hostPort": "localhost:1234"
},
{
"kind": "CronTab",
"apiVersion": "example.com/v1beta1",
"metadata": {
"creationTimestamp": "2019-09-03T13:02:01Z",
"name": "remote-crontab",
"resourceVersion": "12893",
"uid": "359a83ec-b575-460d-b553-d859cedde8a0"
},
"hostPort": "example.com:2345"
}
]
}
}
Response
Webhooks respond with a 200 HTTP status code, Content-Type: application/json, and a body containing a ConversionReview object (in the same version they were sent), with the response stanza populated, serialized to JSON.
If conversion succeeds, a webhook should return a response stanza containing the following fields:
- uid, copied from the request.uid sent to the webhook
- result, set to {"status":"Success"}
- convertedObjects, containing all of the objects from request.objects, converted to request.desiredVersion
Example of a minimal successful response from a webhook:
{
"apiVersion": "apiextensions.k8s.io/v1",
"kind": "ConversionReview",
"response": {
# must match <request.uid>
"uid": "705ab4f5-6393-11e8-b7cc-42010a800002",
"result": {
"status": "Success"
},
# Objects must match the order of request.objects, and have apiVersion set to <request.desiredAPIVersion>.
# kind, metadata.uid, metadata.name, and metadata.namespace fields must not be changed by the webhook.
# metadata.labels and metadata.annotations fields may be changed by the webhook.
# All other changes to metadata fields by the webhook are ignored.
"convertedObjects": [
{
"kind": "CronTab",
"apiVersion": "example.com/v1",
"metadata": {
"creationTimestamp": "2019-09-04T14:03:02Z",
"name": "local-crontab",
"namespace": "default",
"resourceVersion": "143",
"uid": "3415a7fc-162b-4300-b5da-fd6083580d66"
},
"host": "localhost",
"port": "1234"
},
{
"kind": "CronTab",
"apiVersion": "example.com/v1",
"metadata": {
"creationTimestamp": "2019-09-03T13:02:01Z",
"name": "remote-crontab",
"resourceVersion": "12893",
"uid": "359a83ec-b575-460d-b553-d859cedde8a0"
},
"host": "example.com",
"port": "2345"
}
]
}
}
{
# Deprecated in v1.16 in favor of apiextensions.k8s.io/v1
"apiVersion": "apiextensions.k8s.io/v1beta1",
"kind": "ConversionReview",
"response": {
# must match <request.uid>
"uid": "705ab4f5-6393-11e8-b7cc-42010a800002",
"result": {
"status": "Failed"
},
# Objects must match the order of request.objects, and have apiVersion set to <request.desiredAPIVersion>.
# kind, metadata.uid, metadata.name, and metadata.namespace fields must not be changed by the webhook.
# metadata.labels and metadata.annotations fields may be changed by the webhook.
# All other changes to metadata fields by the webhook are ignored.
"convertedObjects": [
{
"kind": "CronTab",
"apiVersion": "example.com/v1",
"metadata": {
"creationTimestamp": "2019-09-04T14:03:02Z",
"name": "local-crontab",
"namespace": "default",
"resourceVersion": "143",
"uid": "3415a7fc-162b-4300-b5da-fd6083580d66"
},
"host": "localhost",
"port": "1234"
},
{
"kind": "CronTab",
"apiVersion": "example.com/v1",
"metadata": {
"creationTimestamp": "2019-09-03T13:02:01Z",
"name": "remote-crontab",
"resourceVersion": "12893",
"uid": "359a83ec-b575-460d-b553-d859cedde8a0"
},
"host": "example.com",
"port": "2345"
}
]
}
}
If conversion fails, a webhook should return a response stanza containing the following fields:
- uid, copied from the request.uid sent to the webhook
- result, set to {"status":"Failed"}
Example of a response from a webhook indicating a conversion request failed, with an optional message:
{
"apiVersion": "apiextensions.k8s.io/v1",
"kind": "ConversionReview",
"response": {
"uid": "<value from request.uid>",
"result": {
"status": "Failed",
"message": "hostPort could not be parsed into a separate host and port"
}
}
}
{
# Deprecated in v1.16 in favor of apiextensions.k8s.io/v1
"apiVersion": "apiextensions.k8s.io/v1beta1",
"kind": "ConversionReview",
"response": {
"uid": "<value from request.uid>",
"result": {
"status": "Failed",
"message": "hostPort could not be parsed into a separate host and port"
}
}
}
Writing, reading, and updating versioned CustomResourceDefinition objects
When an object is written, it is persisted at the version designated as the storage version at the time of the write. If the storage version changes, existing objects are never converted automatically. However, newly-created or updated objects are written at the new storage version. It is possible for an object to have been written at a version that is no longer served.
When you read an object, you specify the version as part of the path. If you
specify a version that is different from the object's persisted version,
Kubernetes returns the object to you at the version you requested, but the
persisted object is neither changed on disk, nor converted in any way
(other than changing the apiVersion
string) while serving the request.
You can request an object at any version that is currently served.
If you update an existing object, it is rewritten at the version that is currently the storage version. This is the only way that objects can change from one version to another.
To illustrate this, consider the following hypothetical series of events:
- The storage version is v1beta1. You create an object. It is persisted in storage at version v1beta1.
- You add version v1 to your CustomResourceDefinition and designate it as the storage version.
- You read your object at version v1beta1, then you read the object again at version v1. Both returned objects are identical except for the apiVersion field.
- You create a new object. It is persisted in storage at version v1. You now have two objects, one of which is at v1beta1, and the other of which is at v1.
- You update the first object. It is now persisted at version v1 since that is the current storage version.
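To observe the version-on-read behavior from the sequence above, you can fetch the same object at each served version using the fully-qualified resource form; the object name my-crontab is hypothetical:
kubectl get crontabs.v1beta1.example.com my-crontab -o yaml
kubectl get crontabs.v1.example.com my-crontab -o yaml
# The two results differ only in the apiVersion field.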
Previous storage versions
The API server records each version which has ever been marked as the storage
version in the status field storedVersions
. Objects may have been persisted
at any version that has ever been designated as a storage version. No objects
can exist in storage at a version that has never been a storage version.
Upgrade existing objects to a new stored version
When deprecating versions and dropping support, select a storage upgrade procedure.
Option 1: Use the Storage Version Migrator
- Run the Storage Version Migrator.
- Remove the old version from the CustomResourceDefinition status.storedVersions field.
Option 2: Manually upgrade the existing objects to a new stored version
The following is an example procedure to upgrade from v1beta1 to v1.
- Set v1 as the storage version in the CustomResourceDefinition file and apply it using kubectl. The storedVersions is now v1beta1, v1.
- Write an upgrade procedure to list all existing objects and write them with the same content. This forces the backend to write objects in the current storage version, which is v1. (A minimal sketch follows this list.)
- Remove v1beta1 from the CustomResourceDefinition status.storedVersions field.
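As mentioned in the second step, a minimal sketch of the rewrite: reading every object and writing it back unchanged forces the backend to persist each one at the current storage version. This assumes the crontabs resource from this page and sufficient permissions:
kubectl get crontabs --all-namespaces -o yaml | kubectl replace -f -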
The kubectl tool currently cannot be used to edit or patch the status subresource on a CRD: see the Kubectl Subresource Support KEP for more details.
The easier way to patch the status subresource from the CLI is to interact directly with the API server using the curl tool, as in this example:
kubectl proxy &
curl --header "Content-Type: application/json-patch+json" \
--request PATCH http://localhost:8001/apis/apiextensions.k8s.io/v1/customresourcedefinitions/<your CRD name here>/status \
--data '[{"op": "replace", "path": "/status/storedVersions", "value":["v1"]}]'
4.11.3 - Set up an Extension API Server
Setting up an extension API server to work with the aggregation layer allows the Kubernetes apiserver to be extended with additional APIs, which are not part of the core Kubernetes APIs.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
To check the version, enter kubectl version.
- You must configure the aggregation layer and enable the apiserver flags.
Set up an extension api-server to work with the aggregation layer
The following steps describe how to set up an extension-apiserver at a high level. These steps apply regardless of whether you're using YAML configs or APIs. An attempt is made to specifically identify any differences between the two. For a concrete example of how they can be implemented using YAML configs, you can look at the sample-apiserver in the Kubernetes repo.
Alternatively, you can use an existing 3rd party solution, such as apiserver-builder, which should generate a skeleton and automate all of the following steps for you.
- Make sure the APIService API is enabled (check
--runtime-config
). It should be on by default, unless it's been deliberately turned off in your cluster. - You may need to make an RBAC rule allowing you to add APIService objects, or get your cluster administrator to make one. (Since API extensions affect the entire cluster, it is not recommended to do testing/development/debug of an API extension in a live cluster.)
- Create the Kubernetes namespace you want to run your extension api-service in.
- Create/get a CA cert to be used to sign the server cert the extension api-server uses for HTTPS.
- Create a server cert/key for the api-server to use for HTTPS. This cert should be signed by the above CA. It should also have a CN of the Kube DNS name. This is derived from the Kubernetes service and is of the form <service name>.<service name namespace>.svc
- Create a Kubernetes secret with the server cert/key in your namespace.
- Create a Kubernetes deployment for the extension api-server and make sure you are loading the secret as a volume. It should contain a reference to a working image of your extension api-server. The deployment should also be in your namespace.
- Make sure that your extension-apiserver loads those certs from that volume and that they are used in the HTTPS handshake.
- Create a Kubernetes service account in your namespace.
- Create a Kubernetes cluster role for the operations you want to allow on your resources.
- Create a Kubernetes cluster role binding from the service account in your namespace to the cluster role you created.
- Create a Kubernetes cluster role binding from the service account in your namespace to the
system:auth-delegator
cluster role to delegate auth decisions to the Kubernetes core API server. - Create a Kubernetes role binding from the service account in your namespace to the
extension-apiserver-authentication-reader
role. This allows your extension api-server to access theextension-apiserver-authentication
configmap. - Create a Kubernetes apiservice. The CA cert above should be base64 encoded, stripped of new lines and used as the spec.caBundle in the apiservice. This should not be namespaced. If using the kube-aggregator API, only pass in the PEM encoded CA bundle because the base 64 encoding is done for you.
- Use kubectl to get your resource. When run, kubectl should return "No resources found.". This message indicates that everything worked but you currently have no objects of that resource type created.
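For the apiservice step above, a hypothetical APIService object might look like the following sketch; every name and the caBundle value are placeholders:
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1alpha1.myapi.example.com
spec:
  group: myapi.example.com
  version: v1alpha1
  groupPriorityMinimum: 1000
  versionPriority: 15
  service:
    # The Service fronting your extension api-server, in your namespace
    name: my-extension-apiserver
    namespace: my-namespace
    port: 443
  # Base64-encoded CA cert used to verify the extension api-server's serving certificate
  caBundle: "<base64-encoded CA certificate>"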
What's next
- Walk through the steps to configure the API aggregation layer and enable the apiserver flags.
- For a high level overview, see Extending the Kubernetes API with the aggregation layer.
- Learn how to Extend the Kubernetes API using Custom Resource Definitions.
4.11.4 - Configure Multiple Schedulers
Kubernetes ships with a default scheduler that is described here. If the default scheduler does not suit your needs you can implement your own scheduler. Moreover, you can even run multiple schedulers simultaneously alongside the default scheduler and instruct Kubernetes what scheduler to use for each of your pods. Let's learn how to run multiple schedulers in Kubernetes with an example.
A detailed description of how to implement a scheduler is outside the scope of this document. Please refer to the kube-scheduler implementation in pkg/scheduler in the Kubernetes source directory for a canonical example.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
To check the version, enter kubectl version.
Package the scheduler
Package your scheduler binary into a container image. For the purposes of this example, you can use the default scheduler (kube-scheduler) as your second scheduler. Clone the Kubernetes source code from GitHub and build the source.
git clone https://github.com/kubernetes/kubernetes.git
cd kubernetes
make
Create a container image containing the kube-scheduler binary. Here is the Dockerfile
to build the image:
FROM busybox
ADD ./_output/local/bin/linux/amd64/kube-scheduler /usr/local/bin/kube-scheduler
Save the file as Dockerfile
, build the image and push it to a registry. This example
pushes the image to
Google Container Registry (GCR).
For more details, please read the GCR
documentation.
docker build -t gcr.io/my-gcp-project/my-kube-scheduler:1.0 .
gcloud docker -- push gcr.io/my-gcp-project/my-kube-scheduler:1.0
Define a Kubernetes Deployment for the scheduler
Now that you have your scheduler in a container image, create a pod
configuration for it and run it in your Kubernetes cluster. But instead of creating a pod
directly in the cluster, you can use a Deployment
for this example. A Deployment manages a
Replica Set which in turn manages the pods,
thereby making the scheduler resilient to failures. Here is the deployment
config. Save it as my-scheduler.yaml
:
apiVersion: v1
kind: ServiceAccount
metadata:
name: my-scheduler
namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: my-scheduler-as-kube-scheduler
subjects:
- kind: ServiceAccount
name: my-scheduler
namespace: kube-system
roleRef:
kind: ClusterRole
name: system:kube-scheduler
apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: my-scheduler-as-volume-scheduler
subjects:
- kind: ServiceAccount
name: my-scheduler
namespace: kube-system
roleRef:
kind: ClusterRole
name: system:volume-scheduler
apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: ConfigMap
metadata:
name: my-scheduler-config
namespace: kube-system
data:
my-scheduler-config.yaml: |
apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: my-scheduler
leaderElection:
leaderElect: false
---
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
component: scheduler
tier: control-plane
name: my-scheduler
namespace: kube-system
spec:
selector:
matchLabels:
component: scheduler
tier: control-plane
replicas: 1
template:
metadata:
labels:
component: scheduler
tier: control-plane
version: second
spec:
serviceAccountName: my-scheduler
containers:
- command:
- /usr/local/bin/kube-scheduler
- --config=/etc/kubernetes/my-scheduler/my-scheduler-config.yaml
image: gcr.io/my-gcp-project/my-kube-scheduler:1.0
livenessProbe:
httpGet:
path: /healthz
port: 10259
scheme: HTTPS
initialDelaySeconds: 15
name: kube-second-scheduler
readinessProbe:
httpGet:
path: /healthz
port: 10259
scheme: HTTPS
resources:
requests:
cpu: '0.1'
securityContext:
privileged: false
volumeMounts:
- name: config-volume
mountPath: /etc/kubernetes/my-scheduler
hostNetwork: false
hostPID: false
volumes:
- name: config-volume
configMap:
name: my-scheduler-config
In the above manifest, you use a KubeSchedulerConfiguration to customize the behavior of your scheduler implementation. This configuration is passed to the kube-scheduler during initialization with the --config option. The my-scheduler-config ConfigMap stores the configuration file, and the Pod of the my-scheduler Deployment mounts the my-scheduler-config ConfigMap as a volume.
In the aforementioned Scheduler Configuration, your scheduler implementation is represented via a KubeSchedulerProfile.
The spec.schedulerName field in a PodTemplate or Pod manifest must match the schedulerName field of the KubeSchedulerProfile.
All schedulers running in the cluster must have unique names.
Also, note that you create a dedicated service account my-scheduler and bind the ClusterRole system:kube-scheduler to it so that it can acquire the same privileges as kube-scheduler.
Please see the kube-scheduler documentation for a detailed description of other command line arguments and the Scheduler Configuration reference for a detailed description of other customizable kube-scheduler configurations.
Run the second scheduler in the cluster
In order to run your scheduler in a Kubernetes cluster, create the deployment specified in the config above in a Kubernetes cluster:
kubectl create -f my-scheduler.yaml
Verify that the scheduler pod is running:
kubectl get pods --namespace=kube-system
NAME READY STATUS RESTARTS AGE
....
my-scheduler-lnf4s-4744f 1/1 Running 0 2m
...
You should see a "Running" my-scheduler pod, in addition to the default kube-scheduler pod in this list.
Enable leader election
To run multiple schedulers with leader election enabled, you must do the following:
Update the following fields for the KubeSchedulerConfiguration in the my-scheduler-config ConfigMap in your YAML file:
- leaderElection.leaderElect to true
- leaderElection.resourceNamespace to <lock-object-namespace>
- leaderElection.resourceName to <lock-object-name>
The control plane creates the lock objects for you, but the namespace must already exist. You can use the kube-system namespace.
If RBAC is enabled on your cluster, you must update the system:kube-scheduler
cluster role.
Add your scheduler name to the resourceNames of the rule applied for endpoints
and leases
resources, as in the following example:
kubectl edit clusterrole system:kube-scheduler
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
annotations:
rbac.authorization.kubernetes.io/autoupdate: "true"
labels:
kubernetes.io/bootstrapping: rbac-defaults
name: system:kube-scheduler
rules:
- apiGroups:
- coordination.k8s.io
resources:
- leases
verbs:
- create
- apiGroups:
- coordination.k8s.io
resourceNames:
- kube-scheduler
- my-scheduler
resources:
- leases
verbs:
- get
- update
- apiGroups:
- ""
resourceNames:
- kube-scheduler
- my-scheduler
resources:
- endpoints
verbs:
- delete
- get
- patch
- update
Specify schedulers for pods
Now that your second scheduler is running, create some pods, and direct them to be scheduled by either the default scheduler or the one you deployed. In order to schedule a given pod using a specific scheduler, specify the name of the scheduler in that pod spec. Let's look at three examples.
- Pod spec without any scheduler name
  apiVersion: v1
  kind: Pod
  metadata:
    name: no-annotation
    labels:
      name: multischeduler-example
  spec:
    containers:
    - name: pod-with-no-annotation-container
      image: k8s.gcr.io/pause:2.0
  When no scheduler name is supplied, the pod is automatically scheduled using the default-scheduler.
  Save this file as pod1.yaml and submit it to the Kubernetes cluster.
  kubectl create -f pod1.yaml
- Pod spec with default-scheduler
  apiVersion: v1
  kind: Pod
  metadata:
    name: annotation-default-scheduler
    labels:
      name: multischeduler-example
  spec:
    schedulerName: default-scheduler
    containers:
    - name: pod-with-default-annotation-container
      image: k8s.gcr.io/pause:2.0
  A scheduler is specified by supplying the scheduler name as a value to spec.schedulerName. In this case, we supply the name of the default scheduler, which is default-scheduler.
  Save this file as pod2.yaml and submit it to the Kubernetes cluster.
  kubectl create -f pod2.yaml
- Pod spec with my-scheduler
  apiVersion: v1
  kind: Pod
  metadata:
    name: annotation-second-scheduler
    labels:
      name: multischeduler-example
  spec:
    schedulerName: my-scheduler
    containers:
    - name: pod-with-second-annotation-container
      image: k8s.gcr.io/pause:2.0
  In this case, we specify that this pod should be scheduled using the scheduler that we deployed, my-scheduler. Note that the value of spec.schedulerName should match the name supplied for the scheduler in the schedulerName field of the mapping KubeSchedulerProfile.
  Save this file as pod3.yaml and submit it to the Kubernetes cluster.
  kubectl create -f pod3.yaml
Verify that all three pods are running.
kubectl get pods
Verifying that the pods were scheduled using the desired schedulers
In order to make it easier to work through these examples, we did not verify that the
pods were actually scheduled using the desired schedulers. We can verify that by
changing the order of pod and deployment config submissions above. If we submit all the
pod configs to a Kubernetes cluster before submitting the scheduler deployment config,
we see that the pod annotation-second-scheduler
remains in "Pending" state forever
while the other two pods get scheduled. Once we submit the scheduler deployment config
and our new scheduler starts running, the annotation-second-scheduler
pod gets
scheduled as well.
Alternatively, you can look at the "Scheduled" entries in the event logs to verify that the pods were scheduled by the desired schedulers.
kubectl get events
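To narrow the event list to a single pod, you can add a field selector; the pod name is from the third example above, and reason=Scheduled matches the scheduler's binding events:
kubectl get events --field-selector reason=Scheduled,involvedObject.name=annotation-second-scheduler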
You can also use a custom scheduler configuration or a custom container image for the cluster's main scheduler by modifying its static pod manifest on the relevant control plane nodes.
4.11.5 - Use an HTTP Proxy to Access the Kubernetes API
This page shows how to use an HTTP proxy to access the Kubernetes API.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
To check the version, enter kubectl version.
If you do not already have an application running in your cluster, start a Hello world application by entering this command:
kubectl create deployment node-hello --image=gcr.io/google-samples/node-hello:1.0 --port=8080
Using kubectl to start a proxy server
This command starts a proxy to the Kubernetes API server:
kubectl proxy --port=8080
Exploring the Kubernetes API
When the proxy server is running, you can explore the API using curl
, wget
,
or a browser.
Get the API versions:
curl http://localhost:8080/api/
The output should look similar to this:
{
"kind": "APIVersions",
"versions": [
"v1"
],
"serverAddressByClientCIDRs": [
{
"clientCIDR": "0.0.0.0/0",
"serverAddress": "10.0.2.15:8443"
}
]
}
Get a list of pods:
curl http://localhost:8080/api/v1/namespaces/default/pods
The output should look similar to this:
{
"kind": "PodList",
"apiVersion": "v1",
"metadata": {
"resourceVersion": "33074"
},
"items": [
{
"metadata": {
"name": "kubernetes-bootcamp-2321272333-ix8pt",
"generateName": "kubernetes-bootcamp-2321272333-",
"namespace": "default",
"uid": "ba21457c-6b1d-11e6-85f7-1ef9f1dab92b",
"resourceVersion": "33003",
"creationTimestamp": "2016-08-25T23:43:30Z",
"labels": {
"pod-template-hash": "2321272333",
"run": "kubernetes-bootcamp"
},
...
}
What's next
Learn more about kubectl proxy.
4.11.6 - Set up Konnectivity service
The Konnectivity service provides a TCP level proxy for the control plane to cluster communication.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube.
Configure the Konnectivity service
The following steps require an egress configuration, for example:
apiVersion: apiserver.k8s.io/v1beta1
kind: EgressSelectorConfiguration
egressSelections:
# Since we want to control the egress traffic to the cluster, we use the
# "cluster" as the name. Other supported values are "etcd", and "master".
- name: cluster
connection:
# This controls the protocol between the API Server and the Konnectivity
# server. Supported values are "GRPC" and "HTTPConnect". There is no
# end user visible difference between the two modes. You need to set the
# Konnectivity server to work in the same mode.
proxyProtocol: GRPC
transport:
# This controls what transport the API Server uses to communicate with the
# Konnectivity server. UDS is recommended if the Konnectivity server
# locates on the same machine as the API Server. You need to configure the
# Konnectivity server to listen on the same UDS socket.
# The other supported transport is "tcp". You will need to set up TLS
# config to secure the TCP transport.
uds:
udsName: /etc/kubernetes/konnectivity-server/konnectivity-server.socket
You need to configure the API Server to use the Konnectivity service and direct the network traffic to the cluster nodes:
- Make sure that the Service Account Token Volume Projection feature is enabled in your cluster. It is enabled by default since Kubernetes v1.20.
- Create an egress configuration file such as admin/konnectivity/egress-selector-configuration.yaml.
- Set the --egress-selector-config-file flag of the API Server to the path of your API Server egress configuration file.
- If you use a UDS connection, add volumes config to the kube-apiserver:
spec:
  containers:
    volumeMounts:
    - name: konnectivity-uds
      mountPath: /etc/kubernetes/konnectivity-server
      readOnly: false
  volumes:
  - name: konnectivity-uds
    hostPath:
      path: /etc/kubernetes/konnectivity-server
      type: DirectoryOrCreate
Generate or obtain a certificate and kubeconfig for konnectivity-server. For example, you can use the OpenSSL command line tool to issue an X.509 certificate, using the cluster CA certificate /etc/kubernetes/pki/ca.crt from a control-plane host.
openssl req -subj "/CN=system:konnectivity-server" -new -newkey rsa:2048 -nodes -keyout konnectivity.key -out konnectivity.csr
openssl x509 -req -in konnectivity.csr -CA /etc/kubernetes/pki/ca.crt -CAkey /etc/kubernetes/pki/ca.key -CAcreateserial -out konnectivity.crt -days 375 -sha256
SERVER=$(kubectl config view -o jsonpath='{.clusters..server}')
kubectl --kubeconfig /etc/kubernetes/konnectivity-server.conf config set-credentials system:konnectivity-server --client-certificate konnectivity.crt --client-key konnectivity.key --embed-certs=true
kubectl --kubeconfig /etc/kubernetes/konnectivity-server.conf config set-cluster kubernetes --server "$SERVER" --certificate-authority /etc/kubernetes/pki/ca.crt --embed-certs=true
kubectl --kubeconfig /etc/kubernetes/konnectivity-server.conf config set-context system:konnectivity-server@kubernetes --cluster kubernetes --user system:konnectivity-server
kubectl --kubeconfig /etc/kubernetes/konnectivity-server.conf config use-context system:konnectivity-server@kubernetes
rm -f konnectivity.crt konnectivity.key konnectivity.csr
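As an optional sanity check (a sketch, not part of the required steps), you can decode the client certificate embedded in the generated kubeconfig and confirm its subject and expiry:
# Extract the embedded client certificate and inspect it; the subject should be
# CN = system:konnectivity-server.
kubectl --kubeconfig /etc/kubernetes/konnectivity-server.conf config view --raw \
  -o jsonpath='{.users[0].user.client-certificate-data}' \
  | base64 --decode | openssl x509 -noout -subject -enddate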
Next, you need to deploy the Konnectivity server and agents. kubernetes-sigs/apiserver-network-proxy is a reference implementation.
Deploy the Konnectivity server on your control plane node. The provided konnectivity-server.yaml manifest assumes that the Kubernetes components are deployed as static Pods in your cluster. If not, you can deploy the Konnectivity server as a DaemonSet.
apiVersion: v1
kind: Pod
metadata:
name: konnectivity-server
namespace: kube-system
spec:
priorityClassName: system-cluster-critical
hostNetwork: true
containers:
- name: konnectivity-server-container
image: us.gcr.io/k8s-artifacts-prod/kas-network-proxy/proxy-server:v0.0.16
command: ["/proxy-server"]
args: [
"--logtostderr=true",
# This needs to be consistent with the value set in egressSelectorConfiguration.
"--uds-name=/etc/kubernetes/konnectivity-server/konnectivity-server.socket",
# The following two lines assume the Konnectivity server is
# deployed on the same machine as the apiserver, and the certs and
# key of the API Server are at the specified location.
"--cluster-cert=/etc/kubernetes/pki/apiserver.crt",
"--cluster-key=/etc/kubernetes/pki/apiserver.key",
# This needs to be consistent with the value set in egressSelectorConfiguration.
"--mode=grpc",
"--server-port=0",
"--agent-port=8132",
"--admin-port=8133",
"--health-port=8134",
"--agent-namespace=kube-system",
"--agent-service-account=konnectivity-agent",
"--kubeconfig=/etc/kubernetes/konnectivity-server.conf",
"--authentication-audience=system:konnectivity-server"
]
livenessProbe:
httpGet:
scheme: HTTP
host: 127.0.0.1
port: 8134
path: /healthz
initialDelaySeconds: 30
timeoutSeconds: 60
ports:
- name: agentport
containerPort: 8132
hostPort: 8132
- name: adminport
containerPort: 8133
hostPort: 8133
- name: healthport
containerPort: 8134
hostPort: 8134
volumeMounts:
- name: k8s-certs
mountPath: /etc/kubernetes/pki
readOnly: true
- name: kubeconfig
mountPath: /etc/kubernetes/konnectivity-server.conf
readOnly: true
- name: konnectivity-uds
mountPath: /etc/kubernetes/konnectivity-server
readOnly: false
volumes:
- name: k8s-certs
hostPath:
path: /etc/kubernetes/pki
- name: kubeconfig
hostPath:
path: /etc/kubernetes/konnectivity-server.conf
type: FileOrCreate
- name: konnectivity-uds
hostPath:
path: /etc/kubernetes/konnectivity-server
type: DirectoryOrCreate
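If your control plane runs as static Pods (as this manifest assumes), one way to deploy the server is to copy the manifest into the kubelet's static Pod directory on each control plane node; the path below is the kubeadm default and is an assumption about your layout:
# The kubelet watches this directory and starts the Pod automatically.
sudo cp konnectivity-server.yaml /etc/kubernetes/manifests/konnectivity-server.yaml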
Then deploy the Konnectivity agents in your cluster:
apiVersion: apps/v1
# Alternatively, you can deploy the agents as Deployments. It is not necessary
# to have an agent on each node.
kind: DaemonSet
metadata:
labels:
addonmanager.kubernetes.io/mode: Reconcile
k8s-app: konnectivity-agent
namespace: kube-system
name: konnectivity-agent
spec:
selector:
matchLabels:
k8s-app: konnectivity-agent
template:
metadata:
labels:
k8s-app: konnectivity-agent
spec:
priorityClassName: system-cluster-critical
tolerations:
- key: "CriticalAddonsOnly"
operator: "Exists"
containers:
- image: us.gcr.io/k8s-artifacts-prod/kas-network-proxy/proxy-agent:v0.0.16
name: konnectivity-agent
command: ["/proxy-agent"]
args: [
"--logtostderr=true",
"--ca-cert=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt",
# Since the konnectivity server runs with hostNetwork=true,
# this is the IP address of the master machine.
"--proxy-server-host=35.225.206.7",
"--proxy-server-port=8132",
"--admin-server-port=8133",
"--health-server-port=8134",
"--service-account-token-path=/var/run/secrets/tokens/konnectivity-agent-token"
]
volumeMounts:
- mountPath: /var/run/secrets/tokens
name: konnectivity-agent-token
livenessProbe:
httpGet:
port: 8134
path: /healthz
initialDelaySeconds: 15
timeoutSeconds: 15
serviceAccountName: konnectivity-agent
volumes:
- name: konnectivity-agent-token
projected:
sources:
- serviceAccountToken:
path: konnectivity-agent-token
audience: system:konnectivity-server
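Once applied, a quick check (a sketch using the label from the manifest above) is to confirm that an agent Pod is running and Ready on each node:
kubectl get pods -n kube-system -l k8s-app=konnectivity-agent -o wide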
Lastly, if RBAC is enabled in your cluster, create the relevant RBAC rules:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: system:konnectivity-server
labels:
kubernetes.io/cluster-service: "true"
addonmanager.kubernetes.io/mode: Reconcile
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: system:auth-delegator
subjects:
- apiGroup: rbac.authorization.k8s.io
kind: User
name: system:konnectivity-server
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: konnectivity-agent
namespace: kube-system
labels:
kubernetes.io/cluster-service: "true"
addonmanager.kubernetes.io/mode: Reconcile
4.12 - TLS
4.12.1 - Configure Certificate Rotation for the Kubelet
This page shows how to enable and configure certificate rotation for the kubelet.
Kubernetes v1.19 [stable]
Before you begin
- Kubernetes version 1.8.0 or later is required
Overview
The kubelet uses certificates for authenticating to the Kubernetes API. By default, these certificates are issued with a one-year expiration so that they do not need to be renewed too frequently.
Kubernetes includes kubelet certificate rotation, which automatically generates a new key and requests a new certificate from the Kubernetes API as the current certificate approaches expiration. Once the new certificate is available, it is used for authenticating connections to the Kubernetes API.
Enabling client certificate rotation
The kubelet process accepts an argument --rotate-certificates that controls whether the kubelet will automatically request a new certificate as the expiration of the certificate currently in use approaches.
The kube-controller-manager process accepts an argument --cluster-signing-duration (--experimental-cluster-signing-duration prior to 1.19) that controls how long certificates will be issued for.
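As an illustrative sketch (the values shown are assumptions, not recommendations), the two settings look like this on the respective components:
# kubelet: automatically request a new certificate as expiry approaches.
# (Other required flags omitted for brevity.)
kubelet --rotate-certificates=true
# kube-controller-manager: issue certificates valid for roughly one year.
kube-controller-manager --cluster-signing-duration=8760h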
Understanding the certificate rotation configuration
When a kubelet starts up, if it is configured to bootstrap (using the --bootstrap-kubeconfig flag), it will use its initial certificate to connect to the Kubernetes API and issue a certificate signing request. You can view the status of certificate signing requests using:
kubectl get csr
Initially a certificate signing request from the kubelet on a node will have a status of Pending. If the certificate signing request meets specific criteria, it will be auto approved by the controller manager, and then it will have a status of Approved. Next, the controller manager will sign a certificate, issued for the duration specified by the --cluster-signing-duration parameter, and the signed certificate will be attached to the certificate signing request.
The kubelet will retrieve the signed certificate from the Kubernetes API and write that to disk, in the location specified by --cert-dir. Then the kubelet will use the new certificate to connect to the Kubernetes API.
As the expiration of the signed certificate approaches, the kubelet will automatically issue a new certificate signing request, using the Kubernetes API. This can happen at any point between 30% and 10% of the time remaining on the certificate. Again, the controller manager will automatically approve the certificate request and attach a signed certificate to the certificate signing request. The kubelet will retrieve the new signed certificate from the Kubernetes API and write that to disk. Then it will update the connections it has to the Kubernetes API to reconnect using the new certificate.
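To observe this on a node, one hedged check (the path below is the kubeadm default --cert-dir and may differ in your cluster) is to inspect the expiry of the kubelet's current client certificate:
# kubelet-client-current.pem is a symlink to the most recently issued certificate.
openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -subject -enddate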
4.12.2 - Manage TLS Certificates in a Cluster
Kubernetes provides a certificates.k8s.io API, which lets you provision TLS certificates signed by a Certificate Authority (CA) that you control. This CA and the certificates it signs can be used by your workloads to establish trust.
The certificates.k8s.io API uses a protocol that is similar to the ACME draft.
Certificates created using the certificates.k8s.io API are signed by a dedicated CA. It is possible to configure your cluster to use the cluster root CA for this purpose, but you should never rely on this. Do not assume that these certificates will validate against the cluster root CA.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
You need the cfssl tool. You can download cfssl from https://github.com/cloudflare/cfssl/releases.
Some steps in this page use the jq tool. If you don't have jq, you can install it via your operating system's software sources, or fetch it from https://stedolan.github.io/jq/.
Trusting TLS in a cluster
Trusting the custom CA from an application running as a pod usually requires some extra application configuration. You will need to add the CA certificate bundle to the list of CA certificates that the TLS client or server trusts. For example, you would do this with a golang TLS config by parsing the certificate chain and adding the parsed certificates to the RootCAs field in the tls.Config struct.
Even though the custom CA certificate may be included in the filesystem (in the ConfigMap kube-root-ca.crt), you should not use that certificate authority for any purpose other than to verify internal Kubernetes endpoints. An example of an internal Kubernetes endpoint is the Service named kubernetes in the default namespace.
If you want to use a custom certificate authority for your workloads, you should generate that CA separately, and distribute its CA certificate using a ConfigMap that your pods have access to read.
Requesting a certificate
The following section demonstrates how to create a TLS certificate for a Kubernetes service accessed through DNS.
Create a certificate signing request
Generate a private key and certificate signing request (or CSR) by running the following command:
cat <<EOF | cfssl genkey - | cfssljson -bare server
{
"hosts": [
"my-svc.my-namespace.svc.cluster.local",
"my-pod.my-namespace.pod.cluster.local",
"192.0.2.24",
"10.0.34.2"
],
"CN": "my-pod.my-namespace.pod.cluster.local",
"key": {
"algo": "ecdsa",
"size": 256
}
}
EOF
Where 192.0.2.24 is the service's cluster IP, my-svc.my-namespace.svc.cluster.local is the service's DNS name, 10.0.34.2 is the pod's IP and my-pod.my-namespace.pod.cluster.local is the pod's DNS name. You should see output similar to:
2022/02/01 11:45:32 [INFO] generate received request
2022/02/01 11:45:32 [INFO] received CSR
2022/02/01 11:45:32 [INFO] generating key: ecdsa-256
2022/02/01 11:45:32 [INFO] encoded CSR
This command generates two files; it generates server.csr containing the PEM encoded PKCS#10 certification request, and server-key.pem containing the PEM encoded key to the certificate that is still to be created.
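Before submitting the request, you can optionally confirm that the CSR carries the expected names (a quick check, not part of the required steps):
# Look for the CN and the DNS/IP subject alternative names in the output.
openssl req -in server.csr -noout -text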
Create a CertificateSigningRequest object to send to the Kubernetes API
Generate a CSR manifest (in YAML), and send it to the API server. You can do that by running the following command:
cat <<EOF | kubectl apply -f -
apiVersion: certificates.k8s.io/v1
kind: CertificateSigningRequest
metadata:
name: my-svc.my-namespace
spec:
request: $(cat server.csr | base64 | tr -d '\n')
signerName: example.com/serving
usages:
- digital signature
- key encipherment
- server auth
EOF
Notice that the server.csr file created in step 1 is base64 encoded and stashed in the .spec.request field. You are also requesting a certificate with the "digital signature", "key encipherment", and "server auth" key usages, signed by an example example.com/serving signer.
A specific signerName must be requested. View the documentation for supported signer names for more information.
The CSR should now be visible from the API in a Pending state. You can see it by running:
kubectl describe csr my-svc.my-namespace
Name: my-svc.my-namespace
Labels: <none>
Annotations: <none>
CreationTimestamp: Tue, 01 Feb 2022 11:49:15 -0500
Requesting User: yourname@example.com
Signer: example.com/serving
Status: Pending
Subject:
Common Name: my-pod.my-namespace.pod.cluster.local
Serial Number:
Subject Alternative Names:
DNS Names: my-pod.my-namespace.pod.cluster.local
my-svc.my-namespace.svc.cluster.local
IP Addresses: 192.0.2.24
10.0.34.2
Events: <none>
Get the CertificateSigningRequest approved
Approving the certificate signing request is either done by an automated approval process or on a one-off basis by a cluster administrator. If you're authorized to approve a certificate request, you can do that manually using kubectl; for example:
kubectl certificate approve my-svc.my-namespace
certificatesigningrequest.certificates.k8s.io/my-svc.my-namespace approved
You should now see the following:
kubectl get csr
NAME AGE SIGNERNAME REQUESTOR REQUESTEDDURATION CONDITION
my-svc.my-namespace 10m example.com/serving yourname@example.com <none> Approved
This means the certificate request has been approved and is waiting for the requested signer to sign it.
Sign the CertificateSigningRequest
Next, you'll play the part of a certificate signer, issue the certificate, and upload it to the API.
A signer would typically watch the CertificateSigningRequest API for objects with its signerName, check that they have been approved, sign certificates for those requests, and update the API object status with the issued certificate.
Create a Certificate Authority
You need an authority to provide the digital signature on the new certificate.
First, create a signing certificate by running the following:
cat <<EOF | cfssl gencert -initca - | cfssljson -bare ca
{
"CN": "My Example Signer",
"key": {
"algo": "rsa",
"size": 2048
}
}
EOF
You should see output similar to:
2022/02/01 11:50:39 [INFO] generating a new CA key and certificate from CSR
2022/02/01 11:50:39 [INFO] generate received request
2022/02/01 11:50:39 [INFO] received CSR
2022/02/01 11:50:39 [INFO] generating key: rsa-2048
2022/02/01 11:50:39 [INFO] encoded CSR
2022/02/01 11:50:39 [INFO] signed certificate with serial number 263983151013686720899716354349605500797834580472
This produces a certificate authority key file (ca-key.pem) and certificate (ca.pem).
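If you want to double-check the new CA before using it (optional), inspect its subject and validity window:
openssl x509 -in ca.pem -noout -subject -dates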
Issue a certificate
Create a server-signing-config.json file with the following signing configuration:
{
"signing": {
"default": {
"usages": [
"digital signature",
"key encipherment",
"server auth"
],
"expiry": "876000h",
"ca_constraint": {
"is_ca": false
}
}
}
}
Use the server-signing-config.json signing configuration and the certificate authority key file and certificate to sign the certificate request:
kubectl get csr my-svc.my-namespace -o jsonpath='{.spec.request}' | \
base64 --decode | \
cfssl sign -ca ca.pem -ca-key ca-key.pem -config server-signing-config.json - | \
cfssljson -bare ca-signed-server
You should see output similar to:
2022/02/01 11:52:26 [INFO] signed certificate with serial number 576048928624926584381415936700914530534472870337
This produces a signed serving certificate file, ca-signed-server.pem.
Upload the signed certificate
Finally, populate the signed certificate in the API object's status:
kubectl get csr my-svc.my-namespace -o json | \
jq '.status.certificate = "'$(base64 ca-signed-server.pem | tr -d '\n')'"' | \
kubectl replace --raw /apis/certificates.k8s.io/v1/certificatesigningrequests/my-svc.my-namespace/status -f -
This command uses jq to populate the base64-encoded content in the .status.certificate field. If you do not have jq, you can also save the JSON output to a file, populate this field manually, and upload the resulting file.
Once the CSR is approved and the signed certificate is uploaded, run:
kubectl get csr
The output is similar to:
NAME AGE SIGNERNAME REQUESTOR REQUESTEDDURATION CONDITION
my-svc.my-namespace 20m example.com/serving yourname@example.com <none> Approved,Issued
Download the certificate and use it
Now, as the requesting user, you can download the issued certificate and save it to a server.crt file by running the following:
kubectl get csr my-svc.my-namespace -o jsonpath='{.status.certificate}' \
| base64 --decode > server.crt
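As an optional check, you can verify that the downloaded certificate chains to the CA you created earlier:
# Should print: server.crt: OK
openssl verify -CAfile ca.pem server.crt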
Now you can populate server.crt and server-key.pem in a Secret that you could later mount into a Pod (for example, to use with a webserver that serves HTTPS).
kubectl create secret tls server --cert server.crt --key server-key.pem
secret/server created
Finally, you can populate ca.pem into a ConfigMap and use it as the trust root to verify the serving certificate:
kubectl create configmap example-serving-ca --from-file ca.crt=ca.pem
configmap/example-serving-ca created
Approving CertificateSigningRequests
A Kubernetes administrator (with appropriate permissions) can manually approve (or deny) CertificateSigningRequests by using the kubectl certificate approve and kubectl certificate deny commands. However, if you intend to make heavy usage of this API, you might consider writing an automated certificates controller.
The ability to approve CSRs decides who trusts whom within your environment. The ability to approve CSRs should not be granted broadly or lightly.
You should make sure that you confidently understand both the verification requirements that fall on the approver and the repercussions of issuing a specific certificate before you grant the approve permission.
Whether a machine or a human using kubectl as above, the role of the approver is to verify that the CSR satisfies two requirements:
- The subject of the CSR controls the private key used to sign the CSR. This addresses the threat of a third party masquerading as an authorized subject. In the above example, this step would be to verify that the pod controls the private key used to generate the CSR.
- The subject of the CSR is authorized to act in the requested context. This addresses the threat of an undesired subject joining the cluster. In the above example, this step would be to verify that the pod is allowed to participate in the requested service.
If and only if these two requirements are met, the approver should approve the CSR and otherwise should deny the CSR.
For more information on certificate approval and access control, read the Certificate Signing Requests reference page.
Configuring your cluster to provide signing
This page assumes that a signer is set up to serve the certificates API. The Kubernetes controller manager provides a default implementation of a signer. To enable it, pass the --cluster-signing-cert-file and --cluster-signing-key-file parameters to the controller manager with paths to your Certificate Authority's keypair.
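As a minimal sketch, assuming a kubeadm-style layout where the cluster CA keypair lives under /etc/kubernetes/pki, the flags look like this:
# Other controller-manager flags omitted for brevity.
kube-controller-manager \
  --cluster-signing-cert-file=/etc/kubernetes/pki/ca.crt \
  --cluster-signing-key-file=/etc/kubernetes/pki/ca.key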
4.12.3 - Manual Rotation of CA Certificates
This page shows how to manually rotate the certificate authority (CA) certificates.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
Your Kubernetes server must be at or later than version v1.13. To check the version, enter kubectl version.
- For more information about authentication in Kubernetes, see Authenticating.
- For more information about best practices for CA certificates, see Single root CA.
Rotate the CA certificates manually
Make sure to back up your certificate directory along with configuration files and any other necessary files.
This approach assumes operation of the Kubernetes control plane in an HA configuration with multiple API servers. Graceful termination of the API server is also assumed, so clients can cleanly disconnect from one API server and reconnect to another.
Configurations with a single API server will experience unavailability while the API server is being restarted.
- Distribute the new CA certificates and private keys (for example ca.crt, ca.key, front-proxy-ca.crt, and front-proxy-ca.key) to all your control plane nodes in the Kubernetes certificates directory.
- Update the kube-controller-manager's --root-ca-file to include both old and new CA, then restart the component. Any service account created after this point will get secrets that include both old and new CAs.
Note: The files specified by the kube-controller-manager flags --client-ca-file and --cluster-signing-cert-file cannot be CA bundles. If these flags and --root-ca-file point to the same ca.crt file which is now a bundle (includes both old and new CA) you will face an error. To work around this problem you can copy the new CA to a separate file and make the flags --client-ca-file and --cluster-signing-cert-file point to the copy. Once ca.crt is no longer a bundle you can restore the problem flags to point to ca.crt and delete the copy.
- Update all service account tokens to include both old and new CA certificates. If any pods are started before the new CA is used by API servers, they will get this update and trust both old and new CAs.
base64_encoded_ca="$(base64 -w0 <path to file containing both old and new CAs>)"
for namespace in $(kubectl get ns --no-headers | awk '{print $1}'); do
    for token in $(kubectl get secrets --namespace "$namespace" --field-selector type=kubernetes.io/service-account-token -o name); do
        kubectl get $token --namespace "$namespace" -o yaml | \
            /bin/sed "s/\(ca.crt:\).*/\1 ${base64_encoded_ca}/" | \
            kubectl apply -f -
    done
done
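To spot-check the result (a sketch; the secret name is a placeholder), decode ca.crt from one token secret and count the certificates in it; a bundle should contain two:
# Replace <token-secret-name> and the namespace with a real token secret.
kubectl get secret <token-secret-name> --namespace default \
  -o jsonpath='{.data.ca\.crt}' | base64 --decode | grep -c 'BEGIN CERTIFICATE'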
- Restart all pods using in-cluster configs (for example kube-proxy and coredns) so they can use the updated certificate authority data from ServiceAccount secrets.
- Make sure coredns, kube-proxy and other pods using in-cluster configs are working as expected.
- Append both the old and new CA to the files referenced by the --client-ca-file and --kubelet-certificate-authority flags in the kube-apiserver configuration.
- Append both the old and new CA to the file referenced by the --client-ca-file flag in the kube-scheduler configuration.
- Update certificates for user accounts by replacing the content of client-certificate-data and client-key-data respectively. For information about creating certificates for individual user accounts, see Configure certificates for user accounts. Additionally, update the certificate-authority-data section in the kubeconfig files, respectively with Base64-encoded old and new certificate authority data.
- Follow the steps below in a rolling fashion.
- Restart any other aggregated API servers or webhook handlers to trust the new CA certificates.
- Restart the kubelet by updating the file referenced by clientCAFile in the kubelet configuration and certificate-authority-data in kubelet.conf to use both the old and new CA, on all nodes. If your kubelet is not using client certificate rotation, update client-certificate-data and client-key-data in kubelet.conf on all nodes, along with the kubelet client certificate file usually found in /var/lib/kubelet/pki.
- Restart API servers with the certificates (apiserver.crt, apiserver-kubelet-client.crt and front-proxy-client.crt) signed by the new CA. You can use the existing private keys or new private keys. If you changed the private keys, update these in the Kubernetes certificates directory as well. Since the pods trust both old and new CAs, there will be a momentary disconnection, after which each pod's kube client will reconnect to the new API server that uses the certificate signed by the new CA.
- Restart the scheduler to use the new CAs.
- Make sure control plane components log no TLS errors.
Note: To generate certificates and private keys for your cluster using the openssl command line tool, see Certificates (openssl). You can also use cfssl.
- Annotate any DaemonSets and Deployments to trigger pod replacement in a safer rolling fashion. For example:
for namespace in $(kubectl get namespace -o jsonpath='{.items[*].metadata.name}'); do
    for name in $(kubectl get deployments -n $namespace -o jsonpath='{.items[*].metadata.name}'); do
        kubectl patch deployment -n ${namespace} ${name} -p '{"spec":{"template":{"metadata":{"annotations":{"ca-rotation": "1"}}}}}';
    done
    for name in $(kubectl get daemonset -n $namespace -o jsonpath='{.items[*].metadata.name}'); do
        kubectl patch daemonset -n ${namespace} ${name} -p '{"spec":{"template":{"metadata":{"annotations":{"ca-rotation": "1"}}}}}';
    done
done
Note: To limit the number of concurrent disruptions that your application experiences, see how to configure a pod disruption budget.
- If your cluster uses bootstrap tokens to join nodes, update the ConfigMap cluster-info in the kube-public namespace with the new CA:
base64_encoded_ca="$(base64 -w0 /etc/kubernetes/pki/ca.crt)"
kubectl get cm/cluster-info --namespace kube-public -o yaml | \
    /bin/sed "s/\(certificate-authority-data:\).*/\1 ${base64_encoded_ca}/" | \
    kubectl apply -f -
- Verify the cluster functionality:
- Validate that the logs from control plane components, along with the kubelet and kube-proxy, are not throwing any TLS errors; see looking at the logs.
- Validate logs from any aggregated API servers and pods using in-cluster config.
- Once the cluster functionality is successfully verified:
- Update all service account tokens to include the new CA certificate only. All pods using an in-cluster kubeconfig will eventually need to be restarted to pick up the new SA secret so that the old CA is completely untrusted.
- Restart the control plane components by removing the old CA from the kubeconfig files and from the files referenced by the --client-ca-file and --root-ca-file flags respectively.
- Restart the kubelet by removing the old CA from the file referenced by the clientCAFile flag and from the kubelet kubeconfig file.
4.13 - Manage Cluster Daemons
4.13.1 - Perform a Rolling Update on a DaemonSet
This page shows how to perform a rolling update on a DaemonSet.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
DaemonSet Update Strategy
DaemonSet has two update strategy types:
- OnDelete: With the OnDelete update strategy, after you update a DaemonSet template, new DaemonSet pods will only be created when you manually delete old DaemonSet pods. This is the same behavior of DaemonSet in Kubernetes version 1.5 or before.
- RollingUpdate: This is the default update strategy. With the RollingUpdate update strategy, after you update a DaemonSet template, old DaemonSet pods will be killed, and new DaemonSet pods will be created automatically, in a controlled fashion. At most one pod of the DaemonSet will be running on each node during the whole update process.
Performing a Rolling Update
To enable the rolling update feature of a DaemonSet, you must set its .spec.updateStrategy.type to RollingUpdate.
You may want to set .spec.updateStrategy.rollingUpdate.maxUnavailable (defaults to 1), .spec.minReadySeconds (defaults to 0) and .spec.updateStrategy.rollingUpdate.maxSurge (a beta feature, defaults to 0) as well.
Creating a DaemonSet with RollingUpdate update strategy
This YAML file specifies a DaemonSet with the RollingUpdate update strategy:
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: fluentd-elasticsearch
namespace: kube-system
labels:
k8s-app: fluentd-logging
spec:
selector:
matchLabels:
name: fluentd-elasticsearch
updateStrategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 1
template:
metadata:
labels:
name: fluentd-elasticsearch
spec:
tolerations:
# this toleration is to have the daemonset runnable on master nodes
# remove it if your masters can't run pods
- key: node-role.kubernetes.io/master
effect: NoSchedule
containers:
- name: fluentd-elasticsearch
image: quay.io/fluentd_elasticsearch/fluentd:v2.5.2
volumeMounts:
- name: varlog
mountPath: /var/log
- name: varlibdockercontainers
mountPath: /var/lib/docker/containers
readOnly: true
terminationGracePeriodSeconds: 30
volumes:
- name: varlog
hostPath:
path: /var/log
- name: varlibdockercontainers
hostPath:
path: /var/lib/docker/containers
After verifying the update strategy of the DaemonSet manifest, create the DaemonSet:
kubectl create -f https://k8s.io/examples/controllers/fluentd-daemonset.yaml
Alternatively, use kubectl apply to create the same DaemonSet if you plan to update the DaemonSet with kubectl apply.
kubectl apply -f https://k8s.io/examples/controllers/fluentd-daemonset.yaml
Checking DaemonSet RollingUpdate update strategy
Check the update strategy of your DaemonSet, and make sure it's set to RollingUpdate:
kubectl get ds/fluentd-elasticsearch -o go-template='{{.spec.updateStrategy.type}}{{"\n"}}' -n kube-system
If you haven't created the DaemonSet in the system, check your DaemonSet manifest with the following command instead:
kubectl apply -f https://k8s.io/examples/controllers/fluentd-daemonset.yaml --dry-run=client -o go-template='{{.spec.updateStrategy.type}}{{"\n"}}'
The output from both commands should be:
RollingUpdate
If the output isn't RollingUpdate, go back and modify the DaemonSet object or manifest accordingly.
Updating a DaemonSet template
Any updates to a RollingUpdate DaemonSet .spec.template will trigger a rolling update. Let's update the DaemonSet by applying a new YAML file. This can be done with several different kubectl commands.
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: fluentd-elasticsearch
namespace: kube-system
labels:
k8s-app: fluentd-logging
spec:
selector:
matchLabels:
name: fluentd-elasticsearch
updateStrategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 1
template:
metadata:
labels:
name: fluentd-elasticsearch
spec:
tolerations:
# this toleration is to have the daemonset runnable on master nodes
# remove it if your masters can't run pods
- key: node-role.kubernetes.io/master
effect: NoSchedule
containers:
- name: fluentd-elasticsearch
image: quay.io/fluentd_elasticsearch/fluentd:v2.5.2
resources:
limits:
memory: 200Mi
requests:
cpu: 100m
memory: 200Mi
volumeMounts:
- name: varlog
mountPath: /var/log
- name: varlibdockercontainers
mountPath: /var/lib/docker/containers
readOnly: true
terminationGracePeriodSeconds: 30
volumes:
- name: varlog
hostPath:
path: /var/log
- name: varlibdockercontainers
hostPath:
path: /var/lib/docker/containers
Declarative commands
If you update DaemonSets using configuration files, use kubectl apply:
kubectl apply -f https://k8s.io/examples/controllers/fluentd-daemonset-update.yaml
Imperative commands
If you update DaemonSets using imperative commands, use kubectl edit:
kubectl edit ds/fluentd-elasticsearch -n kube-system
Updating only the container image
If you only need to update the container image in the DaemonSet template, i.e. .spec.template.spec.containers[*].image, use kubectl set image:
kubectl set image ds/fluentd-elasticsearch fluentd-elasticsearch=quay.io/fluentd_elasticsearch/fluentd:v2.6.0 -n kube-system
Watching the rolling update status
Finally, watch the rollout status of the latest DaemonSet rolling update:
kubectl rollout status ds/fluentd-elasticsearch -n kube-system
When the rollout is complete, the output is similar to this:
daemonset "fluentd-elasticsearch" successfully rolled out
Troubleshooting
DaemonSet rolling update is stuck
Sometimes, a DaemonSet rolling update may be stuck. Here are some possible causes:
Some nodes run out of resources
The rollout is stuck because new DaemonSet pods can't be scheduled on at least one node. This is possible when the node is running out of resources.
When this happens, find the nodes that don't have the DaemonSet pods scheduled on them by comparing the output of kubectl get nodes with the output of:
kubectl get pods -l name=fluentd-elasticsearch -o wide -n kube-system
Once you've found those nodes, delete some non-DaemonSet pods from the node to make room for new DaemonSet pods.
Broken rollout
If the recent DaemonSet template update is broken, for example, the container is crash looping, or the container image doesn't exist (often due to a typo), DaemonSet rollout won't progress.
To fix this, update the DaemonSet template again. New rollout won't be blocked by previous unhealthy rollouts.
Clock skew
If .spec.minReadySeconds is specified in the DaemonSet, clock skew between master and nodes will make the DaemonSet unable to detect the right rollout progress.
Clean up
Delete a DaemonSet from a namespace:
kubectl delete ds fluentd-elasticsearch -n kube-system
What's next
4.13.2 - Perform a Rollback on a DaemonSet
This page shows how to perform a rollback on a DaemonSet.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
Your Kubernetes server must be at or later than version 1.7. To check the version, enter kubectl version.
You should already know how to perform a rolling update on a DaemonSet.
Performing a rollback on a DaemonSet
Step 1: Find the DaemonSet revision you want to roll back to
You can skip this step if you only want to roll back to the last revision.
List all revisions of a DaemonSet:
kubectl rollout history daemonset <daemonset-name>
This returns a list of DaemonSet revisions:
daemonsets "<daemonset-name>"
REVISION CHANGE-CAUSE
1 ...
2 ...
...
- Change cause is copied from the DaemonSet annotation kubernetes.io/change-cause to its revisions upon creation. You may specify --record=true in kubectl to record the command executed in the change cause annotation.
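For example (a sketch; the manifest file name is an assumption), applying a DaemonSet with --record=true records the command as the change cause:
kubectl apply -f daemonset.yaml --record=true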
To see the details of a specific revision:
kubectl rollout history daemonset <daemonset-name> --revision=1
This returns the details of that revision:
daemonsets "<daemonset-name>" with revision #1
Pod Template:
Labels: foo=bar
Containers:
app:
Image: ...
Port: ...
Environment: ...
Mounts: ...
Volumes: ...
Step 2: Roll back to a specific revision
# Specify the revision number you get from Step 1 in --to-revision
kubectl rollout undo daemonset <daemonset-name> --to-revision=<revision>
If it succeeds, the command returns:
daemonset "<daemonset-name>" rolled back
If the --to-revision flag is not specified, kubectl picks the most recent revision.
Step 3: Watch the progress of the DaemonSet rollback
kubectl rollout undo daemonset tells the server to start rolling back the DaemonSet. The real rollback is done asynchronously inside the cluster control plane.
To watch the progress of the rollback:
kubectl rollout status ds/<daemonset-name>
When the rollback is complete, the output is similar to:
daemonset "<daemonset-name>" successfully rolled out
Understanding DaemonSet revisions
In the previous kubectl rollout history step, you got a list of DaemonSet revisions. Each revision is stored in a resource named ControllerRevision.
To see what is stored in each revision, find the DaemonSet revision raw resources:
kubectl get controllerrevision -l <daemonset-selector-key>=<daemonset-selector-value>
This returns a list of ControllerRevisions:
NAME CONTROLLER REVISION AGE
<daemonset-name>-<revision-hash> DaemonSet/<daemonset-name> 1 1h
<daemonset-name>-<revision-hash> DaemonSet/<daemonset-name> 2 1h
Each ControllerRevision stores the annotations and template of a DaemonSet revision.
kubectl rollout undo takes a specific ControllerRevision and replaces the DaemonSet template with the template stored in the ControllerRevision. kubectl rollout undo is equivalent to updating the DaemonSet template to a previous revision through other commands, such as kubectl edit or kubectl apply.
Note: The revision number (the .revision field) of the ControllerRevision being rolled back to will advance. For example, if you have revisions 1 and 2 in the system, and roll back from revision 2 to revision 1, the ControllerRevision with .revision: 1 will become .revision: 3.
Troubleshooting
4.14 - Service Catalog
4.14.1 - Install Service Catalog using Helm
Service Catalog is an extension API that enables applications running in Kubernetes clusters to easily use external managed software offerings, such as a datastore service offered by a cloud provider.
It provides a way to list, provision, and bind with external Managed Services from Service Brokers without needing detailed knowledge about how those services are created or managed.
Use Helm to install Service Catalog on your Kubernetes cluster. Up to date information on this process can be found at the kubernetes-sigs/service-catalog repo.
Before you begin
- Understand the key concepts of Service Catalog.
- Service Catalog requires a Kubernetes cluster running version 1.7 or higher.
- You must have a Kubernetes cluster with cluster DNS enabled.
- If you are using a cloud-based Kubernetes cluster or Minikube, you may already have cluster DNS enabled.
- If you are using hack/local-up-cluster.sh, ensure that the KUBE_ENABLE_CLUSTER_DNS environment variable is set, then run the install script.
- Install and setup kubectl v1.7 or higher. Make sure it is configured to connect to the Kubernetes cluster.
- Install Helm v2.7.0 or newer.
- Follow the Helm install instructions.
- If you already have an appropriate version of Helm installed, execute helm init to install Tiller, the server-side component of Helm.
Add the service-catalog Helm repository
Once Helm is installed, add the service-catalog Helm repository to your local machine by executing the following command:
helm repo add svc-cat https://kubernetes-sigs.github.io/service-catalog
Check to make sure that it installed successfully by executing the following command:
helm search repo service-catalog
If the installation was successful, the command should output the following:
NAME CHART VERSION APP VERSION DESCRIPTION
svc-cat/catalog 0.2.1 service-catalog API server and controller-manager helm chart
svc-cat/catalog-v0.2 0.2.2 service-catalog API server and controller-manager helm chart
Enable RBAC
Your Kubernetes cluster must have RBAC enabled, which requires your Tiller Pod(s) to have cluster-admin access.
When using Minikube v0.25 or older, you must run Minikube with RBAC explicitly enabled:
minikube start --extra-config=apiserver.Authorization.Mode=RBAC
When using Minikube v0.26+, run:
minikube start
With Minikube v0.26+, do not specify --extra-config. The flag has since been changed to --extra-config=apiserver.authorization-mode and Minikube now uses RBAC by default. Specifying the older flag may cause the start command to hang.
If you are using hack/local-up-cluster.sh, set the AUTHORIZATION_MODE environment variable with the following values:
AUTHORIZATION_MODE=Node,RBAC hack/local-up-cluster.sh -O
By default, helm init installs the Tiller Pod into the kube-system namespace, with Tiller configured to use the default service account.
Note: If you used the --tiller-namespace or --service-account flags when running helm init, the --serviceaccount flag in the following command needs to be adjusted to reference the appropriate namespace and ServiceAccount name.
Configure Tiller to have cluster-admin access:
kubectl create clusterrolebinding tiller-cluster-admin \
--clusterrole=cluster-admin \
--serviceaccount=kube-system:default
Install Service Catalog in your Kubernetes cluster
Install Service Catalog from the root of the Helm repository using the following command:
# Helm 3
helm install catalog svc-cat/catalog --namespace catalog
# Helm 2
helm install svc-cat/catalog --name catalog --namespace catalog
What's next
- View sample service brokers.
- Explore the kubernetes-sigs/service-catalog project.
4.14.2 - Install Service Catalog using SC
Service Catalog is an extension API that enables applications running in Kubernetes clusters to easily use external managed software offerings, such as a datastore service offered by a cloud provider.
It provides a way to list, provision, and bind with external Managed Services from Service Brokers without needing detailed knowledge about how those services are created or managed.
You can use the GCP Service Catalog Installer tool to easily install or uninstall Service Catalog on your Kubernetes cluster, linking it to Google Cloud projects.
Service Catalog can work with any kind of managed service, not only Google Cloud.
Before you begin
- Understand the key concepts of Service Catalog.
- Install Go 1.6+ and set the GOPATH.
- Install the cfssl tool needed for generating SSL artifacts.
- Service Catalog requires Kubernetes version 1.7+.
- Install and set up kubectl so that it is configured to connect to a Kubernetes v1.7+ cluster.
- The kubectl user must be bound to the cluster-admin role for it to install Service Catalog. To ensure that this is true, run the following command:
kubectl create clusterrolebinding cluster-admin-binding --clusterrole=cluster-admin --user=<user-name>
Install sc in your local environment
The installer runs on your local computer as a CLI tool named sc.
Install it using go get:
go get github.com/GoogleCloudPlatform/k8s-service-catalog/installer/cmd/sc
sc should now be installed in your GOPATH/bin directory.
Install Service Catalog in your Kubernetes cluster
First, verify that all dependencies have been installed. Run:
sc check
If the check is successful, it should return:
Dependency check passed. You are good to go.
Next, run the install command and specify the storageclass that you want to use for the backup:
sc install --etcd-backup-storageclass "standard"
Uninstall Service Catalog
If you would like to uninstall Service Catalog from your Kubernetes cluster using the sc tool, run:
sc uninstall
What's next
- View sample service brokers.
- Explore the kubernetes-sigs/service-catalog project.
4.15 - Networking
4.15.1 - Adding entries to Pod /etc/hosts with HostAliases
Adding entries to a Pod's /etc/hosts file provides Pod-level override of hostname resolution when DNS and other options are not applicable. You can add these custom entries with the HostAliases field in PodSpec.
Modification not using HostAliases is not suggested because the file is managed by the kubelet and can be overwritten during Pod creation/restart.
Default hosts file content
Start an Nginx Pod which is assigned a Pod IP:
kubectl run nginx --image nginx
pod/nginx created
Examine a Pod IP:
kubectl get pods --output=wide
NAME READY STATUS RESTARTS AGE IP NODE
nginx 1/1 Running 0 13s 10.200.0.4 worker0
The hosts file content would look like this:
kubectl exec nginx -- cat /etc/hosts
# Kubernetes-managed hosts file.
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
fe00::0 ip6-mcastprefix
fe00::1 ip6-allnodes
fe00::2 ip6-allrouters
10.200.0.4 nginx
By default, the hosts file only includes IPv4 and IPv6 boilerplate like localhost and its own hostname.
Adding additional entries with hostAliases
In addition to the default boilerplate, you can add additional entries to the hosts file. For example: to resolve foo.local and bar.local to 127.0.0.1, and foo.remote and bar.remote to 10.1.2.3, you can configure HostAliases for a Pod under .spec.hostAliases:
apiVersion: v1
kind: Pod
metadata:
name: hostaliases-pod
spec:
restartPolicy: Never
hostAliases:
- ip: "127.0.0.1"
hostnames:
- "foo.local"
- "bar.local"
- ip: "10.1.2.3"
hostnames:
- "foo.remote"
- "bar.remote"
containers:
- name: cat-hosts
image: busybox:1.28
command:
- cat
args:
- "/etc/hosts"
You can start a Pod with that configuration by running:
kubectl apply -f https://k8s.io/examples/service/networking/hostaliases-pod.yaml
pod/hostaliases-pod created
Examine a Pod's details to see its IPv4 address and its status:
kubectl get pod --output=wide
NAME READY STATUS RESTARTS AGE IP NODE
hostaliases-pod 0/1 Completed 0 6s 10.200.0.5 worker0
The hosts file content looks like this:
kubectl logs hostaliases-pod
# Kubernetes-managed hosts file.
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
fe00::0 ip6-mcastprefix
fe00::1 ip6-allnodes
fe00::2 ip6-allrouters
10.200.0.5 hostaliases-pod
# Entries added by HostAliases.
127.0.0.1 foo.local bar.local
10.1.2.3 foo.remote bar.remote
with the additional entries specified at the bottom.
Why does the kubelet manage the hosts file?
The kubelet manages the hosts file for each container of the Pod to prevent the container runtime from modifying the file after the containers have already been started. Historically, Kubernetes always used Docker Engine as its container runtime, and Docker Engine would then modify the /etc/hosts file after each container had started.
Current Kubernetes can use a variety of container runtimes; even so, the kubelet manages the hosts file within each container so that the outcome is as intended regardless of which container runtime you use.
Avoid making manual changes to the hosts file inside a container.
If you make manual changes to the hosts file, those changes are lost when the container exits.
4.15.2 - Validate IPv4/IPv6 dual-stack
This document shares how to validate IPv4/IPv6 dual-stack enabled Kubernetes clusters.
Before you begin
- Provider support for dual-stack networking (Cloud provider or otherwise must be able to provide Kubernetes nodes with routable IPv4/IPv6 network interfaces)
- A network plugin that supports dual-stack (such as Calico, Cilium or Kubenet)
- Dual-stack enabled cluster
To check the version, enter kubectl version.
Validate addressing
Validate node addressing
Each dual-stack Node should have a single IPv4 block and a single IPv6 block allocated. Validate that IPv4/IPv6 Pod address ranges are configured by running the following command. Replace the sample node name with a valid dual-stack Node from your cluster. In this example, the Node's name is k8s-linuxpool1-34450317-0:
kubectl get nodes k8s-linuxpool1-34450317-0 -o go-template --template='{{range .spec.podCIDRs}}{{printf "%s\n" .}}{{end}}'
10.244.1.0/24
a00:100::/24
There should be one IPv4 block and one IPv6 block allocated.
Validate that the node has an IPv4 and IPv6 interface detected. Replace the node name with a valid node from the cluster. In this example the node name is k8s-linuxpool1-34450317-0:
kubectl get nodes k8s-linuxpool1-34450317-0 -o go-template --template='{{range .status.addresses}}{{printf "%s: %s\n" .type .address}}{{end}}'
Hostname: k8s-linuxpool1-34450317-0
InternalIP: 10.240.0.5
InternalIP: 2001:1234:5678:9abc::5
Validate Pod addressing
Validate that a Pod has an IPv4 and IPv6 address assigned. Replace the Pod name with a valid Pod in your cluster. In this example the Pod name is pod01:
kubectl get pods pod01 -o go-template --template='{{range .status.podIPs}}{{printf "%s\n" .ip}}{{end}}'
10.244.1.4
a00:100::4
You can also validate Pod IPs using the Downward API via the status.podIPs fieldPath. The following snippet demonstrates how you can expose the Pod IPs via an environment variable called MY_POD_IPS within a container.
env:
- name: MY_POD_IPS
valueFrom:
fieldRef:
fieldPath: status.podIPs
The following command prints the value of the MY_POD_IPS environment variable from within a container. The value is a comma separated list that corresponds to the Pod's IPv4 and IPv6 addresses.
kubectl exec -it pod01 -- set | grep MY_POD_IPS
MY_POD_IPS=10.244.1.4,a00:100::4
The Pod's IP addresses will also be written to /etc/hosts within a container. The following command executes a cat on /etc/hosts on a dual-stack Pod. From the output you can verify both the IPv4 and the IPv6 address for the Pod.
kubectl exec -it pod01 -- cat /etc/hosts
# Kubernetes-managed hosts file.
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
fe00::0 ip6-mcastprefix
fe00::1 ip6-allnodes
fe00::2 ip6-allrouters
10.244.1.4 pod01
a00:100::4 pod01
Validate Services
Create the following Service that does not explicitly define .spec.ipFamilyPolicy. Kubernetes will assign a cluster IP for the Service from the first configured service-cluster-ip-range and set the .spec.ipFamilyPolicy to SingleStack.
apiVersion: v1
kind: Service
metadata:
name: my-service
labels:
app: MyApp
spec:
selector:
app: MyApp
ports:
- protocol: TCP
port: 80
Use kubectl to view the YAML for the Service.
kubectl get svc my-service -o yaml
The Service has .spec.ipFamilyPolicy set to SingleStack and .spec.clusterIP set to an IPv4 address from the first configured range set via the --service-cluster-ip-range flag on kube-controller-manager.
apiVersion: v1
kind: Service
metadata:
name: my-service
namespace: default
spec:
clusterIP: 10.0.217.164
clusterIPs:
- 10.0.217.164
ipFamilies:
- IPv4
ipFamilyPolicy: SingleStack
ports:
- port: 80
protocol: TCP
targetPort: 9376
selector:
app: MyApp
sessionAffinity: None
type: ClusterIP
status:
loadBalancer: {}
Create the following Service that explicitly defines IPv6 as the first array element in .spec.ipFamilies. Kubernetes will assign a cluster IP for the Service from the IPv6 range configured in service-cluster-ip-range and set the .spec.ipFamilyPolicy to SingleStack.
apiVersion: v1
kind: Service
metadata:
name: my-service
labels:
app: MyApp
spec:
ipFamilies:
- IPv6
selector:
app: MyApp
ports:
- protocol: TCP
port: 80
Use kubectl to view the YAML for the Service.
kubectl get svc my-service -o yaml
The Service has .spec.ipFamilyPolicy set to SingleStack and .spec.clusterIP set to an IPv6 address from the IPv6 range set via the --service-cluster-ip-range flag on kube-controller-manager.
apiVersion: v1
kind: Service
metadata:
labels:
app: MyApp
name: my-service
spec:
clusterIP: fd00::5118
clusterIPs:
- fd00::5118
ipFamilies:
- IPv6
ipFamilyPolicy: SingleStack
ports:
- port: 80
protocol: TCP
targetPort: 80
selector:
app: MyApp
sessionAffinity: None
type: ClusterIP
status:
loadBalancer: {}
Create the following Service that explicitly defines PreferDualStack in .spec.ipFamilyPolicy. Kubernetes will assign both IPv4 and IPv6 addresses (as this cluster has dual-stack enabled) and select the .spec.ClusterIP from the list of .spec.ClusterIPs based on the address family of the first element in the .spec.ipFamilies array.
apiVersion: v1
kind: Service
metadata:
name: my-service
labels:
app: MyApp
spec:
ipFamilyPolicy: PreferDualStack
selector:
app: MyApp
ports:
- protocol: TCP
port: 80
The kubectl get svc command will only show the primary IP in the CLUSTER-IP field.
kubectl get svc -l app=MyApp
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
my-service ClusterIP 10.0.216.242 <none> 80/TCP 5s
Validate that the Service gets cluster IPs from the IPv4 and IPv6 address blocks using kubectl describe. You may then validate access to the service via the IPs and ports.
kubectl describe svc -l app=MyApp
Name: my-service
Namespace: default
Labels: app=MyApp
Annotations: <none>
Selector: app=MyApp
Type: ClusterIP
IP Family Policy: PreferDualStack
IP Families: IPv4,IPv6
IP: 10.0.216.242
IPs: 10.0.216.242,fd00::af55
Port: <unset> 80/TCP
TargetPort: 9376/TCP
Endpoints: <none>
Session Affinity: None
Events: <none>
Create a dual-stack load balanced Service
If the cloud provider supports the provisioning of IPv6 enabled external load balancers, create the following Service with PreferDualStack in .spec.ipFamilyPolicy, IPv6 as the first element of the .spec.ipFamilies array, and the type field set to LoadBalancer.
apiVersion: v1
kind: Service
metadata:
name: my-service
labels:
app: MyApp
spec:
ipFamilyPolicy: PreferDualStack
ipFamilies:
- IPv6
type: LoadBalancer
selector:
app: MyApp
ports:
- protocol: TCP
port: 80
Check the Service:
kubectl get svc -l app=MyApp
Validate that the Service receives a CLUSTER-IP address from the IPv6 address block along with an EXTERNAL-IP. You may then validate access to the service via the IP and port.
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
my-service LoadBalancer fd00::7ebc 2603:1030:805::5 80:30790/TCP 35s
4.16 - Configure a kubelet image credential provider
Kubernetes v1.20 [alpha]
Starting from Kubernetes v1.20, the kubelet can dynamically retrieve credentials for a container image registry using exec plugins. The kubelet and the exec plugin communicate through stdio (stdin, stdout, and stderr) using Kubernetes versioned APIs. These plugins allow the kubelet to request credentials for a container registry dynamically as opposed to storing static credentials on disk. For example, the plugin may talk to a local metadata server to retrieve short-lived credentials for an image that is being pulled by the kubelet.
You may be interested in using this capability if any of the below are true:
- API calls to a cloud provider service are required to retrieve authentication information for a registry.
- Credentials have short expiration times and requesting new credentials frequently is required.
- Storing registry credentials on disk or in imagePullSecrets is not acceptable.
This guide demonstrates how to configure the kubelet's image credential provider plugin mechanism.
Before you begin
- The kubelet image credential provider is introduced in v1.20 as an alpha feature. As with other alpha features, a feature gate KubeletCredentialProviders must be enabled on only the kubelet for the feature to work.
- A working implementation of a credential provider exec plugin. You can build your own plugin or use one provided by cloud providers.
Installing Plugins on Nodes
A credential provider plugin is an executable binary that will be run by the kubelet. Ensure that the plugin binary exists on every node in your cluster and stored in a known directory. The directory will be required later when configuring kubelet flags.
Configuring the Kubelet
In order to use this feature, the kubelet expects two flags to be set:
- --image-credential-provider-config: the path to the credential provider plugin config file.
- --image-credential-provider-bin-dir: the path to the directory where credential provider plugin binaries are located.
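A hedged example of wiring these up (the paths are assumptions; use directories that exist on your nodes), together with the alpha feature gate mentioned above:
# Other kubelet flags omitted for brevity.
kubelet --feature-gates=KubeletCredentialProviders=true \
  --image-credential-provider-config=/etc/kubernetes/credential-provider-config.yaml \
  --image-credential-provider-bin-dir=/usr/local/bin/credential-providers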
Configure a kubelet credential provider
The configuration file passed into --image-credential-provider-config is read by the kubelet to determine which exec plugins should be invoked for which container images. Here's an example configuration file you may end up using if you are using the ECR-based plugin:
apiVersion: kubelet.config.k8s.io/v1alpha1
kind: CredentialProviderConfig
# providers is a list of credential provider plugins that will be enabled by the kubelet.
# Multiple providers may match against a single image, in which case credentials
# from all providers will be returned to the kubelet. If multiple providers are called
# for a single image, the results are combined. If providers return overlapping
# auth keys, the value from the provider earlier in this list is used.
providers:
  # name is the required name of the credential provider. It must match the name of the
  # provider executable as seen by the kubelet. The executable must be in the kubelet's
  # bin directory (set by the --image-credential-provider-bin-dir flag).
  - name: ecr
    # matchImages is a required list of strings used to match against images in order to
    # determine if this provider should be invoked. If one of the strings matches the
    # requested image from the kubelet, the plugin will be invoked and given a chance
    # to provide credentials. Images are expected to contain the registry domain
    # and URL path.
    #
    # Each entry in matchImages is a pattern which can optionally contain a port and a path.
    # Globs can be used in the domain, but not in the port or the path. Globs are supported
    # as subdomains like '*.k8s.io' or 'k8s.*.io', and top-level-domains such as 'k8s.*'.
    # Matching partial subdomains like 'app*.k8s.io' is also supported. Each glob can only match
    # a single subdomain segment, so *.io does not match *.k8s.io.
    #
    # A match exists between an image and a matchImage when all of the below are true:
    # - Both contain the same number of domain parts and each part matches.
    # - The URL path of an imageMatch must be a prefix of the target image URL path.
    # - If the imageMatch contains a port, then the port must match in the image as well.
    #
    # Example values of matchImages:
    # - 123456789.dkr.ecr.us-east-1.amazonaws.com
    # - *.azurecr.io
    # - gcr.io
    # - *.*.registry.io
    # - registry.io:8080/path
    matchImages:
      - "*.dkr.ecr.*.amazonaws.com"
      - "*.dkr.ecr.*.amazonaws.cn"
      - "*.dkr.ecr-fips.*.amazonaws.com"
      - "*.dkr.ecr.us-iso-east-1.c2s.ic.gov"
      - "*.dkr.ecr.us-isob-east-1.sc2s.sgov.gov"
    # defaultCacheDuration is the default duration the plugin will cache credentials in-memory
    # if a cache duration is not provided in the plugin response. This field is required.
    defaultCacheDuration: "12h"
    # Required input version of the exec CredentialProviderRequest. The returned CredentialProviderResponse
    # MUST use the same encoding version as the input. Current supported values are:
    # - credentialprovider.kubelet.k8s.io/v1alpha1
    apiVersion: credentialprovider.kubelet.k8s.io/v1alpha1
    # Arguments to pass to the command when executing it.
    # +optional
    args:
      - get-credentials
    # Env defines additional environment variables to expose to the process. These
    # are unioned with the host's environment, as well as variables client-go uses
    # to pass argument to the plugin.
    # +optional
    env:
      - name: AWS_PROFILE
        value: example_profile
The providers field is a list of enabled plugins used by the kubelet. Each entry has a few required fields:
- name: the name of the plugin which MUST match the name of the executable binary that exists in the directory passed into --image-credential-provider-bin-dir.
- matchImages: a list of strings used to match against images in order to determine if this provider should be invoked. More on this below.
- defaultCacheDuration: the default duration the kubelet will cache credentials in-memory if a cache duration was not specified by the plugin.
- apiVersion: the API version that the kubelet and the exec plugin will use when communicating.
Each credential provider can also be given optional args and environment variables. Consult the plugin implementors to determine what set of arguments and environment variables are required for a given plugin.
Configure image matching
The matchImages field for each credential provider is used by the kubelet to determine whether a plugin should be invoked for a given image that a Pod is using. Each entry in matchImages is an image pattern which can optionally contain a port and a path. Globs can be used in the domain, but not in the port or the path. Globs are supported as subdomains like *.k8s.io or k8s.*.io, and top-level domains such as k8s.*. Matching partial subdomains like app*.k8s.io is also supported. Each glob can only match a single subdomain segment, so *.io does NOT match *.k8s.io.
A match exists between an image name and a matchImages entry when all of the below are true:
- Both contain the same number of domain parts and each part matches.
- The URL path of the match entry must be a prefix of the target image URL path.
- If the match entry contains a port, then the port must match in the image as well.
Some example values of matchImages patterns are:
123456789.dkr.ecr.us-east-1.amazonaws.com
*.azurecr.io
gcr.io
*.*.registry.io
foo.registry.io:8080/path
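To illustrate the matching rules with hypothetical image names: the pattern registry.io:8080/path matches registry.io:8080/path/to/image (the domain parts and port match, and /path is a prefix of the image's URL path), but it does not match registry.io/path/to/image, because the image lacks the port. Likewise, *.azurecr.io matches myregistry.azurecr.io but not sub.myregistry.azurecr.io, since each glob matches exactly one subdomain segment.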
What's next
- Read the details about CredentialProviderConfig in the kubelet configuration API (v1alpha1) reference.
- Read the kubelet credential provider API reference (v1alpha1).
4.17 - Extend kubectl with plugins
This guide demonstrates how to install and write extensions for kubectl. By thinking of core kubectl commands as essential building blocks for interacting with a Kubernetes cluster, a cluster administrator can think of plugins as a means of utilizing these building blocks to create more complex behavior. Plugins extend kubectl with new sub-commands, allowing for new and custom features not included in the main distribution of kubectl.
Before you begin
You need to have a working kubectl binary installed.
Installing kubectl plugins
A plugin is a standalone executable file whose name begins with kubectl-. To install a plugin, move its executable file to anywhere on your PATH.
You can also discover and install kubectl plugins available in the open source using Krew. Krew is a plugin manager maintained by the Kubernetes SIG CLI community.
Discovering plugins
kubectl provides a command, kubectl plugin list, that searches your PATH for valid plugin executables. Executing this command causes a traversal of all files in your PATH. Any files that are executable and begin with kubectl- will show up in this command's output in the order in which they are present in your PATH.
A warning will be included for any files beginning with kubectl- that are not executable. A warning will also be included for any valid plugin files that overlap each other's name.
You can use Krew to discover and install kubectl plugins from a community-curated plugin index.
Limitations
It is currently not possible to create plugins that overwrite existing kubectl commands. For example, creating a plugin kubectl-version will cause that plugin to never be executed, as the existing kubectl version command will always take precedence over it. Due to this limitation, it is also not possible to use plugins to add new subcommands to existing kubectl commands. For example, adding a subcommand kubectl create foo by naming your plugin kubectl-create-foo will cause that plugin to be ignored.
kubectl plugin list shows warnings for any valid plugins that attempt to do this.
Writing kubectl plugins
You can write a plugin in any programming language or script that allows you to write command-line commands. There is no plugin installation or pre-loading required. Plugin executables receive the inherited environment from the kubectl binary.
A plugin determines which command path it wishes to implement based on its name. For example, a plugin named kubectl-foo provides a command kubectl foo. You must install the plugin executable somewhere in your PATH.
Example plugin
#!/bin/bash
# optional argument handling
if [[ "$1" == "version" ]]
then
echo "1.0.0"
exit 0
fi
# optional argument handling
if [[ "$1" == "config" ]]
then
echo "$KUBECONFIG"
exit 0
fi
echo "I am a plugin named kubectl-foo"
Using a plugin
To use a plugin, make the plugin executable:
sudo chmod +x ./kubectl-foo
and place it anywhere in your PATH:
sudo mv ./kubectl-foo /usr/local/bin
You may now invoke your plugin as a kubectl command:
kubectl foo
I am a plugin named kubectl-foo
All args and flags are passed as-is to the executable:
kubectl foo version
1.0.0
All environment variables are also passed as-is to the executable:
export KUBECONFIG=~/.kube/config
kubectl foo config
/home/<user>/.kube/config
KUBECONFIG=/etc/kube/config kubectl foo config
/etc/kube/config
Additionally, the first argument that is passed to a plugin will always be the full path to the location where it was invoked ($0 would equal /usr/local/bin/kubectl-foo in the example above).
Naming a plugin
As seen in the example above, a plugin determines the command path that it will implement based on its filename. Every sub-command in the command path that a plugin targets is separated by a dash (-). For example, a plugin that wishes to be invoked whenever the command kubectl foo bar baz is invoked by the user would have the filename kubectl-foo-bar-baz.
Flags and argument handling
The plugin mechanism does not create any custom, plugin-specific values or environment variables for a plugin process.
An older kubectl plugin mechanism provided environment variables such as KUBECTL_PLUGINS_CURRENT_NAMESPACE; that no longer happens.
kubectl plugins must parse and validate all of the arguments passed to them. See using the command line runtime package for details of a Go library aimed at plugin authors.
Here are some additional cases where users invoke your plugin while providing additional flags and arguments. This builds upon the kubectl-foo-bar-baz plugin from the scenario above.
If you run kubectl foo bar baz arg1 --flag=value arg2, kubectl's plugin mechanism will first try to find the plugin with the longest possible name, which in this case would be kubectl-foo-bar-baz-arg1. Upon not finding that plugin, kubectl then treats the last dash-separated value as an argument (arg1 in this case), and attempts to find the next longest possible name, kubectl-foo-bar-baz.
Upon having found a plugin with this name, kubectl then invokes that plugin, passing all args and flags after the plugin's name as arguments to the plugin process.
Example:
# create a plugin
echo -e '#!/bin/bash\n\necho "My first command-line argument was $1"' > kubectl-foo-bar-baz
sudo chmod +x ./kubectl-foo-bar-baz
# "install" your plugin by moving it to a directory in your $PATH
sudo mv ./kubectl-foo-bar-baz /usr/local/bin
# check that kubectl recognizes your plugin
kubectl plugin list
The following kubectl-compatible plugins are available:
/usr/local/bin/kubectl-foo-bar-baz
# test that calling your plugin via a "kubectl" command works
# even when additional arguments and flags are passed to your
# plugin executable by the user.
kubectl foo bar baz arg1 --meaningless-flag=true
My first command-line argument was arg1
As you can see, your plugin was found based on the kubectl
command specified by a user, and all extra arguments and flags were passed as-is to the plugin executable once it was found.
Names with dashes and underscores
Although the kubectl plugin mechanism uses the dash (-) in plugin filenames to separate the sequence of sub-commands processed by the plugin, it is still possible to create a plugin command containing dashes in its commandline invocation by using underscores (_) in its filename.
Example:
# create a plugin containing an underscore in its filename
echo -e '#!/bin/bash\n\necho "I am a plugin with a dash in my name"' > ./kubectl-foo_bar
sudo chmod +x ./kubectl-foo_bar
# move the plugin into your $PATH
sudo mv ./kubectl-foo_bar /usr/local/bin
# You can now invoke your plugin via kubectl:
kubectl foo-bar
I am a plugin with a dash in my name
Note that the introduction of underscores to a plugin filename does not prevent you from having commands such as kubectl foo_bar. The command from the above example can be invoked using either a dash (-) or an underscore (_):
# You can invoke your custom command with a dash
kubectl foo-bar
I am a plugin with a dash in my name
# You can also invoke your custom command with an underscore
kubectl foo_bar
I am a plugin with a dash in my name
Name conflicts and overshadowing
It is possible to have multiple plugins with the same filename in different locations throughout your PATH. For example, given a PATH with the following value: PATH=/usr/local/bin/plugins:/usr/local/bin/moreplugins, a copy of plugin kubectl-foo could exist in /usr/local/bin/plugins and /usr/local/bin/moreplugins, such that the output of the kubectl plugin list command is:
PATH=/usr/local/bin/plugins:/usr/local/bin/moreplugins kubectl plugin list
The following kubectl-compatible plugins are available:
/usr/local/bin/plugins/kubectl-foo
/usr/local/bin/moreplugins/kubectl-foo
- warning: /usr/local/bin/moreplugins/kubectl-foo is overshadowed by a similarly named plugin: /usr/local/bin/plugins/kubectl-foo
error: one plugin warning was found
In the above scenario, the warning under /usr/local/bin/moreplugins/kubectl-foo tells you that this plugin will never be executed. Instead, the executable that appears first in your PATH, /usr/local/bin/plugins/kubectl-foo, will always be found and executed first by the kubectl plugin mechanism.
A way to resolve this issue is to ensure that the location of the plugin that you wish to use with kubectl always comes first in your PATH. For example, if you want to always use /usr/local/bin/moreplugins/kubectl-foo anytime that the kubectl command kubectl foo is invoked, change the value of your PATH to be /usr/local/bin/moreplugins:/usr/local/bin/plugins.
Invocation of the longest executable filename
There is another kind of overshadowing that can occur with plugin filenames. Given two plugins present in a user's PATH, kubectl-foo-bar and kubectl-foo-bar-baz, the kubectl plugin mechanism will always choose the longest possible plugin name for a given user command. The examples below clarify this further:
# for a given kubectl command, the plugin with the longest possible filename will always be preferred
kubectl foo bar baz
Plugin kubectl-foo-bar-baz is executed
kubectl foo bar
Plugin kubectl-foo-bar is executed
kubectl foo bar baz buz
Plugin kubectl-foo-bar-baz is executed, with "buz" as its first argument
kubectl foo bar buz
Plugin kubectl-foo-bar is executed, with "buz" as its first argument
This design choice ensures that plugin sub-commands can be implemented across multiple files, if needed, and that these sub-commands can be nested under a "parent" plugin command:
ls ./plugin_command_tree
kubectl-parent
kubectl-parent-subcommand
kubectl-parent-subcommand-subsubcommand
Checking for plugin warnings
You can use the aforementioned kubectl plugin list command to ensure that your plugin is visible to kubectl, and verify that there are no warnings preventing it from being called as a kubectl command.
kubectl plugin list
The following kubectl-compatible plugins are available:
test/fixtures/pkg/kubectl/plugins/kubectl-foo
/usr/local/bin/kubectl-foo
- warning: /usr/local/bin/kubectl-foo is overshadowed by a similarly named plugin: test/fixtures/pkg/kubectl/plugins/kubectl-foo
plugins/kubectl-invalid
- warning: plugins/kubectl-invalid identified as a kubectl plugin, but it is not executable
error: 2 plugin warnings were found
Using the command line runtime package
If you're writing a plugin for kubectl and you're using Go, you can make use of the cli-runtime utility libraries.
These libraries provide helpers for parsing or updating a user's kubeconfig file, making REST-style requests to the API server, or binding flags associated with configuration and printing.
See the Sample CLI Plugin for an example usage of the tools provided in the CLI Runtime repo.
Distributing kubectl plugins
If you have developed a plugin for others to use, you should consider how you package it, distribute it and deliver updates to your users.
Krew
Krew offers a cross-platform way to package and distribute your plugins. This way, you use a single packaging format for all target platforms (Linux, Windows, macOS etc) and deliver updates to your users. Krew also maintains a plugin index so that other people can discover your plugin and install it.
Native / platform specific package management
Alternatively, you can use traditional package managers such as apt or yum on Linux, Chocolatey on Windows, and Homebrew on macOS. Any package manager will be suitable if it can place new executables somewhere in the user's PATH.
As a plugin author, if you pick this option then you also have the burden
of updating your kubectl plugin's distribution package across multiple
platforms for each release.
Source code
You can publish the source code; for example, as a Git repository. If you choose this option, someone who wants to use that plugin must fetch the code, set up a build environment (if it needs compiling), and deploy the plugin. If you also make compiled packages available, or use Krew, that will make installs easier.
What's next
- Check the Sample CLI Plugin repository for a detailed example of a plugin written in Go. In case of any questions, feel free to reach out to the SIG CLI team.
- Read about Krew, a package manager for kubectl plugins.
4.18 - Manage HugePages
Kubernetes v1.23 [stable]
Kubernetes supports the allocation and consumption of pre-allocated huge pages by applications in a Pod. This page describes how users can consume huge pages.
Before you begin
- Kubernetes nodes must pre-allocate huge pages in order for the node to report its huge page capacity. A node can pre-allocate huge pages for multiple sizes.
The nodes will automatically discover and report all huge page resources as schedulable resources.
API
Huge pages can be consumed via container level resource requirements using the resource name hugepages-<size>, where <size> is the most compact binary notation using integer values supported on a particular node. For example, if a node supports 2048KiB and 1048576KiB page sizes, it will expose the schedulable resources hugepages-2Mi and hugepages-1Gi. Unlike CPU or memory, huge pages do not support overcommit. Note that when requesting hugepage resources, either memory or CPU resources must be requested as well.
A pod may consume multiple huge page sizes in a single pod spec. In this case it must use medium: HugePages-<hugepagesize> notation for all volume mounts.
apiVersion: v1
kind: Pod
metadata:
  name: huge-pages-example
spec:
  containers:
  - name: example
    image: fedora:latest
    command:
    - sleep
    - inf
    volumeMounts:
    - mountPath: /hugepages-2Mi
      name: hugepage-2mi
    - mountPath: /hugepages-1Gi
      name: hugepage-1gi
    resources:
      limits:
        hugepages-2Mi: 100Mi
        hugepages-1Gi: 2Gi
        memory: 100Mi
      requests:
        memory: 100Mi
  volumes:
  - name: hugepage-2mi
    emptyDir:
      medium: HugePages-2Mi
  - name: hugepage-1gi
    emptyDir:
      medium: HugePages-1Gi
A pod may use medium: HugePages only if it requests huge pages of one size.
apiVersion: v1
kind: Pod
metadata:
  name: huge-pages-example
spec:
  containers:
  - name: example
    image: fedora:latest
    command:
    - sleep
    - inf
    volumeMounts:
    - mountPath: /hugepages
      name: hugepage
    resources:
      limits:
        hugepages-2Mi: 100Mi
        memory: 100Mi
      requests:
        memory: 100Mi
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
- Huge page requests must equal the limits. This is the default if limits are specified but requests are not.
- Huge pages are isolated at the container scope, so each container has its own limit on its cgroup sandbox as requested in the container spec.
- EmptyDir volumes backed by huge pages may not consume more huge page memory than the pod request.
- Applications that consume huge pages via shmget() with SHM_HUGETLB must run with a supplemental group that matches /proc/sys/vm/hugetlb_shm_group.
- Huge page usage in a namespace is controllable via ResourceQuota, similar to other compute resources like cpu or memory, using the hugepages-<size> token.
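For example, a ResourceQuota along these lines (the quota name and the 1Gi cap are illustrative values) limits how many 2 MiB huge pages the Pods in a namespace may request in total:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: hugepages-quota
spec:
  hard:
    hugepages-2Mi: 1Gi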
4.19 - Schedule GPUs
Kubernetes v1.10 [beta]
Kubernetes includes experimental support for managing AMD and NVIDIA GPUs (graphics processing units) across several nodes.
This page describes how users can consume GPUs across different Kubernetes versions and the current limitations.
Using device plugins
Kubernetes implements Device Plugins to let Pods access specialized hardware features such as GPUs.
As an administrator, you have to install GPU drivers from the corresponding hardware vendor on the nodes and run the corresponding device plugin from the GPU vendor (AMD or NVIDIA; see the sections below).
When the above conditions are true, Kubernetes will expose amd.com/gpu or nvidia.com/gpu as a schedulable resource.
You can consume these GPUs from your containers by requesting <vendor>.com/gpu the same way you request cpu or memory. However, there are some limitations in how you specify the resource requirements when using GPUs:
- GPUs are only supposed to be specified in the limits section, which means:
  - You can specify GPU limits without specifying requests, because Kubernetes will use the limit as the request value by default.
  - You can specify GPU in both limits and requests, but these two values must be equal.
  - You cannot specify GPU requests without specifying limits.
- Containers (and Pods) do not share GPUs. There's no overcommitting of GPUs.
- Each container can request one or more GPUs. It is not possible to request a fraction of a GPU.
Here's an example:
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      # https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidia-cuda/Dockerfile
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
Deploying AMD GPU device plugin
The official AMD GPU device plugin has the following requirements:
- Kubernetes nodes have to be pre-installed with the AMD GPU Linux driver.
To deploy the AMD device plugin once your cluster is running and the above requirements are satisfied:
kubectl create -f https://raw.githubusercontent.com/RadeonOpenCompute/k8s-device-plugin/v1.10/k8s-ds-amdgpu-dp.yaml
You can report issues with this third-party device plugin by logging an issue in RadeonOpenCompute/k8s-device-plugin.
Deploying NVIDIA GPU device plugin
There are currently two device plugin implementations for NVIDIA GPUs:
Official NVIDIA GPU device plugin
The official NVIDIA GPU device plugin has the following requirements:
- Kubernetes nodes have to be pre-installed with NVIDIA drivers.
- Kubernetes nodes have to be pre-installed with nvidia-docker 2.0.
- Kubelet must use Docker as its container runtime.
- nvidia-container-runtime must be configured as the default runtime for Docker, instead of runc.
- The version of the NVIDIA drivers must match the constraint ~= 384.81.
To deploy the NVIDIA device plugin once your cluster is running and the above requirements are satisfied:
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml
You can report issues with this third-party device plugin by logging an issue in NVIDIA/k8s-device-plugin.
NVIDIA GPU device plugin used by GCE
The NVIDIA GPU device plugin used by GCE doesn't require using nvidia-docker and should work with any container runtime that is compatible with the Kubernetes Container Runtime Interface (CRI). It's tested on Container-Optimized OS and has experimental code for Ubuntu from Kubernetes 1.9 onwards.
You can use the following commands to install the NVIDIA drivers and device plugin:
# Install NVIDIA drivers on Container-Optimized OS:
kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/stable/daemonset.yaml
# Install NVIDIA drivers on Ubuntu (experimental):
kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/stable/nvidia-driver-installer/ubuntu/daemonset.yaml
# Install the device plugin:
kubectl create -f https://raw.githubusercontent.com/kubernetes/kubernetes/release-1.14/cluster/addons/device-plugins/nvidia-gpu/daemonset.yaml
You can report issues with using or deploying this third-party device plugin by logging an issue in GoogleCloudPlatform/container-engine-accelerators.
Google publishes its own instructions for using NVIDIA GPUs on GKE.
Clusters containing different types of GPUs
If different nodes in your cluster have different types of GPUs, then you can use Node Labels and Node Selectors to schedule pods to appropriate nodes.
For example:
# Label your nodes with the accelerator type they have.
kubectl label nodes <node-with-k80> accelerator=nvidia-tesla-k80
kubectl label nodes <node-with-p100> accelerator=nvidia-tesla-p100
Automatic node labelling
If you're using AMD GPU devices, you can deploy Node Labeller. Node Labeller is a controller that automatically labels your nodes with GPU device properties.
At the moment, that controller can add labels for:
- Device ID (-device-id)
- VRAM Size (-vram)
- Number of SIMD (-simd-count)
- Number of Compute Unit (-cu-count)
- Firmware and Feature Versions (-firmware)
- GPU Family, as a two-letter acronym (-family)
- SI - Southern Islands
- CI - Sea Islands
- KV - Kaveri
- VI - Volcanic Islands
- CZ - Carrizo
- AI - Arctic Islands
- RV - Raven
kubectl describe node cluster-node-23
Name: cluster-node-23
Roles: <none>
Labels: beta.amd.com/gpu.cu-count.64=1
beta.amd.com/gpu.device-id.6860=1
beta.amd.com/gpu.family.AI=1
beta.amd.com/gpu.simd-count.256=1
beta.amd.com/gpu.vram.16G=1
beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/hostname=cluster-node-23
Annotations: kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
node.alpha.kubernetes.io/ttl: 0
…
With the Node Labeller in use, you can specify the GPU type in the Pod spec:
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      # https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidia-cuda/Dockerfile
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1
  nodeSelector:
    accelerator: nvidia-tesla-p100 # or nvidia-tesla-k80 etc.
This will ensure that the Pod will be scheduled to a node that has the GPU type you specified.
5 - Tutorials
This section of the Kubernetes documentation contains tutorials. A tutorial shows how to accomplish a goal that is larger than a single task. Typically a tutorial has several sections, each of which has a sequence of steps. Before walking through each tutorial, you may want to bookmark the Standardized Glossary page for later reference.
Basics
- Kubernetes Basics is an in-depth interactive tutorial that helps you understand the Kubernetes system and try out some basic Kubernetes features.
Configuration
Stateless Applications
Stateful Applications
Services
Security
- Apply Pod Security Standards at Cluster level
- Apply Pod Security Standards at Namespace level
- AppArmor
- seccomp
What's next
If you would like to write a tutorial, see Content Page Types for information about the tutorial page type.
5.1 - Hello Minikube
This tutorial shows you how to run a sample app on Kubernetes using minikube and Katacoda. Katacoda provides a free, in-browser Kubernetes environment.
Objectives
- Deploy a sample application to minikube.
- Run the app.
- View application logs.
Before you begin
This tutorial provides a container image that uses NGINX to echo back all the requests.
Create a minikube cluster
- Click Launch Terminal, then run:
minikube start
Before you run minikube dashboard, you should open a new terminal, start minikube dashboard there, and then switch back to the main terminal.
- Open the Kubernetes dashboard in a browser:
minikube dashboard
- Katacoda environment only: At the top of the terminal pane, click the plus sign, and then click Select port to view on Host 1.
- Katacoda environment only: Type 30000, and then click Display Port.
The dashboard command enables the dashboard add-on and opens the proxy in the default web browser.
You can create Kubernetes resources on the dashboard such as Deployment and Service.
If you are running in an environment as root, see Open Dashboard with URL.
By default, the dashboard is only accessible from within the internal Kubernetes virtual network.
The dashboard command creates a temporary proxy to make the dashboard accessible from outside the Kubernetes virtual network. To stop the proxy, press Ctrl+C to exit the process. After the command exits, the dashboard remains running in the Kubernetes cluster. You can run the dashboard command again to create another proxy to access the dashboard.
Open Dashboard with URL
If you don't want to open a web browser, run the dashboard command with the --url flag to emit a URL:
minikube dashboard --url
Create a Deployment
A Kubernetes Pod is a group of one or more Containers, tied together for the purposes of administration and networking. The Pod in this tutorial has only one Container. A Kubernetes Deployment checks on the health of your Pod and restarts the Pod's Container if it terminates. Deployments are the recommended way to manage the creation and scaling of Pods.
- Use the kubectl create command to create a Deployment that manages a Pod. The Pod runs a Container based on the provided Docker image.
kubectl create deployment hello-node --image=k8s.gcr.io/echoserver:1.4
- View the Deployment:
kubectl get deployments
The output is similar to:
NAME         READY   UP-TO-DATE   AVAILABLE   AGE
hello-node   1/1     1            1           1m
- View the Pod:
kubectl get pods
The output is similar to:
NAME                          READY   STATUS    RESTARTS   AGE
hello-node-5f76cf6ccf-br9b5   1/1     Running   0          1m
- View cluster events:
kubectl get events
- View the kubectl configuration:
kubectl config view
For more information about kubectl commands, see the kubectl overview.
Create a Service
By default, the Pod is only accessible by its internal IP address within the Kubernetes cluster. To make the hello-node Container accessible from outside the Kubernetes virtual network, you have to expose the Pod as a Kubernetes Service.
- Expose the Pod to the public internet using the kubectl expose command:
kubectl expose deployment hello-node --type=LoadBalancer --port=8080
The --type=LoadBalancer flag indicates that you want to expose your Service outside of the cluster. The application code inside the image k8s.gcr.io/echoserver only listens on TCP port 8080. If you used kubectl expose to expose a different port, clients could not connect to that other port.
View the Service you created:
kubectl get services
The output is similar to:
NAME         TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
hello-node   LoadBalancer   10.108.144.78   <pending>     8080:30369/TCP   21s
kubernetes   ClusterIP      10.96.0.1       <none>        443/TCP          23m
On cloud providers that support load balancers, an external IP address would be provisioned to access the Service. On minikube, the LoadBalancer type makes the Service accessible through the minikube service command.
Run the following command:
minikube service hello-node
-
Katacoda environment only: Click the plus sign, and then click Select port to view on Host 1.
-
Katacoda environment only: Note the 5-digit port number displayed opposite to
8080
in services output. This port number is randomly generated and it can be different for you. Type your number in the port number text box, then click Display Port. Using the example from earlier, you would type30369
.This opens up a browser window that serves your app and shows the app's response.
Enable addons
The minikube tool includes a set of built-in addons that can be enabled, disabled and opened in the local Kubernetes environment.
- List the currently supported addons:
minikube addons list
The output is similar to:
addon-manager: enabled
dashboard: enabled
default-storageclass: enabled
efk: disabled
freshpod: disabled
gvisor: disabled
helm-tiller: disabled
ingress: disabled
ingress-dns: disabled
logviewer: disabled
metrics-server: disabled
nvidia-driver-installer: disabled
nvidia-gpu-device-plugin: disabled
registry: disabled
registry-creds: disabled
storage-provisioner: enabled
storage-provisioner-gluster: disabled
- Enable an addon, for example, metrics-server:
minikube addons enable metrics-server
The output is similar to:
The 'metrics-server' addon is enabled
- View the Pod and Service you created:
kubectl get pod,svc -n kube-system
The output is similar to:
NAME                                   READY   STATUS    RESTARTS   AGE
pod/coredns-5644d7b6d9-mh9ll           1/1     Running   0          34m
pod/coredns-5644d7b6d9-pqd2t           1/1     Running   0          34m
pod/metrics-server-67fb648c5           1/1     Running   0          26s
pod/etcd-minikube                      1/1     Running   0          34m
pod/influxdb-grafana-b29w8             2/2     Running   0          26s
pod/kube-addon-manager-minikube        1/1     Running   0          34m
pod/kube-apiserver-minikube            1/1     Running   0          34m
pod/kube-controller-manager-minikube   1/1     Running   0          34m
pod/kube-proxy-rnlps                   1/1     Running   0          34m
pod/kube-scheduler-minikube            1/1     Running   0          34m
pod/storage-provisioner                1/1     Running   0          34m

NAME                          TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)             AGE
service/metrics-server        ClusterIP   10.96.241.45    <none>        80/TCP              26s
service/kube-dns              ClusterIP   10.96.0.10      <none>        53/UDP,53/TCP       34m
service/monitoring-grafana    NodePort    10.99.24.54     <none>        80:30002/TCP        26s
service/monitoring-influxdb   ClusterIP   10.111.169.94   <none>        8083/TCP,8086/TCP   26s
- Disable metrics-server:
minikube addons disable metrics-server
The output is similar to:
metrics-server was successfully disabled
Clean up
Now you can clean up the resources you created in your cluster:
kubectl delete service hello-node
kubectl delete deployment hello-node
Optionally, stop the Minikube virtual machine (VM):
minikube stop
Optionally, delete the Minikube VM:
minikube delete
What's next
- Learn more about Deployment objects.
- Learn more about Deploying applications.
- Learn more about Service objects.
5.2 - Learn Kubernetes Basics
Kubernetes Basics
This tutorial provides a walkthrough of the basics of the Kubernetes cluster orchestration system. Each module contains some background information on major Kubernetes features and concepts, and includes an interactive online tutorial. These interactive tutorials let you manage a simple cluster and its containerized applications for yourself.
Using the interactive tutorials, you can learn to:
- Deploy a containerized application on a cluster.
- Scale the deployment.
- Update the containerized application with a new software version.
- Debug the containerized application.
The tutorials use Katacoda to run a virtual terminal in your web browser that runs Minikube, a small-scale local deployment of Kubernetes that can run anywhere. There's no need to install any software or configure anything; each interactive tutorial runs directly in your web browser.
What can Kubernetes do for you?
With modern web services, users expect applications to be available 24/7, and developers expect to deploy new versions of those applications several times a day. Containerization helps package software to serve these goals, enabling applications to be released and updated without downtime. Kubernetes helps you make sure those containerized applications run where and when you want, and helps them find the resources and tools they need to work. Kubernetes is a production-ready, open source platform designed with Google's accumulated experience in container orchestration, combined with best-of-breed ideas from the community.
5.2.1 - Create a Cluster
Learn about Kubernetes cluster and create a simple cluster using Minikube.
5.2.1.1 - Using Minikube to Create a Cluster
Objectives
- Learn what a Kubernetes cluster is.
- Learn what Minikube is.
- Start a Kubernetes cluster using an online terminal.
Kubernetes Clusters
Kubernetes coordinates a highly available cluster of computers that are connected to work as a single unit. The abstractions in Kubernetes allow you to deploy containerized applications to a cluster without tying them specifically to individual machines. To make use of this new model of deployment, applications need to be packaged in a way that decouples them from individual hosts: they need to be containerized. Containerized applications are more flexible and available than in past deployment models, where applications were installed directly onto specific machines as packages deeply integrated into the host. Kubernetes automates the distribution and scheduling of application containers across a cluster in a more efficient way. Kubernetes is an open-source platform and is production-ready.
A Kubernetes cluster consists of two types of resources:
- The Control Plane coordinates the cluster
- Nodes are the workers that run applications
Summary:
- Kubernetes cluster
- Minikube
Kubernetes is a production-grade, open-source platform that orchestrates the placement (scheduling) and execution of application containers within and across computer clusters.
Cluster Diagram
The Control Plane is responsible for managing the cluster. The Control Plane coordinates all activities in your cluster, such as scheduling applications, maintaining applications' desired state, scaling applications, and rolling out new updates.
A node is a VM or a physical computer that serves as a worker machine in a Kubernetes cluster. Each node has a Kubelet, which is an agent for managing the node and communicating with the Kubernetes control plane. The node should also have tools for handling container operations, such as containerd or Docker. A Kubernetes cluster that handles production traffic should have a minimum of three nodes because if one node goes down, both an etcd member and a control plane instance are lost, and redundancy is compromised. You can mitigate this risk by adding more control plane nodes.
Control Planes manage the cluster and the nodes that are used to host the running applications.
When you deploy applications on Kubernetes, you tell the control plane to start the application containers. The control plane schedules the containers to run on the cluster's nodes. The nodes communicate with the control plane using the Kubernetes API, which the control plane exposes. End users can also use the Kubernetes API directly to interact with the cluster.
A Kubernetes cluster can be deployed on either physical or virtual machines. To get started with Kubernetes development, you can use Minikube. Minikube is a lightweight Kubernetes implementation that creates a VM on your local machine and deploys a simple cluster containing only one node. Minikube is available for Linux, macOS, and Windows systems. The Minikube CLI provides basic bootstrapping operations for working with your cluster, including start, stop, status, and delete. For this tutorial, however, you'll use a provided online terminal with Minikube pre-installed.
Now that you know what Kubernetes is, let's go to the online tutorial and start our first cluster!
5.2.1.2 - Interactive Tutorial - Creating a Cluster
5.2.2 - Deploy an App
5.2.2.1 - Using kubectl to Create a Deployment
Objectives
- Learn about application Deployments.
- Deploy your first app on Kubernetes with kubectl.
Kubernetes Deployments
Once you have a running Kubernetes cluster, you can deploy your containerized applications on top of it. To do so, you create a Kubernetes Deployment configuration. The Deployment instructs Kubernetes how to create and update instances of your application. Once you've created a Deployment, the Kubernetes control plane schedules the application instances included in that Deployment to run on individual Nodes in the cluster.
Once the application instances are created, a Kubernetes Deployment Controller continuously monitors those instances. If the Node hosting an instance goes down or is deleted, the Deployment controller replaces the instance with an instance on another Node in the cluster. This provides a self-healing mechanism to address machine failure or maintenance.
In a pre-orchestration world, installation scripts would often be used to start applications, but they did not allow recovery from machine failure. By both creating your application instances and keeping them running across Nodes, Kubernetes Deployments provide a fundamentally different approach to application management.
Summary:
- Deployments
- Kubectl
A Deployment is responsible for creating and updating instances of your application
Deploying your first app on Kubernetes
You can create and manage a Deployment by using the Kubernetes command line interface, Kubectl. Kubectl uses the Kubernetes API to interact with the cluster. In this module, you'll learn the most common Kubectl commands needed to create Deployments that run your applications on a Kubernetes cluster.
When you create a Deployment, you'll need to specify the container image for your application and the number of replicas that you want to run. You can change that information later by updating your Deployment; Modules 5 and 6 of the bootcamp discuss how you can scale and update your Deployments.
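For example (a sketch; the image shown is the sample used by this bootcamp's interactive tutorial, and you can substitute your own):
# create a Deployment running the bootcamp sample image
kubectl create deployment kubernetes-bootcamp --image=gcr.io/google-samples/kubernetes-bootcamp:v1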
Applications need to be packaged into one of the supported container formats in order to be deployed on Kubernetes
For your first Deployment, you'll use a hello-node application packaged in a Docker container that uses NGINX to echo back all the requests. (If you didn't already try creating a hello-node application and deploying it using a container, you can do that first by following the instructions from the Hello Minikube tutorial).
Now that you know what Deployments are, let's go to the online tutorial and deploy our first app!
5.2.2.2 - Interactive Tutorial - Deploying an App
A Pod is the basic execution unit of a Kubernetes application. Each Pod represents a part of a workload that is running on your cluster. Learn more about Pods.
5.2.3 - Explore Your App
5.2.3.1 - Viewing Pods and Nodes
Objectives
- Learn about Kubernetes Pods.
- Learn about Kubernetes Nodes.
- Troubleshoot deployed applications.
Kubernetes Pods
When you created a Deployment in Module 2, Kubernetes created a Pod to host your application instance. A Pod is a Kubernetes abstraction that represents a group of one or more application containers (such as Docker), and some shared resources for those containers. Those resources include:
- Shared storage, as Volumes
- Networking, as a unique cluster IP address
- Information about how to run each container, such as the container image version or specific ports to use
A Pod models an application-specific "logical host" and can contain different application containers which are relatively tightly coupled. For example, a Pod might include both the container with your Node.js app as well as a different container that feeds the data to be published by the Node.js webserver. The containers in a Pod share an IP Address and port space, are always co-located and co-scheduled, and run in a shared context on the same Node.
Pods are the atomic unit on the Kubernetes platform. When we create a Deployment on Kubernetes, that Deployment creates Pods with containers inside them (as opposed to creating containers directly). Each Pod is tied to the Node where it is scheduled, and remains there until termination (according to restart policy) or deletion. In case of a Node failure, identical Pods are scheduled on other available Nodes in the cluster.
Summary:
- Pods
- Nodes
- Kubectl main commands
A Pod is a group of one or more application containers (such as Docker) and includes shared storage (volumes), IP address and information about how to run them.
Pods overview
Nodes
A Pod always runs on a Node. A Node is a worker machine in Kubernetes and may be either a virtual or a physical machine, depending on the cluster. Each Node is managed by the control plane. A Node can have multiple pods, and the Kubernetes control plane automatically handles scheduling the pods across the Nodes in the cluster. The control plane's automatic scheduling takes into account the available resources on each Node.
Every Kubernetes Node runs at least:
- Kubelet, a process responsible for communication between the Kubernetes control plane and the Node; it manages the Pods and the containers running on a machine.
- A container runtime (like Docker) responsible for pulling the container image from a registry, unpacking the container, and running the application.
Containers should only be scheduled together in a single Pod if they are tightly coupled and need to share resources such as disk.
Node overview
Troubleshooting with kubectl
In Module 2, you used the kubectl command-line interface. You'll continue to use it in Module 3 to get information about deployed applications and their environments. The most common operations can be done with the following kubectl commands:
- kubectl get - list resources
- kubectl describe - show detailed information about a resource
- kubectl logs - print the logs from a container in a pod
- kubectl exec - execute a command on a container in a pod
You can use these commands to see when applications were deployed, what their current statuses are, where they are running and what their configurations are.
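For example, assuming a Pod named hello-node-abc123 (a hypothetical name; substitute one from your own cluster):
# list resources
kubectl get pods
# show detailed information about the Pod
kubectl describe pods hello-node-abc123
# print the logs from the Pod's container
kubectl logs hello-node-abc123
# execute a command inside the Pod's container
kubectl exec hello-node-abc123 -- env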
Now that we know more about our cluster components and the command line, let's explore our application.
A node is a worker machine in Kubernetes and may be a VM or physical machine, depending on the cluster. Multiple Pods can run on one Node.
5.2.3.2 - Interactive Tutorial - Exploring Your App
5.2.4 - Expose Your App Publicly
5.2.4.1 - Using a Service to Expose Your App
Objectives
- Learn about a Service in Kubernetes
- Understand how labels and LabelSelector objects relate to a Service
- Expose an application outside a Kubernetes cluster using a Service
Overview of Kubernetes Services
Kubernetes Pods are mortal. Pods in fact have a lifecycle. When a worker node dies, the Pods running on the Node are also lost. A ReplicaSet might then dynamically drive the cluster back to desired state via creation of new Pods to keep your application running. As another example, consider an image-processing backend with 3 replicas. Those replicas are exchangeable; the front-end system should not care about backend replicas or even if a Pod is lost and recreated. That said, each Pod in a Kubernetes cluster has a unique IP address, even Pods on the same Node, so there needs to be a way of automatically reconciling changes among Pods so that your applications continue to function.
A Service in Kubernetes is an abstraction which defines a logical set of Pods and a policy by which to access them. Services enable a loose coupling between dependent Pods. A Service is defined using YAML (preferred) or JSON, like all Kubernetes objects. The set of Pods targeted by a Service is usually determined by a LabelSelector (see below for why you might want a Service without including selector in the spec).
Although each Pod has a unique IP address, those IPs are not exposed outside the cluster without a Service. Services allow your applications to receive traffic. Services can be exposed in different ways by specifying a type in the ServiceSpec:
- ClusterIP (default) - Exposes the Service on an internal IP in the cluster. This type makes the Service only reachable from within the cluster.
- NodePort - Exposes the Service on the same port of each selected Node in the cluster using NAT. Makes a Service accessible from outside the cluster using <NodeIP>:<NodePort>. Superset of ClusterIP.
- LoadBalancer - Creates an external load balancer in the current cloud (if supported) and assigns a fixed, external IP to the Service. Superset of NodePort.
- ExternalName - Maps the Service to the contents of the externalName field (e.g. foo.bar.example.com), by returning a CNAME record with its value. No proxying of any kind is set up. This type requires v1.7 or higher of kube-dns, or CoreDNS version 0.0.8 or higher.
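As a minimal sketch, a NodePort Service selecting Pods labeled app: MyApp might look like this (the names and ports are illustrative):
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  type: NodePort
  selector:
    app: MyApp
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080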
More information about the different types of Services can be found in the Using Source IP tutorial. Also see Connecting Applications with Services.
Additionally, note that there are some use cases with Services that involve not defining selector in the spec. A Service created without selector will also not create the corresponding Endpoints object. This allows users to manually map a Service to specific endpoints. Another possibility why there may be no selector is that you are strictly using type: ExternalName.
Summary
- Exposing Pods to external traffic
- Load balancing traffic across multiple Pods
- Using labels
A Kubernetes Service is an abstraction layer which defines a logical set of Pods and enables external traffic exposure, load balancing and service discovery for those Pods.
Services and Labels
A Service routes traffic across a set of Pods. Services are the abstraction that allow pods to die and replicate in Kubernetes without impacting your application. Discovery and routing among dependent Pods (such as the frontend and backend components in an application) is handled by Kubernetes Services.
Services match a set of Pods using labels and selectors, a grouping primitive that allows logical operation on objects in Kubernetes. Labels are key/value pairs attached to objects and can be used in any number of ways:
- Designate objects for development, test, and production
- Embed version tags
- Classify an object using tags
Labels can be attached to objects at creation time or later on. They can be modified at any time. Let's expose our application now using a Service and apply some labels.
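For example (the Pod name and the label key/value are illustrative):
# add a label to a running Pod
kubectl label pods hello-node-abc123 version=v1
# list only the Pods that carry that label
kubectl get pods -l version=v1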
5.2.4.2 - Interactive Tutorial - Exposing Your App
5.2.5 - Scale Your App
5.2.5.1 - Running Multiple Instances of Your App
Objectives
- Scale an app using kubectl.
Scaling an application
In the previous modules we created a Deployment, and then exposed it publicly via a Service. The Deployment created only one Pod for running our application. When traffic increases, we will need to scale the application to keep up with user demand.
Scaling is accomplished by changing the number of replicas in a Deployment
Summary:
- Scaling a Deployment
You can create a Deployment with multiple instances from the start by using the --replicas parameter of the kubectl create deployment command, as sketched below.
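For example (the Deployment name and image are illustrative):
# create a Deployment with 3 replicas from the start
kubectl create deployment my-app --image=nginx --replicas=3
# later, scale the existing Deployment to 4 replicas
kubectl scale deployment/my-app --replicas=4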
Scaling overview
Scaling out a Deployment will ensure new Pods are created and scheduled to Nodes with available resources. Scaling will increase the number of Pods to the new desired state. Kubernetes also supports autoscaling of Pods, but it is outside of the scope of this tutorial. Scaling to zero is also possible, and it will terminate all Pods of the specified Deployment.
Running multiple instances of an application will require a way to distribute the traffic to all of them. Services have an integrated load-balancer that will distribute network traffic to all Pods of an exposed Deployment. Services continuously monitor the running Pods using endpoints, to ensure the traffic is sent only to available Pods.
Scaling is accomplished by changing the number of replicas in a Deployment.
Once you have multiple instances of an Application running, you would be able to do Rolling updates without downtime. We'll cover that in the next module. Now, let's go to the online terminal and scale our application.
5.2.5.2 - Interactive Tutorial - Scaling Your App
5.2.6 - Update Your App
5.2.6.1 - Performing a Rolling Update
Objectives
- Perform a rolling update using kubectl.
Updating an application
Users expect applications to be available all the time and developers are expected to deploy new versions of them several times a day. In Kubernetes this is done with rolling updates. Rolling updates allow Deployments' update to take place with zero downtime by incrementally updating Pods instances with new ones. The new Pods will be scheduled on Nodes with available resources.
In the previous module we scaled our application to run multiple instances. This is a requirement for performing updates without affecting application availability. By default, the maximum number of Pods that can be unavailable during the update and the maximum number of new Pods that can be created is one each. Both options can be configured to either numbers or percentages (of Pods). In Kubernetes, updates are versioned and any Deployment update can be reverted to a previous (stable) version.
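For example (the Deployment, container, and image names are illustrative):
# roll out a new image version
kubectl set image deployment/my-app my-app=nginx:1.21
# watch the rollout progress
kubectl rollout status deployment/my-app
# revert to the previous version if needed
kubectl rollout undo deployment/my-app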
Summary:
- Updating an app
Rolling updates allow Deployments' update to take place with zero downtime by incrementally updating Pods instances with new ones.
Rolling updates overview
Similar to application Scaling, if a Deployment is exposed publicly, the Service will load-balance the traffic only to available Pods during the update. An available Pod is an instance that is available to the users of the application.
Rolling updates allow the following actions:
- Promote an application from one environment to another (via container image updates)
- Rollback to previous versions
- Continuous Integration and Continuous Delivery of applications with zero downtime
If a Deployment is exposed publicly, the Service will load-balance the traffic only to available Pods during the update.
In the following interactive tutorial, we'll update our application to a new version, and also perform a rollback.
5.2.6.2 - Interactive Tutorial - Updating Your App
5.3 - Configuration
5.3.1 - Example: Configuring a Java Microservice
5.3.1.1 - Externalizing config using MicroProfile, ConfigMaps and Secrets
In this tutorial you will learn how and why to externalize your microservice’s configuration. Specifically, you will learn how to use Kubernetes ConfigMaps and Secrets to set environment variables and then consume them using MicroProfile Config.
Before you begin
Creating Kubernetes ConfigMaps & Secrets
There are several ways to set environment variables for a Docker container in Kubernetes, including: Dockerfile, kubernetes.yml, Kubernetes ConfigMaps, and Kubernetes Secrets. In this tutorial, you will learn how to use the latter two for setting your environment variables whose values will be injected into your microservices. One of the benefits of using ConfigMaps and Secrets is that they can be re-used across multiple containers, including being assigned to different environment variables for the different containers.
ConfigMaps are API Objects that store non-confidential key-value pairs. In the Interactive Tutorial you will learn how to use a ConfigMap to store the application's name. For more information regarding ConfigMaps, you can find the documentation here.
Although Secrets are also used to store key-value pairs, they differ from ConfigMaps in that they're intended for confidential/sensitive information and are stored using Base64 encoding. This makes secrets the appropriate choice for storing such things as credentials, keys, and tokens, the former of which you'll do in the Interactive Tutorial. For more information on Secrets, you can find the documentation here.
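As a sketch of what you'll build in the Interactive Tutorial (the object names, keys, and values here are illustrative), a ConfigMap and a Secret can be created from the command line and then surfaced to a container as environment variables:
# create a ConfigMap holding the application's name
kubectl create configmap app-config --from-literal=APP_NAME=my-app
# create a Secret holding a credential
kubectl create secret generic app-credentials --from-literal=username=admin
The container section of a Pod or Deployment spec can then reference them:
env:
  - name: APP_NAME
    valueFrom:
      configMapKeyRef:
        name: app-config
        key: APP_NAME
  - name: USERNAME
    valueFrom:
      secretKeyRef:
        name: app-credentials
        key: username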
Externalizing Config from Code
Externalized application configuration is useful because configuration usually changes depending on your environment. In order to accomplish this, we'll use Java's Contexts and Dependency Injection (CDI) and MicroProfile Config. MicroProfile Config is a feature of MicroProfile, a set of open Java technologies for developing and deploying cloud-native microservices.
CDI provides a standard dependency injection capability enabling an application to be assembled from collaborating, loosely-coupled beans. MicroProfile Config provides apps and microservices a standard way to obtain config properties from various sources, including the application, runtime, and environment. Based on the source's defined priority, the properties are automatically combined into a single set of properties that the application can access via an API. Together, CDI & MicroProfile will be used in the Interactive Tutorial to retrieve the externally provided properties from the Kubernetes ConfigMaps and Secrets and inject them into your application code.
Many open source frameworks and runtimes implement and support MicroProfile Config. Throughout the interactive tutorial, you'll be using Open Liberty, a flexible open-source Java runtime for building and running cloud-native apps and microservices. However, any MicroProfile compatible runtime could be used instead.
Objectives
- Create a Kubernetes ConfigMap and Secret
- Inject microservice configuration using MicroProfile Config
Example: Externalizing config using MicroProfile, ConfigMaps and Secrets
Start Interactive Tutorial
5.3.1.2 - Interactive Tutorial - Configuring a Java Microservice
5.3.2 - Configuring Redis using a ConfigMap
This page provides a real world example of how to configure Redis using a ConfigMap and builds upon the Configure Containers Using a ConfigMap task.
Objectives
- Create a ConfigMap with Redis configuration values
- Create a Redis Pod that mounts and uses the created ConfigMap
- Verify that the configuration was correctly applied.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
To check the version, enter kubectl version.
- The example shown on this page works with kubectl 1.14 and above.
- Understand Configure Containers Using a ConfigMap.
Real World Example: Configuring Redis using a ConfigMap
Follow the steps below to configure a Redis cache using data stored in a ConfigMap.
First create a ConfigMap with an empty configuration block:
cat <<EOF >./example-redis-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: example-redis-config
data:
  redis-config: ""
EOF
Apply the ConfigMap created above, along with a Redis pod manifest:
kubectl apply -f example-redis-config.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes/website/main/content/en/examples/pods/config/redis-pod.yaml
Examine the contents of the Redis pod manifest and note the following:
- A volume named config is created by spec.volumes[1].
- The key and path under spec.volumes[1].items[0] expose the redis-config key from the example-redis-config ConfigMap as a file named redis.conf on the config volume.
- The config volume is then mounted at /redis-master by spec.containers[0].volumeMounts[1].
This has the net effect of exposing the data in data.redis-config from the example-redis-config ConfigMap above as /redis-master/redis.conf inside the Pod.
apiVersion: v1
kind: Pod
metadata:
name: redis
spec:
containers:
- name: redis
image: redis:5.0.4
command:
- redis-server
- "/redis-master/redis.conf"
env:
- name: MASTER
value: "true"
ports:
- containerPort: 6379
resources:
limits:
cpu: "0.1"
volumeMounts:
- mountPath: /redis-master-data
name: data
- mountPath: /redis-master
name: config
volumes:
- name: data
emptyDir: {}
- name: config
configMap:
name: example-redis-config
items:
- key: redis-config
path: redis.conf
Examine the created objects:
kubectl get pod/redis configmap/example-redis-config
You should see the following output:
NAME READY STATUS RESTARTS AGE
pod/redis 1/1 Running 0 8s
NAME DATA AGE
configmap/example-redis-config 1 14s
Recall that we left the redis-config key in the example-redis-config ConfigMap blank:
kubectl describe configmap/example-redis-config
You should see an empty redis-config key:
Name: example-redis-config
Namespace: default
Labels: <none>
Annotations: <none>
Data
====
redis-config:
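As a quick sanity check, you can also confirm that this (currently empty) value was projected into the file inside the Pod; the command below should print an empty file:
kubectl exec redis -- cat /redis-master/redis.conf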
Use kubectl exec to enter the pod and run the redis-cli tool to check the current configuration:
kubectl exec -it redis -- redis-cli
Check maxmemory:
127.0.0.1:6379> CONFIG GET maxmemory
It should show the default value of 0:
1) "maxmemory"
2) "0"
Similarly, check maxmemory-policy:
127.0.0.1:6379> CONFIG GET maxmemory-policy
It should also show its default value of noeviction:
1) "maxmemory-policy"
2) "noeviction"
Now let's add some configuration values to the example-redis-config ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
name: example-redis-config
data:
redis-config: |
maxmemory 2mb
maxmemory-policy allkeys-lru
Apply the updated ConfigMap:
kubectl apply -f example-redis-config.yaml
Confirm that the ConfigMap was updated:
kubectl describe configmap/example-redis-config
You should see the configuration values we just added:
Name: example-redis-config
Namespace: default
Labels: <none>
Annotations: <none>
Data
====
redis-config:
----
maxmemory 2mb
maxmemory-policy allkeys-lru
Check the Redis Pod again using redis-cli via kubectl exec to see if the configuration was applied:
kubectl exec -it redis -- redis-cli
Check maxmemory:
127.0.0.1:6379> CONFIG GET maxmemory
It remains at the default value of 0:
1) "maxmemory"
2) "0"
Similarly, maxmemory-policy remains at its noeviction default setting:
127.0.0.1:6379> CONFIG GET maxmemory-policy
Returns:
1) "maxmemory-policy"
2) "noeviction"
The configuration values reported by Redis have not changed because Redis only reads its configuration file at startup: the Pod needs to be restarted to pick up the updated values from its associated ConfigMap. Let's delete and recreate the Pod:
kubectl delete pod redis
kubectl apply -f https://raw.githubusercontent.com/kubernetes/website/main/content/en/examples/pods/config/redis-pod.yaml
Now re-check the configuration values one last time:
kubectl exec -it redis -- redis-cli
Check maxmemory:
127.0.0.1:6379> CONFIG GET maxmemory
It should now return the updated value of 2097152:
1) "maxmemory"
2) "2097152"
Similarly, maxmemory-policy has also been updated:
127.0.0.1:6379> CONFIG GET maxmemory-policy
It now reflects the desired value of allkeys-lru:
1) "maxmemory-policy"
2) "allkeys-lru"
Clean up your work by deleting the created resources:
kubectl delete pod/redis configmap/example-redis-config
What's next
- Learn more about ConfigMaps.
5.4 - Security
5.4.1 - Apply Pod Security Standards at the Cluster Level
Note
This tutorial applies only to new clusters. Pod Security Admission (PSA) is enabled by default in v1.23 and later, as it has graduated to beta.
Pod Security is an admission controller that carries out checks against the Kubernetes Pod Security Standards when new pods are created. This tutorial shows you how to enforce the baseline Pod Security Standard at the cluster level, which applies a standard configuration to all namespaces in a cluster.
To apply Pod Security Standards to specific namespaces, refer to Apply Pod Security Standards at the namespace level.
Before you begin
Install the following on your workstation:
- KinD
- kubectl
Choose the right Pod Security Standard to apply
Pod Security Admission lets you apply built-in Pod Security Standards with the following modes: enforce, audit, and warn.
To gather information that helps you to choose the Pod Security Standards that are most appropriate for your configuration, do the following:
- Create a cluster with no Pod Security Standards applied:
kind create cluster --name psa-wo-cluster-pss --image kindest/node:v1.23.0
The output is similar to this:
Creating cluster "psa-wo-cluster-pss" ...
 ✓ Ensuring node image (kindest/node:v1.23.0) 🖼
 ✓ Preparing nodes 📦
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing CNI 🔌
 ✓ Installing StorageClass 💾
Set kubectl context to "kind-psa-wo-cluster-pss"
You can now use your cluster with:
kubectl cluster-info --context kind-psa-wo-cluster-pss
Thanks for using kind! 😊
- Set the kubectl context to the new cluster:
kubectl cluster-info --context kind-psa-wo-cluster-pss
The output is similar to this:
Kubernetes control plane is running at https://127.0.0.1:61350
CoreDNS is running at https://127.0.0.1:61350/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
- Get a list of namespaces in the cluster:
kubectl get ns
The output is similar to this:
NAME                 STATUS   AGE
default              Active   9m30s
kube-node-lease      Active   9m32s
kube-public          Active   9m32s
kube-system          Active   9m32s
local-path-storage   Active   9m26s
- Use --dry-run=server to understand what happens when different Pod Security Standards are applied:
  - Privileged
kubectl label --dry-run=server --overwrite ns --all \
  pod-security.kubernetes.io/enforce=privileged
The output is similar to this:
namespace/default labeled
namespace/kube-node-lease labeled
namespace/kube-public labeled
namespace/kube-system labeled
namespace/local-path-storage labeled
  - Baseline
kubectl label --dry-run=server --overwrite ns --all \
  pod-security.kubernetes.io/enforce=baseline
The output is similar to this:
namespace/default labeled
namespace/kube-node-lease labeled
namespace/kube-public labeled
Warning: existing pods in namespace "kube-system" violate the new PodSecurity enforce level "baseline:latest"
Warning: etcd-psa-wo-cluster-pss-control-plane (and 3 other pods): host namespaces, hostPath volumes
Warning: kindnet-vzj42: non-default capabilities, host namespaces, hostPath volumes
Warning: kube-proxy-m6hwf: host namespaces, hostPath volumes, privileged
namespace/kube-system labeled
namespace/local-path-storage labeled
  - Restricted
kubectl label --dry-run=server --overwrite ns --all \
  pod-security.kubernetes.io/enforce=restricted
The output is similar to this:
namespace/default labeled
namespace/kube-node-lease labeled
namespace/kube-public labeled
Warning: existing pods in namespace "kube-system" violate the new PodSecurity enforce level "restricted:latest"
Warning: coredns-7bb9c7b568-hsptc (and 1 other pod): unrestricted capabilities, runAsNonRoot != true, seccompProfile
Warning: etcd-psa-wo-cluster-pss-control-plane (and 3 other pods): host namespaces, hostPath volumes, allowPrivilegeEscalation != false, unrestricted capabilities, restricted volume types, runAsNonRoot != true
Warning: kindnet-vzj42: non-default capabilities, host namespaces, hostPath volumes, allowPrivilegeEscalation != false, unrestricted capabilities, restricted volume types, runAsNonRoot != true, seccompProfile
Warning: kube-proxy-m6hwf: host namespaces, hostPath volumes, privileged, allowPrivilegeEscalation != false, unrestricted capabilities, restricted volume types, runAsNonRoot != true, seccompProfile
namespace/kube-system labeled
Warning: existing pods in namespace "local-path-storage" violate the new PodSecurity enforce level "restricted:latest"
Warning: local-path-provisioner-d6d9f7ffc-lw9lh: allowPrivilegeEscalation != false, unrestricted capabilities, runAsNonRoot != true, seccompProfile
namespace/local-path-storage labeled
From the previous output, you'll notice that applying the privileged Pod Security Standard shows no warnings for any namespaces. However, the baseline and restricted standards both have warnings, specifically in the kube-system namespace.
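Before labeling anything for real, it can also help to see which Pod Security labels (if any) each namespace already carries; the --label-columns (-L) flag of kubectl get prints them as columns (they will be empty until you actually apply labels):
kubectl get ns -L pod-security.kubernetes.io/enforce \
  -L pod-security.kubernetes.io/warn \
  -L pod-security.kubernetes.io/audit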
Set modes, versions and standards
In this section, you apply the following Pod Security Standards to the latest version:
- baseline standard in enforce mode.
- restricted standard in warn and audit mode.
The baseline Pod Security Standard provides a convenient middle ground that allows keeping the exemption list short and prevents known privilege escalations.
Additionally, to prevent pods from failing in kube-system, you'll exempt the namespace from having Pod Security Standards applied.
When you implement Pod Security Admission in your own environment, consider the following:
- Based on the risk posture applied to a cluster, a stricter Pod Security Standard like restricted might be a better choice.
- Exempting the kube-system namespace allows pods to run as privileged in this namespace. For real-world use, the Kubernetes project strongly recommends that you apply strict RBAC policies that limit access to kube-system, following the principle of least privilege.
To implement the preceding standards, do the following:
Create a configuration file that can be consumed by the Pod Security Admission Controller to implement these Pod Security Standards:
mkdir -p /tmp/pss cat <<EOF > /tmp/pss/cluster-level-pss.yaml apiVersion: apiserver.config.k8s.io/v1 kind: AdmissionConfiguration plugins: - name: PodSecurity configuration: apiVersion: pod-security.admission.config.k8s.io/v1beta1 kind: PodSecurityConfiguration defaults: enforce: "baseline" enforce-version: "latest" audit: "restricted" audit-version: "latest" warn: "restricted" warn-version: "latest" exemptions: usernames: [] runtimeClasses: [] namespaces: [kube-system] EOF
- Configure the API server to consume this file during cluster creation:
cat <<EOF > /tmp/pss/cluster-config.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  kubeadmConfigPatches:
  - |
    kind: ClusterConfiguration
    apiServer:
      extraArgs:
        admission-control-config-file: /etc/config/cluster-level-pss.yaml
      extraVolumes:
      - name: accf
        hostPath: /etc/config
        mountPath: /etc/config
        readOnly: false
        pathType: "DirectoryOrCreate"
  extraMounts:
  - hostPath: /tmp/pss
    containerPath: /etc/config
    # optional: if set, the mount is read-only.
    # default false
    readOnly: false
    # optional: if set, the mount needs SELinux relabeling.
    # default false
    selinuxRelabel: false
    # optional: set propagation mode (None, HostToContainer or Bidirectional)
    # see https://kubernetes.io/docs/concepts/storage/volumes/#mount-propagation
    # default None
    propagation: None
EOF
Note: If you use Docker Desktop with KinD on macOS, you can add /tmp as a Shared Directory under the menu item Preferences > Resources > File Sharing.
- Create a cluster that uses Pod Security Admission to apply these Pod Security Standards:
kind create cluster --name psa-with-cluster-pss --image kindest/node:v1.23.0 --config /tmp/pss/cluster-config.yaml
The output is similar to this:
Creating cluster "psa-with-cluster-pss" ...
 ✓ Ensuring node image (kindest/node:v1.23.0) 🖼
 ✓ Preparing nodes 📦
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing CNI 🔌
 ✓ Installing StorageClass 💾
Set kubectl context to "kind-psa-with-cluster-pss"
You can now use your cluster with:
kubectl cluster-info --context kind-psa-with-cluster-pss
Have a question, bug, or feature request? Let us know! https://kind.sigs.k8s.io/#community 🙂
- Point kubectl to the cluster:
kubectl cluster-info --context kind-psa-with-cluster-pss
The output is similar to this:
Kubernetes control plane is running at https://127.0.0.1:63855
CoreDNS is running at https://127.0.0.1:63855/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
- Create the following Pod specification for a minimal configuration in the default namespace:
cat <<EOF > /tmp/pss/nginx-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - image: nginx
    name: nginx
    ports:
    - containerPort: 80
EOF
- Create the Pod in the cluster:
kubectl apply -f /tmp/pss/nginx-pod.yaml
The output is similar to this:
Warning: would violate PodSecurity "restricted:latest": allowPrivilegeEscalation != false (container "nginx" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "nginx" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "nginx" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "nginx" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
pod/nginx created
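For comparison, here is a sketch of a Pod spec whose securityContext addresses each check named in that warning (allowPrivilegeEscalation, dropped capabilities, runAsNonRoot, and a seccomp profile). Whether the stock nginx image can actually run as a non-root user is a separate concern, so treat this as an illustration of the fields rather than a drop-in fix:
cat <<EOF > /tmp/pss/nginx-pod-restricted.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-restricted
spec:
  containers:
  - image: nginx
    name: nginx
    ports:
    - containerPort: 80
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
      runAsNonRoot: true
      seccompProfile:
        type: RuntimeDefault
EOF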
Clean up
Run kind delete cluster --name psa-with-cluster-pss and kind delete cluster --name psa-wo-cluster-pss to delete the clusters you created.
What's next
- Run a shell script to perform all the preceding steps at once:
  - Create a Pod Security Standards based cluster level Configuration
  - Create a file to let the API server consume this configuration
  - Create a cluster that creates an API server with this configuration
  - Set kubectl context to this new cluster
  - Create a minimal pod yaml file
  - Apply this file to create a Pod in the new cluster
- Pod Security Admission
- Pod Security Standards
- Apply Pod Security Standards at the namespace level
5.4.2 - Apply Pod Security Standards at the Namespace Level
Note
This tutorial applies only to new clusters. Pod Security Admission (PSA) is enabled by default in v1.23 and later, as it has graduated to beta. Pod Security Admission is an admission controller that applies Pod Security Standards when pods are created. In this tutorial, you will enforce the baseline Pod Security Standard, one namespace at a time.
You can also apply Pod Security Standards to multiple namespaces at once at the cluster level. For instructions, refer to Apply Pod Security Standards at the cluster level.
Before you begin
Install the following on your workstation:
- KinD
- kubectl
Create cluster
- Create a KinD cluster as follows:
kind create cluster --name psa-ns-level --image kindest/node:v1.23.0
The output is similar to this:
Creating cluster "psa-ns-level" ...
 ✓ Ensuring node image (kindest/node:v1.23.0) 🖼
 ✓ Preparing nodes 📦
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing CNI 🔌
 ✓ Installing StorageClass 💾
Set kubectl context to "kind-psa-ns-level"
You can now use your cluster with:
kubectl cluster-info --context kind-psa-ns-level
Not sure what to do next? 😅 Check out https://kind.sigs.k8s.io/docs/user/quick-start/
- Set the kubectl context to the new cluster:
kubectl cluster-info --context kind-psa-ns-level
The output is similar to this:
Kubernetes control plane is running at https://127.0.0.1:50996
CoreDNS is running at https://127.0.0.1:50996/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
Create a namespace
Create a new namespace called example:
kubectl create ns example
The output is similar to this:
namespace/example created
Apply Pod Security Standards
- Enable Pod Security Standards on this namespace using labels supported by built-in Pod Security Admission. In this step you will configure a warning on the baseline Pod Security Standard as per the latest version (default value):
kubectl label --overwrite ns example \
  pod-security.kubernetes.io/warn=baseline \
  pod-security.kubernetes.io/warn-version=latest
- Multiple Pod Security Standards can be enabled on any namespace, using labels. The following command will enforce the baseline Pod Security Standard, but warn and audit for the restricted Pod Security Standard as per the latest version (default value):
kubectl label --overwrite ns example \
  pod-security.kubernetes.io/enforce=baseline \
  pod-security.kubernetes.io/enforce-version=latest \
  pod-security.kubernetes.io/warn=restricted \
  pod-security.kubernetes.io/warn-version=latest \
  pod-security.kubernetes.io/audit=restricted \
  pod-security.kubernetes.io/audit-version=latest
Verify the Pod Security Standards
- Create a minimal pod in the example namespace:
cat <<EOF > /tmp/pss/nginx-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - image: nginx
    name: nginx
    ports:
    - containerPort: 80
EOF
- Apply the pod spec to the cluster in the example namespace:
kubectl apply -n example -f /tmp/pss/nginx-pod.yaml
The output is similar to this:
Warning: would violate PodSecurity "restricted:latest": allowPrivilegeEscalation != false (container "nginx" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "nginx" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "nginx" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "nginx" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost") pod/nginx created
- Apply the pod spec to the cluster in the default namespace:
kubectl apply -n default -f /tmp/pss/nginx-pod.yaml
Output is similar to this:
pod/nginx created
The Pod Security Standards were applied only to the example namespace. You could create the same Pod in the default namespace with no warnings.
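To confirm which Pod Security labels each namespace now carries, you can compare the two namespaces directly:
kubectl get ns example default --show-labels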
Clean up
Run kind delete cluster --name psa-ns-level to delete the cluster you created.
What's next
- Run a shell script to perform all the preceding steps at once:
  - Create a KinD cluster
  - Create a new namespace
  - Apply the baseline Pod Security Standard in enforce mode while also applying the restricted Pod Security Standard in warn and audit mode
  - Create a new pod with the preceding Pod Security Standards applied
5.4.3 - Restrict a Container's Access to Resources with AppArmor
Kubernetes v1.4 [beta]
AppArmor is a Linux kernel security module that supplements the standard Linux user and group based permissions to confine programs to a limited set of resources. AppArmor can be configured for any application to reduce its potential attack surface and provide greater defense in depth. It is configured through profiles tuned to allow the access needed by a specific program or container, such as Linux capabilities, network access, file permissions, and so on. Each profile can be run in either enforcing mode, which blocks access to disallowed resources, or complain mode, which only reports violations.
AppArmor can help you to run a more secure deployment by restricting what containers are allowed to do, and/or provide better auditing through system logs. However, it is important to keep in mind that AppArmor is not a silver bullet and can only do so much to protect against exploits in your application code. It is important to provide good, restrictive profiles, and harden your applications and cluster from other angles as well.
Objectives
- See an example of how to load a profile on a node
- Learn how to enforce the profile on a Pod
- Learn how to check that the profile is loaded
- See what happens when a profile is violated
- See what happens when a profile cannot be loaded
Before you begin
Make sure:
- Kubernetes version is at least v1.4 -- Kubernetes support for AppArmor was added in v1.4. Kubernetes components older than v1.4 are not aware of the new AppArmor annotations, and will silently ignore any AppArmor settings that are provided. To ensure that your Pods are receiving the expected protections, it is important to verify the Kubelet version of your nodes:
kubectl get nodes -o=jsonpath=$'{range .items[*]}{@.metadata.name}: {@.status.nodeInfo.kubeletVersion}\n{end}'
gke-test-default-pool-239f5d02-gyn2: v1.4.0 gke-test-default-pool-239f5d02-x1kf: v1.4.0 gke-test-default-pool-239f5d02-xwux: v1.4.0
- AppArmor kernel module is enabled -- For the Linux kernel to enforce an AppArmor profile, the AppArmor kernel module must be installed and enabled. Several distributions enable the module by default, such as Ubuntu and SUSE, and many others provide optional support. To check whether the module is enabled, check the /sys/module/apparmor/parameters/enabled file:
cat /sys/module/apparmor/parameters/enabled
Y
If the Kubelet contains AppArmor support (>= v1.4), it will refuse to run a Pod with AppArmor options if the kernel module is not enabled.
- Container runtime supports AppArmor -- Currently all common Kubernetes-supported container runtimes should support AppArmor, like Docker, CRI-O or containerd. Please refer to the corresponding runtime documentation and verify that the cluster fulfills the requirements to use AppArmor.
- Profile is loaded -- AppArmor is applied to a Pod by specifying an AppArmor profile that each container should be run with. If any of the specified profiles is not already loaded in the kernel, the Kubelet (>= v1.4) will reject the Pod. You can view which profiles are loaded on a node by checking the /sys/kernel/security/apparmor/profiles file. For example:
ssh gke-test-default-pool-239f5d02-gyn2 "sudo cat /sys/kernel/security/apparmor/profiles | sort"
apparmor-test-deny-write (enforce) apparmor-test-audit-write (enforce) docker-default (enforce) k8s-nginx (enforce)
For more details on loading profiles on nodes, see Setting up nodes with profiles.
As long as the Kubelet version includes AppArmor support (>= v1.4), the Kubelet will reject a Pod with AppArmor options if any of the prerequisites are not met. You can also verify AppArmor support on nodes by checking the node ready condition message (though this is likely to be removed in a later release):
kubectl get nodes -o=jsonpath=$'{range .items[*]}{@.metadata.name}: {.status.conditions[?(@.reason=="KubeletReady")].message}\n{end}'
gke-test-default-pool-239f5d02-gyn2: kubelet is posting ready status. AppArmor enabled
gke-test-default-pool-239f5d02-x1kf: kubelet is posting ready status. AppArmor enabled
gke-test-default-pool-239f5d02-xwux: kubelet is posting ready status. AppArmor enabled
Securing a Pod
AppArmor profiles are specified per-container. To specify the AppArmor profile to run a Pod container with, add an annotation to the Pod's metadata:
container.apparmor.security.beta.kubernetes.io/<container_name>: <profile_ref>
Where <container_name> is the name of the container to apply the profile to, and <profile_ref> specifies the profile to apply. The profile_ref can be one of:
- runtime/default to apply the runtime's default profile
- localhost/<profile_name> to apply the profile loaded on the host with the name <profile_name>
- unconfined to indicate that no profiles will be loaded
See the API Reference for the full details on the annotation and profile name formats.
Kubernetes AppArmor enforcement works by first checking that all the prerequisites have been met, and then forwarding the profile selection to the container runtime for enforcement. If the prerequisites have not been met, the Pod will be rejected, and will not run.
To verify that the profile was applied, you can look for the AppArmor security option listed in the container created event:
kubectl get events | grep Created
22s 22s 1 hello-apparmor Pod spec.containers{hello} Normal Created {kubelet e2e-test-stclair-node-pool-31nt} Created container with docker id 269a53b202d3; Security:[seccomp=unconfined apparmor=k8s-apparmor-example-deny-write]
You can also verify directly that the container's root process is running with the correct profile by checking its proc attr:
kubectl exec <pod_name> -- cat /proc/1/attr/current
k8s-apparmor-example-deny-write (enforce)
Example
This example assumes you have already set up a cluster with AppArmor support.
First, we need to load the profile we want to use onto our nodes. This profile denies all file writes:
#include <tunables/global>
profile k8s-apparmor-example-deny-write flags=(attach_disconnected) {
#include <abstractions/base>
file,
# Deny all file writes.
deny /** w,
}
Since we don't know where the Pod will be scheduled, we'll need to load the profile on all our nodes. For this example we'll use SSH to install the profiles, but other approaches are discussed in Setting up nodes with profiles.
NODES=(
# The SSH-accessible domain names of your nodes
gke-test-default-pool-239f5d02-gyn2.us-central1-a.my-k8s
gke-test-default-pool-239f5d02-x1kf.us-central1-a.my-k8s
gke-test-default-pool-239f5d02-xwux.us-central1-a.my-k8s)
for NODE in ${NODES[*]}; do ssh $NODE 'sudo apparmor_parser -q <<EOF
#include <tunables/global>
profile k8s-apparmor-example-deny-write flags=(attach_disconnected) {
#include <abstractions/base>
file,
# Deny all file writes.
deny /** w,
}
EOF'
done
Next, we'll run a simple "Hello AppArmor" pod with the deny-write profile:
apiVersion: v1
kind: Pod
metadata:
name: hello-apparmor
annotations:
# Tell Kubernetes to apply the AppArmor profile "k8s-apparmor-example-deny-write".
# Note that this is ignored if the Kubernetes node is not running version 1.4 or greater.
container.apparmor.security.beta.kubernetes.io/hello: localhost/k8s-apparmor-example-deny-write
spec:
containers:
- name: hello
image: busybox:1.28
command: [ "sh", "-c", "echo 'Hello AppArmor!' && sleep 1h" ]
kubectl create -f ./hello-apparmor.yaml
If we look at the pod events, we can see that the Pod container was created with the AppArmor profile "k8s-apparmor-example-deny-write":
kubectl get events | grep hello-apparmor
14s 14s 1 hello-apparmor Pod Normal Scheduled {default-scheduler } Successfully assigned hello-apparmor to gke-test-default-pool-239f5d02-gyn2
14s 14s 1 hello-apparmor Pod spec.containers{hello} Normal Pulling {kubelet gke-test-default-pool-239f5d02-gyn2} pulling image "busybox"
13s 13s 1 hello-apparmor Pod spec.containers{hello} Normal Pulled {kubelet gke-test-default-pool-239f5d02-gyn2} Successfully pulled image "busybox"
13s 13s 1 hello-apparmor Pod spec.containers{hello} Normal Created {kubelet gke-test-default-pool-239f5d02-gyn2} Created container with docker id 06b6cd1c0989; Security:[seccomp=unconfined apparmor=k8s-apparmor-example-deny-write]
13s 13s 1 hello-apparmor Pod spec.containers{hello} Normal Started {kubelet gke-test-default-pool-239f5d02-gyn2} Started container with docker id 06b6cd1c0989
We can verify that the container is actually running with that profile by checking its proc attr:
kubectl exec hello-apparmor -- cat /proc/1/attr/current
k8s-apparmor-example-deny-write (enforce)
Finally, we can see what happens if we try to violate the profile by writing to a file:
kubectl exec hello-apparmor -- touch /tmp/test
touch: /tmp/test: Permission denied
error: error executing remote command: command terminated with non-zero exit code: Error executing in Docker Container: 1
To wrap up, let's look at what happens if we try to specify a profile that hasn't been loaded:
kubectl create -f /dev/stdin <<EOF
apiVersion: v1
kind: Pod
metadata:
name: hello-apparmor-2
annotations:
container.apparmor.security.beta.kubernetes.io/hello: localhost/k8s-apparmor-example-allow-write
spec:
containers:
- name: hello
image: busybox:1.28
command: [ "sh", "-c", "echo 'Hello AppArmor!' && sleep 1h" ]
EOF
pod/hello-apparmor-2 created
kubectl describe pod hello-apparmor-2
Name: hello-apparmor-2
Namespace: default
Node: gke-test-default-pool-239f5d02-x1kf/
Start Time: Tue, 30 Aug 2016 17:58:56 -0700
Labels: <none>
Annotations: container.apparmor.security.beta.kubernetes.io/hello=localhost/k8s-apparmor-example-allow-write
Status: Pending
Reason: AppArmor
Message: Pod Cannot enforce AppArmor: profile "k8s-apparmor-example-allow-write" is not loaded
IP:
Controllers: <none>
Containers:
hello:
Container ID:
Image: busybox
Image ID:
Port:
Command:
sh
-c
echo 'Hello AppArmor!' && sleep 1h
State: Waiting
Reason: Blocked
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-dnz7v (ro)
Conditions:
Type Status
Initialized True
Ready False
PodScheduled True
Volumes:
default-token-dnz7v:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-dnz7v
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: <none>
Events:
FirstSeen LastSeen Count From SubobjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
23s 23s 1 {default-scheduler } Normal Scheduled Successfully assigned hello-apparmor-2 to e2e-test-stclair-node-pool-t1f5
23s 23s 1 {kubelet e2e-test-stclair-node-pool-t1f5} Warning AppArmor Cannot enforce AppArmor: profile "k8s-apparmor-example-allow-write" is not loaded
Note the pod status is Pending, with a helpful error message: Pod Cannot enforce AppArmor: profile "k8s-apparmor-example-allow-write" is not loaded. An event was also recorded with the same message.
Administration
Setting up nodes with profiles
Kubernetes does not currently provide any native mechanisms for loading AppArmor profiles onto nodes. There are many ways to set up the profiles though, such as:
- Through a DaemonSet that runs a Pod on each node to ensure the correct profiles are loaded. An example implementation can be found here.
- At node initialization time, using your node initialization scripts (e.g. Salt, Ansible, etc.) or image.
- By copying the profiles to each node and loading them through SSH, as demonstrated in the Example.
The scheduler is not aware of which profiles are loaded onto which node, so the full set of profiles must be loaded onto every node. An alternative approach is to add a node label for each profile (or class of profiles) on the node, and use a node selector to ensure the Pod is run on a node with the required profile.
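A short sketch of that node-label approach; the label key and value here are made up for illustration:
# Mark a node on which the profile has been loaded (hypothetical label key):
kubectl label node <node-name> apparmor-profile-deny-write=loaded
# Pods that need the profile can then pin themselves to such nodes with:
#   spec:
#     nodeSelector:
#       apparmor-profile-deny-write: loaded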
Restricting profiles with the PodSecurityPolicy
If the PodSecurityPolicy extension is enabled, cluster-wide AppArmor restrictions can be applied. To enable the PodSecurityPolicy, the following flag must be set on the apiserver:
--enable-admission-plugins=PodSecurityPolicy[,others...]
The AppArmor options can be specified as annotations on the PodSecurityPolicy:
apparmor.security.beta.kubernetes.io/defaultProfileName: <profile_ref>
apparmor.security.beta.kubernetes.io/allowedProfileNames: <profile_ref>[,others...]
The default profile name option specifies the profile to apply to containers by default when none is specified. The allowed profile names option specifies a list of profiles that Pod containers are allowed to be run with. If both options are provided, the default must be allowed. The profiles are specified in the same format as on containers. See the API Reference for the full specification.
Disabling AppArmor
If you do not want AppArmor to be available on your cluster, it can be disabled by a command-line flag:
--feature-gates=AppArmor=false
When disabled, any Pod that includes an AppArmor profile will fail validation with a "Forbidden" error.
Authoring Profiles
Getting AppArmor profiles specified correctly can be a tricky business. Fortunately there are some tools to help with that:
- aa-genprof and aa-logprof generate profile rules by monitoring an application's activity and logs, and admitting the actions it takes. Further instructions are provided by the AppArmor documentation.
- bane is an AppArmor profile generator for Docker that uses a simplified profile language.
To debug problems with AppArmor, you can check the system logs to see what, specifically, was denied. AppArmor logs verbose messages to dmesg, and errors can usually be found in the system logs or through journalctl. More information is provided in AppArmor failures.
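For example, on many distributions either of the following surfaces AppArmor denial records, though the exact commands and log format vary by distribution:
sudo dmesg | grep -i apparmor
sudo journalctl -k | grep 'apparmor="DENIED"'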
API Reference
Pod Annotation
Specifying the profile a container will run with:
- key: container.apparmor.security.beta.kubernetes.io/<container_name>, where <container_name> matches the name of a container in the Pod. A separate profile can be specified for each container in the Pod.
- value: a profile reference, described below
Profile Reference
- runtime/default: Refers to the default runtime profile.
  - Equivalent to not specifying a profile (without a PodSecurityPolicy default), except it still requires AppArmor to be enabled.
  - In practice, many container runtimes use the same OCI default profile, defined here: https://github.com/containers/common/blob/main/pkg/apparmor/apparmor_linux_template.go
- localhost/<profile_name>: Refers to a profile loaded on the node (localhost) by name.
  - The possible profile names are detailed in the core policy reference.
- unconfined: This effectively disables AppArmor on the container.
Any other profile reference format is invalid.
PodSecurityPolicy Annotations
Specifying the default profile to apply to containers when none is provided:
- key: apparmor.security.beta.kubernetes.io/defaultProfileName
- value: a profile reference, described above
Specifying the list of profiles Pod containers are allowed to specify:
- key: apparmor.security.beta.kubernetes.io/allowedProfileNames
- value: a comma-separated list of profile references (described above)
- Although an escaped comma is a legal character in a profile name, it cannot be explicitly allowed here.
What's next
Additional resources:
5.4.4 - Restrict a Container's Syscalls with seccomp
Kubernetes v1.19 [stable]
Seccomp stands for secure computing mode and has been a feature of the Linux kernel since version 2.6.12. It can be used to sandbox the privileges of a process, restricting the calls it is able to make from userspace into the kernel. Kubernetes lets you automatically apply seccomp profiles loaded onto a node to your Pods and containers.
Identifying the privileges required for your workloads can be difficult. In this tutorial, you will go through how to load seccomp profiles into a local Kubernetes cluster, how to apply them to a Pod, and how you can begin to craft profiles that give only the necessary privileges to your container processes.
Objectives
- Learn how to load seccomp profiles on a node
- Learn how to apply a seccomp profile to a container
- Observe auditing of syscalls made by a container process
- Observe behavior when a missing profile is specified
- Observe a violation of a seccomp profile
- Learn how to create fine-grained seccomp profiles
- Learn how to apply a container runtime default seccomp profile
Before you begin
In order to complete all steps in this tutorial, you must install kind and kubectl.
This tutorial shows some examples that are still alpha (since v1.22) and others that use only generally available seccomp functionality. You should make sure that your cluster is configured correctly for the version you are using.
The tutorial also uses the curl tool for downloading examples to your computer. You can adapt the steps to use a different tool if you prefer.
Note: It is not possible to apply a seccomp profile to a container running with privileged: true set in the container's securityContext. Privileged containers always run as Unconfined.
Download example seccomp profiles
The contents of these profiles will be explored later on, but for now go ahead
and download them into a directory named profiles/
so that they can be loaded
into the cluster.
{
"defaultAction": "SCMP_ACT_LOG"
}
{
"defaultAction": "SCMP_ACT_ERRNO"
}
{
"defaultAction": "SCMP_ACT_ERRNO",
"architectures": [
"SCMP_ARCH_X86_64",
"SCMP_ARCH_X86",
"SCMP_ARCH_X32"
],
"syscalls": [
{
"names": [
"accept4",
"epoll_wait",
"pselect6",
"futex",
"madvise",
"epoll_ctl",
"getsockname",
"setsockopt",
"vfork",
"mmap",
"read",
"write",
"close",
"arch_prctl",
"sched_getaffinity",
"munmap",
"brk",
"rt_sigaction",
"rt_sigprocmask",
"sigaltstack",
"gettid",
"clone",
"bind",
"socket",
"openat",
"readlinkat",
"exit_group",
"epoll_create1",
"listen",
"rt_sigreturn",
"sched_yield",
"clock_gettime",
"connect",
"dup2",
"epoll_pwait",
"execve",
"exit",
"fcntl",
"getpid",
"getuid",
"ioctl",
"mprotect",
"nanosleep",
"open",
"poll",
"recvfrom",
"sendto",
"set_tid_address",
"setitimer",
"writev"
],
"action": "SCMP_ACT_ALLOW"
}
]
}
Run these commands:
mkdir ./profiles
curl -L -o profiles/audit.json https://k8s.io/examples/pods/security/seccomp/profiles/audit.json
curl -L -o profiles/violation.json https://k8s.io/examples/pods/security/seccomp/profiles/violation.json
curl -L -o profiles/fine-grained.json https://k8s.io/examples/pods/security/seccomp/profiles/fine-grained.json
ls profiles
You should see three profiles listed at the end of the final step:
audit.json fine-grained.json violation.json
Create a local Kubernetes cluster with kind
For simplicity, kind can be used to create a single node cluster with the seccomp profiles loaded. Kind runs Kubernetes in Docker, so each node of the cluster is a container. This allows for files to be mounted in the filesystem of each container similar to loading files onto a node.
apiVersion: kind.x-k8s.io/v1alpha4
kind: Cluster
nodes:
- role: control-plane
extraMounts:
- hostPath: "./profiles"
containerPath: "/var/lib/kubelet/seccomp/profiles"
Download that example kind configuration, and save it to a file named kind.yaml:
curl -L -O https://k8s.io/examples/pods/security/seccomp/kind.yaml
You can set a specific Kubernetes version by setting the node's container image. See Nodes within the kind documentation about configuration for more details on this. This tutorial assumes you are using Kubernetes v1.23.
As an alpha feature, you can configure Kubernetes to use the profile that the container runtime prefers by default, rather than falling back to Unconfined. If you want to try that, see Enable the use of RuntimeDefault as the default seccomp profile for all workloads before you continue.
Once you have a kind configuration in place, create the kind cluster with that configuration:
kind create cluster --config=kind.yaml
After the new Kubernetes cluster is ready, identify the Docker container running as the single node cluster:
docker ps
You should see output indicating that a container is running with the name kind-control-plane. The output is similar to:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
6a96207fed4b kindest/node:v1.18.2 "/usr/local/bin/entr…" 27 seconds ago Up 24 seconds 127.0.0.1:42223->6443/tcp kind-control-plane
If observing the filesystem of that container, you should see that the profiles/ directory has been successfully loaded into the default seccomp path of the kubelet. Use docker exec to run a command in that node container:
# Change 6a96207fed4b to the container ID you saw from "docker ps"
docker exec -it 6a96207fed4b ls /var/lib/kubelet/seccomp/profiles
audit.json fine-grained.json violation.json
You have verified that these seccomp profiles are available to the kubelet running within kind.
Enable the use of RuntimeDefault as the default seccomp profile for all workloads
Kubernetes v1.22 [alpha]
SeccompDefault is an optional kubelet feature gate, as well as a corresponding --seccomp-default command line flag. Both have to be enabled simultaneously to use the feature.
If enabled, the kubelet will use the RuntimeDefault seccomp profile by default, which is defined by the container runtime, instead of using the Unconfined (seccomp disabled) mode.
The default profiles aim to provide a strong set
of security defaults while preserving the functionality of the workload. It is
possible that the default profiles differ between container runtimes and their
release versions, for example when comparing those from CRI-O and containerd.
Note: Enabling the feature neither changes the securityContext.seccompProfile API field nor adds the deprecated annotations to the workload. This provides users the possibility to roll back at any time without actually changing the workload configuration. Tools like crictl inspect can be used to verify which seccomp profile is being used by a container.
Some workloads may require fewer syscall restrictions than others. This means that they can fail during runtime even with the RuntimeDefault profile. To mitigate such a failure, you can:
- Run the workload explicitly as Unconfined.
- Disable the SeccompDefault feature for the nodes, making sure that workloads get scheduled on nodes where the feature is disabled.
- Create a custom seccomp profile for the workload.
If you were introducing this feature into a production-like cluster, the Kubernetes project recommends that you enable this feature gate on a subset of your nodes and then test workload execution before rolling the change out cluster-wide.
More detailed information about a possible upgrade and downgrade strategy can be found in the related Kubernetes Enhancement Proposal (KEP).
Since the feature is in alpha state it is disabled by default. To enable it, pass the flags --feature-gates=SeccompDefault=true --seccomp-default to the kubelet CLI or enable it via the kubelet configuration file. To enable the feature gate in kind, ensure that kind provides the minimum required Kubernetes version and enables the SeccompDefault feature in the kind configuration:
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
SeccompDefault: true
nodes:
- role: control-plane
image: kindest/node:v1.23.0@sha256:49824ab1727c04e56a21a5d8372a402fcd32ea51ac96a2706a12af38934f81ac
kubeadmConfigPatches:
- |
kind: JoinConfiguration
nodeRegistration:
kubeletExtraArgs:
seccomp-default: "true"
- role: worker
image: kindest/node:v1.23.0@sha256:49824ab1727c04e56a21a5d8372a402fcd32ea51ac96a2706a12af38934f81ac
kubeadmConfigPatches:
- |
kind: JoinConfiguration
nodeRegistration:
kubeletExtraArgs:
feature-gates: SeccompDefault=true
seccomp-default: "true"
Once the cluster is ready, run a pod:
kubectl run --rm -it --restart=Never --image=alpine alpine -- sh
The pod should now have the default seccomp profile attached. This can be verified by using docker exec to run crictl inspect for the container on the kind worker:
docker exec -it kind-worker bash -c \
'crictl inspect $(crictl ps --name=alpine -q) | jq .info.runtimeSpec.linux.seccomp'
{
"defaultAction": "SCMP_ACT_ERRNO",
"architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_X86", "SCMP_ARCH_X32"],
"syscalls": [
{
"names": ["..."]
}
]
}
Create a Pod with a seccomp profile for syscall auditing
To start off, apply the audit.json profile, which will log all syscalls of the process, to a new Pod.
Here's a manifest for that Pod:
apiVersion: v1
kind: Pod
metadata:
name: audit-pod
labels:
app: audit-pod
spec:
securityContext:
seccompProfile:
type: Localhost
localhostProfile: profiles/audit.json
containers:
- name: test-container
image: hashicorp/http-echo:0.2.3
args:
- "-text=just made some syscalls!"
securityContext:
allowPrivilegeEscalation: false
Note: The functional support for the already deprecated seccomp annotations seccomp.security.alpha.kubernetes.io/pod (for the whole pod) and container.seccomp.security.alpha.kubernetes.io/[name] (for a single container) is going to be removed with the release of Kubernetes v1.25. Please always use the native API fields in favor of the annotations.
Create the Pod in the cluster:
kubectl apply -f https://k8s.io/examples/pods/security/seccomp/ga/audit-pod.yaml
This profile does not restrict any syscalls, so the Pod should start successfully.
kubectl get pod/audit-pod
NAME READY STATUS RESTARTS AGE
audit-pod 1/1 Running 0 30s
In order to interact with the endpoint exposed by this container, create a NodePort Service that allows access to the endpoint from inside the kind control plane container.
kubectl expose pod audit-pod --type NodePort --port 5678
Check what port the Service has been assigned on the node.
kubectl get service audit-pod
The output is similar to:
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
audit-pod NodePort 10.111.36.142 <none> 5678:32373/TCP 72s
Now you can use curl to access that endpoint from inside the kind control plane container, at the port exposed by this Service. Use docker exec to run the curl command within the control plane container:
# Change 6a96207fed4b to the control plane container ID you saw from "docker ps"
docker exec -it 6a96207fed4b curl localhost:32373
just made some syscalls!
You can see that the process is running, but what syscalls did it actually make? Because this Pod is running in a local cluster, you should be able to see those in /var/log/syslog. Open up a new terminal window and tail the output for calls from http-echo:
tail -f /var/log/syslog | grep 'http-echo'
You should already see some logs of syscalls made by http-echo, and if you curl the endpoint in the control plane container you will see more written. For example:
Jul 6 15:37:40 my-machine kernel: [369128.669452] audit: type=1326 audit(1594067860.484:14536): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=29064 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=51 compat=0 ip=0x46fe1f code=0x7ffc0000
Jul 6 15:37:40 my-machine kernel: [369128.669453] audit: type=1326 audit(1594067860.484:14537): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=29064 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=54 compat=0 ip=0x46fdba code=0x7ffc0000
Jul 6 15:37:40 my-machine kernel: [369128.669455] audit: type=1326 audit(1594067860.484:14538): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=29064 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=202 compat=0 ip=0x455e53 code=0x7ffc0000
Jul 6 15:37:40 my-machine kernel: [369128.669456] audit: type=1326 audit(1594067860.484:14539): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=29064 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=288 compat=0 ip=0x46fdba code=0x7ffc0000
Jul 6 15:37:40 my-machine kernel: [369128.669517] audit: type=1326 audit(1594067860.484:14540): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=29064 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=0 compat=0 ip=0x46fd44 code=0x7ffc0000
Jul 6 15:37:40 my-machine kernel: [369128.669519] audit: type=1326 audit(1594067860.484:14541): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=29064 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=270 compat=0 ip=0x4559b1 code=0x7ffc0000
Jul 6 15:38:40 my-machine kernel: [369188.671648] audit: type=1326 audit(1594067920.488:14559): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=29064 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=270 compat=0 ip=0x4559b1 code=0x7ffc0000
Jul 6 15:38:40 my-machine kernel: [369188.671726] audit: type=1326 audit(1594067920.488:14560): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=29064 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=202 compat=0 ip=0x455e53 code=0x7ffc0000
You can begin to understand the syscalls required by the http-echo process by looking at the syscall= entry on each line. While these are unlikely to encompass all syscalls it uses, they can serve as a basis for a seccomp profile for this container.
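As a rough starting point, assuming audit lines like the ones shown above, you can extract the distinct syscall numbers that http-echo triggered and translate them to names with ausyscall (shipped with the auditd/audit userspace package, if installed):
# List the distinct syscall numbers logged for http-echo:
grep 'comm="http-echo"' /var/log/syslog | grep -o 'syscall=[0-9]*' | sort -u
# Translate a number to a name; on x86-64, syscall 51 is getsockname:
ausyscall 51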
Clean up that Pod and Service before moving to the next section:
kubectl delete service audit-pod --wait
kubectl delete pod audit-pod --wait --now
Create Pod with seccomp profile that causes violation
For demonstration, apply a profile to the Pod that does not allow for any syscalls.
The manifest for this demonstration is:
apiVersion: v1
kind: Pod
metadata:
name: violation-pod
labels:
app: violation-pod
spec:
securityContext:
seccompProfile:
type: Localhost
localhostProfile: profiles/violation.json
containers:
- name: test-container
image: hashicorp/http-echo:0.2.3
args:
- "-text=just made some syscalls!"
securityContext:
allowPrivilegeEscalation: false
Attempt to create the Pod in the cluster:
kubectl apply -f https://k8s.io/examples/pods/security/seccomp/ga/violation-pod.yaml
The Pod is created, but there is an issue. If you check the status of the Pod, you should see that it failed to start.
kubectl get pod/violation-pod
NAME READY STATUS RESTARTS AGE
violation-pod 0/1 CrashLoopBackOff 1 6s
As seen in the previous example, the http-echo process requires quite a few syscalls. Here seccomp has been instructed to error on any syscall by setting "defaultAction": "SCMP_ACT_ERRNO". This is extremely secure, but removes the ability to do anything meaningful. What you really want is to give workloads only the privileges they need.
Clean up that Pod before moving to the next section:
kubectl delete pod violation-pod --wait --now
Create Pod with seccomp profile that only allows necessary syscalls
If you take a look at the fine-grained.json profile, you will notice some of the syscalls seen in the syslog of the first example, where the profile set "defaultAction": "SCMP_ACT_LOG". Now the profile is setting "defaultAction": "SCMP_ACT_ERRNO", but explicitly allowing a set of syscalls in the "action": "SCMP_ACT_ALLOW" block. Ideally, the container will run successfully and you will see no messages sent to syslog.
The manifest for this example is:
apiVersion: v1
kind: Pod
metadata:
name: fine-pod
labels:
app: fine-pod
spec:
securityContext:
seccompProfile:
type: Localhost
localhostProfile: profiles/fine-grained.json
containers:
- name: test-container
image: hashicorp/http-echo:0.2.3
args:
- "-text=just made some syscalls!"
securityContext:
allowPrivilegeEscalation: false
Create the Pod in your cluster:
kubectl apply -f https://k8s.io/examples/pods/security/seccomp/ga/fine-pod.yaml
kubectl get pod fine-pod
The Pod should be showing as having started successfully:
NAME READY STATUS RESTARTS AGE
fine-pod 1/1 Running 0 30s
Open up a new terminal window and use tail to monitor for log entries that mention calls from http-echo:
# The log path on your computer might be different from "/var/log/syslog"
tail -f /var/log/syslog | grep 'http-echo'
Next, expose the Pod with a NodePort Service:
kubectl expose pod fine-pod --type NodePort --port 5678
Check what port the Service has been assigned on the node:
kubectl get service fine-pod
The output is similar to:
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
fine-pod NodePort 10.111.36.142 <none> 5678:32373/TCP 72s
Use curl to access that endpoint from inside the kind control plane container:
# Change 6a96207fed4b to the control plane container ID you saw from "docker ps"
docker exec -it 6a96207fed4b curl localhost:32373
just made some syscalls!
You should see no output in the syslog. This is because the profile allowed all necessary syscalls and specified that an error should occur if one outside of the list is invoked. This is an ideal situation from a security perspective, but required some effort in analyzing the program. It would be nice if there was a simple way to get closer to this security without requiring as much effort.
Clean up that Pod and Service before moving to the next section:
kubectl delete service fine-pod --wait
kubectl delete pod fine-pod --wait --now
Create Pod that uses the container runtime default seccomp profile
Most container runtimes provide a sane set of default syscalls that are allowed or not. You can adopt these defaults for your workload by setting the seccomp type in the security context of a pod or container to RuntimeDefault.
Note: If you have the SeccompDefault feature gate enabled, then Pods use the RuntimeDefault seccomp profile whenever no other seccomp profile is specified. Otherwise, the default is Unconfined.
Here's a manifest for a Pod that requests the RuntimeDefault seccomp profile for all its containers:
apiVersion: v1
kind: Pod
metadata:
name: default-pod
labels:
app: default-pod
spec:
securityContext:
seccompProfile:
type: RuntimeDefault
containers:
- name: test-container
image: hashicorp/http-echo:0.2.3
args:
- "-text=just made some more syscalls!"
securityContext:
allowPrivilegeEscalation: false
Create that Pod:
kubectl apply -f https://k8s.io/examples/pods/security/seccomp/ga/default-pod.yaml
kubectl get pod default-pod
The Pod should be showing as having started successfully:
NAME READY STATUS RESTARTS AGE
default-pod 1/1 Running 0 20s
Finally, now that you saw that work OK, clean up:
kubectl delete pod default-pod --wait --now
What's next
You can learn more about Linux seccomp:
5.5 - Stateless Applications
5.5.1 - Exposing an External IP Address to Access an Application in a Cluster
This page shows how to create a Kubernetes Service object that exposes an external IP address.
Before you begin
- Install kubectl.
- Use a cloud provider like Google Kubernetes Engine or Amazon Web Services to create a Kubernetes cluster. This tutorial creates an external load balancer, which requires a cloud provider.
- Configure kubectl to communicate with your Kubernetes API server. For instructions, see the documentation for your cloud provider.
Objectives
- Run five instances of a Hello World application.
- Create a Service object that exposes an external IP address.
- Use the Service object to access the running application.
Creating a service for an application running in five pods
- Run a Hello World application in your cluster:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/name: load-balancer-example
  name: hello-world
spec:
  replicas: 5
  selector:
    matchLabels:
      app.kubernetes.io/name: load-balancer-example
  template:
    metadata:
      labels:
        app.kubernetes.io/name: load-balancer-example
    spec:
      containers:
      - image: gcr.io/google-samples/node-hello:1.0
        name: hello-world
        ports:
        - containerPort: 8080
kubectl apply -f https://k8s.io/examples/service/load-balancer-example.yaml
The preceding command creates a Deployment and an associated ReplicaSet. The ReplicaSet has five Pods each of which runs the Hello World application.
- Display information about the Deployment:
kubectl get deployments hello-world
kubectl describe deployments hello-world
- Display information about your ReplicaSet objects:
kubectl get replicasets
kubectl describe replicasets
- Create a Service object that exposes the deployment:
kubectl expose deployment hello-world --type=LoadBalancer --name=my-service
- Display information about the Service:
kubectl get services my-service
The output is similar to:
NAME         TYPE           CLUSTER-IP     EXTERNAL-IP      PORT(S)    AGE
my-service   LoadBalancer   10.3.245.137   104.198.205.71   8080/TCP   54s
Note: The type=LoadBalancer Service is backed by external cloud providers, which is not covered in this example. Please refer to this page for the details.
Note: If the external IP address is shown as <pending>, wait for a minute and enter the same command again.
- Display detailed information about the Service:
kubectl describe services my-service
The output is similar to:
Name:                     my-service
Namespace:                default
Labels:                   app.kubernetes.io/name=load-balancer-example
Annotations:              <none>
Selector:                 app.kubernetes.io/name=load-balancer-example
Type:                     LoadBalancer
IP:                       10.3.245.137
LoadBalancer Ingress:     104.198.205.71
Port:                     <unset>  8080/TCP
NodePort:                 <unset>  32377/TCP
Endpoints:                10.0.0.6:8080,10.0.1.6:8080,10.0.1.7:8080 + 2 more...
Session Affinity:         None
Events:                   <none>
Make a note of the external IP address (LoadBalancer Ingress) exposed by your service. In this example, the external IP address is 104.198.205.71. Also note the value of Port and NodePort. In this example, the Port is 8080 and the NodePort is 32377.
- In the preceding output, you can see that the service has several endpoints: 10.0.0.6:8080, 10.0.1.6:8080, 10.0.1.7:8080 + 2 more. These are internal addresses of the pods that are running the Hello World application. To verify these are pod addresses, enter this command (a complementary Endpoints check appears after this procedure):
kubectl get pods --output=wide
The output is similar to:
NAME ... IP NODE hello-world-2895499144-1jaz9 ... 10.0.1.6 gke-cluster-1-default-pool-e0b8d269-1afc hello-world-2895499144-2e5uh ... 10.0.1.8 gke-cluster-1-default-pool-e0b8d269-1afc hello-world-2895499144-9m4h1 ... 10.0.0.6 gke-cluster-1-default-pool-e0b8d269-5v7a hello-world-2895499144-o4z13 ... 10.0.1.7 gke-cluster-1-default-pool-e0b8d269-1afc hello-world-2895499144-segjf ... 10.0.2.5 gke-cluster-1-default-pool-e0b8d269-cpuc
- Use the external IP address (LoadBalancer Ingress) to access the Hello World application:
curl http://<external-ip>:<port>
where <external-ip> is the external IP address (LoadBalancer Ingress) of your Service, and <port> is the value of Port in your Service description. If you are using minikube, typing minikube service my-service will automatically open the Hello World application in a browser.
The response to a successful request is a hello message:
Hello Kubernetes!
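As a complementary check to the pod listing in step 7, the Endpoints object for the Service lists the same pod addresses in one place:
kubectl get endpoints my-service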
Cleaning up
To delete the Service, enter this command:
kubectl delete services my-service
To delete the Deployment, the ReplicaSet, and the Pods that are running the Hello World application, enter this command:
kubectl delete deployment hello-world
What's next
Learn more about connecting applications with services.
5.5.2 - Example: Deploying PHP Guestbook application with Redis
This tutorial shows you how to build and deploy a simple (not production ready), multi-tier web application using Kubernetes and Docker. This example consists of the following components:
- A single-instance Redis to store guestbook entries
- Multiple web frontend instances
Objectives
- Start up a Redis leader.
- Start up two Redis followers.
- Start up the guestbook frontend.
- Expose and view the Frontend Service.
- Clean up.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
Your Kubernetes server must be at or later than version v1.14. To check the version, enter kubectl version.
Start up the Redis Database
The guestbook application uses Redis to store its data.
Creating the Redis Deployment
The manifest file, included below, specifies a Deployment controller that runs a single-replica Redis Pod.
# SOURCE: https://cloud.google.com/kubernetes-engine/docs/tutorials/guestbook
apiVersion: apps/v1
kind: Deployment
metadata:
name: redis-leader
labels:
app: redis
role: leader
tier: backend
spec:
replicas: 1
selector:
matchLabels:
app: redis
template:
metadata:
labels:
app: redis
role: leader
tier: backend
spec:
containers:
- name: leader
image: "docker.io/redis:6.0.5"
resources:
requests:
cpu: 100m
memory: 100Mi
ports:
- containerPort: 6379
-
Launch a terminal window in the directory you downloaded the manifest files.
-
Apply the Redis Deployment from the redis-leader-deployment.yaml file:
kubectl apply -f https://k8s.io/examples/application/guestbook/redis-leader-deployment.yaml
-
Query the list of Pods to verify that the Redis Pod is running:
kubectl get pods
The response should be similar to this:
NAME                           READY   STATUS    RESTARTS   AGE
redis-leader-fb76b4755-xjr2n   1/1     Running   0          13s
-
Run the following command to view the logs from the Redis leader Pod:
kubectl logs -f deployment/redis-leader
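Optionally, you can confirm that the leader is answering commands. This check is not part of the original steps; it assumes a recent kubectl that resolves deployment/ targets for exec (otherwise substitute the Pod name), and relies on redis-cli shipping in the redis image:
kubectl exec -it deployment/redis-leader -- redis-cli ping
A healthy leader replies with PONG.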
Creating the Redis leader Service
The guestbook application needs to communicate with the Redis leader to write its data. You need to apply a Service to proxy the traffic to the Redis Pod. A Service defines a policy to access the Pods.
# SOURCE: https://cloud.google.com/kubernetes-engine/docs/tutorials/guestbook
apiVersion: v1
kind: Service
metadata:
name: redis-leader
labels:
app: redis
role: leader
tier: backend
spec:
ports:
- port: 6379
targetPort: 6379
selector:
app: redis
role: leader
tier: backend
-
Apply the Redis Service from the following redis-leader-service.yaml file:
kubectl apply -f https://k8s.io/examples/application/guestbook/redis-leader-service.yaml
-
Query the list of Services to verify that the Redis Service is running:
kubectl get service
The response should be similar to this:
NAME           TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
kubernetes     ClusterIP   10.0.0.1       <none>        443/TCP    1m
redis-leader   ClusterIP   10.103.78.24   <none>        6379/TCP   16s
Note: This manifest file creates a Service named redis-leader with a set of labels that match the labels previously defined, so the Service routes network traffic to the Redis Pod.
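As a quick, optional sanity check (assuming your cluster DNS is healthy), you can resolve the Service name from a throwaway Pod; this step is not in the original tutorial:
kubectl run -i --rm dns-check --image=busybox:1.28 --restart=Never -- nslookup redis-leader
The lookup should return the redis-leader ClusterIP shown above.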
Set up Redis followers
Although the Redis leader is a single Pod, you can make it highly available and meet traffic demands by adding a few Redis followers, or replicas.
# SOURCE: https://cloud.google.com/kubernetes-engine/docs/tutorials/guestbook
apiVersion: apps/v1
kind: Deployment
metadata:
name: redis-follower
labels:
app: redis
role: follower
tier: backend
spec:
replicas: 2
selector:
matchLabels:
app: redis
template:
metadata:
labels:
app: redis
role: follower
tier: backend
spec:
containers:
- name: follower
image: gcr.io/google_samples/gb-redis-follower:v2
resources:
requests:
cpu: 100m
memory: 100Mi
ports:
- containerPort: 6379
-
Apply the Redis Deployment from the following redis-follower-deployment.yaml file:
kubectl apply -f https://k8s.io/examples/application/guestbook/redis-follower-deployment.yaml
-
Verify that the two Redis follower replicas are running by querying the list of Pods:
kubectl get pods
The response should be similar to this:
NAME                             READY   STATUS    RESTARTS   AGE
redis-follower-dddfbdcc9-82sfr   1/1     Running   0          37s
redis-follower-dddfbdcc9-qrt5k   1/1     Running   0          38s
redis-leader-fb76b4755-xjr2n     1/1     Running   0          11m
Creating the Redis follower service
The guestbook application needs to communicate with the Redis followers to read data. To make the Redis followers discoverable, you must set up another Service.
# SOURCE: https://cloud.google.com/kubernetes-engine/docs/tutorials/guestbook
apiVersion: v1
kind: Service
metadata:
name: redis-follower
labels:
app: redis
role: follower
tier: backend
spec:
ports:
# the port that this service should serve on
- port: 6379
selector:
app: redis
role: follower
tier: backend
-
Apply the Redis Service from the following redis-follower-service.yaml file:
kubectl apply -f https://k8s.io/examples/application/guestbook/redis-follower-service.yaml
-
Query the list of Services to verify that the Redis Service is running:
kubectl get service
The response should be similar to this:
NAME             TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
kubernetes       ClusterIP   10.96.0.1       <none>        443/TCP    3d19h
redis-follower   ClusterIP   10.110.162.42   <none>        6379/TCP   9s
redis-leader     ClusterIP   10.103.78.24    <none>        6379/TCP   6m10s
Note: This manifest file creates a Service named redis-follower with a set of labels that match the labels previously defined, so the Service routes network traffic to the Redis follower Pods.
Set up and Expose the Guestbook Frontend
Now that you have the Redis storage of your guestbook up and running, start the guestbook web servers. Like the Redis followers, the frontend is deployed using a Kubernetes Deployment.
The guestbook app uses a PHP frontend. It is configured to communicate with either the Redis follower or leader Services, depending on whether the request is a read or a write. The frontend exposes a JSON interface, and serves a jQuery-Ajax-based UX.
Creating the Guestbook Frontend Deployment
# SOURCE: https://cloud.google.com/kubernetes-engine/docs/tutorials/guestbook
apiVersion: apps/v1
kind: Deployment
metadata:
name: frontend
spec:
replicas: 3
selector:
matchLabels:
app: guestbook
tier: frontend
template:
metadata:
labels:
app: guestbook
tier: frontend
spec:
containers:
- name: php-redis
image: gcr.io/google_samples/gb-frontend:v5
env:
- name: GET_HOSTS_FROM
value: "dns"
resources:
requests:
cpu: 100m
memory: 100Mi
ports:
- containerPort: 80
-
Apply the frontend Deployment from the frontend-deployment.yaml file:
kubectl apply -f https://k8s.io/examples/application/guestbook/frontend-deployment.yaml
-
Query the list of Pods to verify that the three frontend replicas are running:
kubectl get pods -l app=guestbook -l tier=frontend
The response should be similar to this:
NAME                        READY   STATUS    RESTARTS   AGE
frontend-85595f5bf9-5tqhb   1/1     Running   0          47s
frontend-85595f5bf9-qbzwm   1/1     Running   0          47s
frontend-85595f5bf9-zchwc   1/1     Running   0          47s
Creating the Frontend Service
The Redis Services you applied are only accessible within the Kubernetes cluster because the default type for a Service is ClusterIP. ClusterIP provides a single IP address for the set of Pods the Service is pointing to. This IP address is accessible only within the cluster.
If you want guests to be able to access your guestbook, you must configure the frontend Service to be externally visible, so a client can request the Service from outside the Kubernetes cluster. However, a Kubernetes user can use kubectl port-forward to access the Service even though it uses a ClusterIP.
Note: Some cloud providers support external load balancers. If your cloud provider supports load balancers and you want to use one, uncomment type: LoadBalancer in the manifest below.
# SOURCE: https://cloud.google.com/kubernetes-engine/docs/tutorials/guestbook
apiVersion: v1
kind: Service
metadata:
name: frontend
labels:
app: guestbook
tier: frontend
spec:
# if your cluster supports it, uncomment the following to automatically create
# an external load-balanced IP for the frontend service.
# type: LoadBalancer
ports:
# the port that this service should serve on
- port: 80
selector:
app: guestbook
tier: frontend
-
Apply the frontend Service from the frontend-service.yaml file:
kubectl apply -f https://k8s.io/examples/application/guestbook/frontend-service.yaml
-
Query the list of Services to verify that the frontend Service is running:
kubectl get services
The response should be similar to this:
NAME             TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
frontend         ClusterIP   10.97.28.230    <none>        80/TCP     19s
kubernetes       ClusterIP   10.96.0.1       <none>        443/TCP    3d19h
redis-follower   ClusterIP   10.110.162.42   <none>        6379/TCP   5m48s
redis-leader     ClusterIP   10.103.78.24    <none>        6379/TCP   11m
Viewing the Frontend Service via kubectl port-forward
-
Run the following command to forward port 8080 on your local machine to port 80 on the Service:
kubectl port-forward svc/frontend 8080:80
The response should be similar to this:
Forwarding from 127.0.0.1:8080 -> 80
Forwarding from [::1]:8080 -> 80
-
Load the page http://localhost:8080 in your browser to view your guestbook.
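If you prefer the command line, you can also fetch the page with curl while the port-forward is running; a quick optional check:
curl http://localhost:8080
The response is the guestbook's HTML.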
Viewing the Frontend Service via LoadBalancer
If you deployed the frontend-service.yaml manifest with type: LoadBalancer, you need to find the IP address to view your guestbook.
-
Run the following command to get the IP address for the frontend Service.
kubectl get service frontend
The response should be similar to this:
NAME       TYPE           CLUSTER-IP      EXTERNAL-IP      PORT(S)        AGE
frontend   LoadBalancer   10.51.242.136   109.197.92.229   80:32372/TCP   1m
-
Copy the external IP address, and load the page in your browser to view your guestbook.
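If you want to script this step, a jsonpath query can extract just the address; a sketch (on some providers the address appears under .hostname instead of .ip):
kubectl get service frontend -o jsonpath='{.status.loadBalancer.ingress[0].ip}'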
Scale the Web Frontend
You can scale up or down as needed because your servers are defined as a Service that uses a Deployment controller.
-
Run the following command to scale up the number of frontend Pods:
kubectl scale deployment frontend --replicas=5
-
Query the list of Pods to verify the number of frontend Pods running:
kubectl get pods
The response should look similar to this:
NAME                             READY   STATUS    RESTARTS   AGE
frontend-85595f5bf9-5df5m        1/1     Running   0          83s
frontend-85595f5bf9-7zmg5        1/1     Running   0          83s
frontend-85595f5bf9-cpskg        1/1     Running   0          15m
frontend-85595f5bf9-l2l54        1/1     Running   0          14m
frontend-85595f5bf9-l9c8z        1/1     Running   0          14m
redis-follower-dddfbdcc9-82sfr   1/1     Running   0          97m
redis-follower-dddfbdcc9-qrt5k   1/1     Running   0          97m
redis-leader-fb76b4755-xjr2n     1/1     Running   0          108m
-
Run the following command to scale down the number of frontend Pods:
kubectl scale deployment frontend --replicas=2
-
Query the list of Pods to verify the number of frontend Pods running:
kubectl get pods
The response should look similar to this:
NAME                             READY   STATUS    RESTARTS   AGE
frontend-85595f5bf9-cpskg        1/1     Running   0          16m
frontend-85595f5bf9-l9c8z        1/1     Running   0          15m
redis-follower-dddfbdcc9-82sfr   1/1     Running   0          98m
redis-follower-dddfbdcc9-qrt5k   1/1     Running   0          98m
redis-leader-fb76b4755-xjr2n     1/1     Running   0          109m
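You can also confirm the desired and ready replica counts on the Deployment itself, rather than counting Pods; an optional check:
kubectl get deployment frontend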
Cleaning up
Deleting the Deployments and Services also deletes any running Pods. Use labels to delete multiple resources with one command.
-
Run the following commands to delete all Pods, Deployments, and Services.
kubectl delete deployment -l app=redis
kubectl delete service -l app=redis
kubectl delete deployment frontend
kubectl delete service frontend
The response should look similar to this:
deployment.apps "redis-follower" deleted deployment.apps "redis-leader" deleted deployment.apps "frontend" deleted service "frontend" deleted
-
Query the list of Pods to verify that no Pods are running:
kubectl get pods
The response should look similar to this:
No resources found in default namespace.
What's next
- Complete the Kubernetes Basics Interactive Tutorials
- Use Kubernetes to create a blog using Persistent Volumes for MySQL and WordPress
- Read more about connecting applications
- Read more about Managing Resources
5.6 - Stateful Applications
5.6.1 - StatefulSet Basics
This tutorial provides an introduction to managing applications with StatefulSets. It demonstrates how to create, delete, scale, and update the Pods of StatefulSets.
Before you begin
Before you begin this tutorial, you should familiarize yourself with the following Kubernetes concepts:
- Pods
- Cluster DNS
- Headless Services
- PersistentVolumes
- PersistentVolume Provisioning
- StatefulSets
- The kubectl command line tool
Objectives
StatefulSets are intended to be used with stateful applications and distributed systems. However, the administration of stateful applications and distributed systems on Kubernetes is a broad, complex topic. In order to demonstrate the basic features of a StatefulSet, and not to conflate the former topic with the latter, you will deploy a simple web application using a StatefulSet.
After this tutorial, you will be familiar with the following.
- How to create a StatefulSet
- How a StatefulSet manages its Pods
- How to delete a StatefulSet
- How to scale a StatefulSet
- How to update a StatefulSet's Pods
Creating a StatefulSet
Begin by creating a StatefulSet using the example below. It is similar to the example presented in the StatefulSets concept. It creates a headless Service, nginx, to publish the IP addresses of Pods in the StatefulSet, web.
apiVersion: v1
kind: Service
metadata:
name: nginx
labels:
app: nginx
spec:
ports:
- port: 80
name: web
clusterIP: None
selector:
app: nginx
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: web
spec:
serviceName: "nginx"
replicas: 2
selector:
matchLabels:
app: nginx
template:
metadata:
labels:
app: nginx
spec:
containers:
- name: nginx
image: k8s.gcr.io/nginx-slim:0.8
ports:
- containerPort: 80
name: web
volumeMounts:
- name: www
mountPath: /usr/share/nginx/html
volumeClaimTemplates:
- metadata:
name: www
spec:
accessModes: [ "ReadWriteOnce" ]
resources:
requests:
storage: 1Gi
Download the example above, and save it to a file named web.yaml.
You will need to use two terminal windows. In the first terminal, use kubectl get to watch the creation of the StatefulSet's Pods.
kubectl get pods -w -l app=nginx
In the second terminal, use kubectl apply to create the headless Service and StatefulSet defined in web.yaml.
kubectl apply -f web.yaml
service/nginx created
statefulset.apps/web created
The command above creates two Pods, each running an NGINX webserver. Get the nginx Service...
kubectl get service nginx
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
nginx ClusterIP None <none> 80/TCP 12s
...then get the web StatefulSet, to verify that both were created successfully:
kubectl get statefulset web
NAME DESIRED CURRENT AGE
web 2 1 20s
Ordered Pod Creation
For a StatefulSet with n replicas, when Pods are being deployed, they are
created sequentially, ordered from {0..n-1}. Examine the output of the
kubectl get
command in the first terminal. Eventually, the output will
look like the example below.
kubectl get pods -w -l app=nginx
NAME READY STATUS RESTARTS AGE
web-0 0/1 Pending 0 0s
web-0 0/1 Pending 0 0s
web-0 0/1 ContainerCreating 0 0s
web-0 1/1 Running 0 19s
web-1 0/1 Pending 0 0s
web-1 0/1 Pending 0 0s
web-1 0/1 ContainerCreating 0 0s
web-1 1/1 Running 0 18s
Notice that the web-1 Pod is not launched until the web-0 Pod is Running (see Pod Phase) and Ready (see type in Pod Conditions).
Pods in a StatefulSet
Pods in a StatefulSet have a unique ordinal index and a stable network identity.
Examining the Pod's Ordinal Index
Get the StatefulSet's Pods:
kubectl get pods -l app=nginx
NAME READY STATUS RESTARTS AGE
web-0 1/1 Running 0 1m
web-1 1/1 Running 0 1m
As mentioned in the StatefulSets concept, the Pods in a StatefulSet have a sticky, unique identity. This identity is based on a unique ordinal index that is assigned to each Pod by the StatefulSet controller. The Pods' names take the form <statefulset name>-<ordinal index>. Since the web StatefulSet has two replicas, it creates two Pods, web-0 and web-1.
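A one-liner to list just the generated Pod names, as an optional check of the <statefulset name>-<ordinal index> pattern:
kubectl get pods -l app=nginx -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}'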
Using Stable Network Identities
Each Pod has a stable hostname based on its ordinal index. Use kubectl exec to execute the hostname command in each Pod:
for i in 0 1; do kubectl exec "web-$i" -- sh -c 'hostname'; done
web-0
web-1
Use kubectl run to execute a container that provides the nslookup command from the dnsutils package. Using nslookup on the Pods' hostnames, you can examine their in-cluster DNS addresses:
kubectl run -i --tty --image busybox:1.28 dns-test --restart=Never --rm
which starts a new shell. In that new shell, run:
# Run this in the dns-test container shell
nslookup web-0.nginx
The output is similar to:
Server: 10.0.0.10
Address 1: 10.0.0.10 kube-dns.kube-system.svc.cluster.local
Name: web-0.nginx
Address 1: 10.244.1.6
nslookup web-1.nginx
Server: 10.0.0.10
Address 1: 10.0.0.10 kube-dns.kube-system.svc.cluster.local
Name: web-1.nginx
Address 1: 10.244.2.6
(and now exit the container shell: exit)
The CNAME of the headless service points to SRV records (one for each Pod that is Running and Ready). The SRV records point to A record entries that contain the Pods' IP addresses.
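You can also resolve the headless Service name itself; because it is headless (clusterIP: None), the lookup returns the A records of the Ready Pods rather than a single cluster IP. An optional check from a throwaway Pod, not part of the original steps:
kubectl run -i --rm dns-check --image=busybox:1.28 --restart=Never -- nslookup nginx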
In one terminal, watch the StatefulSet's Pods:
kubectl get pod -w -l app=nginx
In a second terminal, use kubectl delete to delete all the Pods in the StatefulSet:
kubectl delete pod -l app=nginx
pod "web-0" deleted
pod "web-1" deleted
Wait for the StatefulSet to restart them, and for both Pods to transition to Running and Ready:
kubectl get pod -w -l app=nginx
NAME READY STATUS RESTARTS AGE
web-0 0/1 ContainerCreating 0 0s
NAME READY STATUS RESTARTS AGE
web-0 1/1 Running 0 2s
web-1 0/1 Pending 0 0s
web-1 0/1 Pending 0 0s
web-1 0/1 ContainerCreating 0 0s
web-1 1/1 Running 0 34s
Use kubectl exec and kubectl run to view the Pods' hostnames and in-cluster DNS entries. First, view the Pods' hostnames:
for i in 0 1; do kubectl exec web-$i -- sh -c 'hostname'; done
web-0
web-1
Then, run:
kubectl run -i --tty --image busybox:1.28 dns-test --restart=Never --rm /bin/sh
which starts a new shell.
In that new shell, run:
# Run this in the dns-test container shell
nslookup web-0.nginx
The output is similar to:
Server: 10.0.0.10
Address 1: 10.0.0.10 kube-dns.kube-system.svc.cluster.local
Name: web-0.nginx
Address 1: 10.244.1.7
nslookup web-1.nginx
Server: 10.0.0.10
Address 1: 10.0.0.10 kube-dns.kube-system.svc.cluster.local
Name: web-1.nginx
Address 1: 10.244.2.8
(and now exit the container shell: exit)
The Pods' ordinals, hostnames, SRV records, and A record names have not changed, but the IP addresses associated with the Pods may have changed. In the cluster used for this tutorial, they have. This is why it is important not to configure other applications to connect to Pods in a StatefulSet by IP address.
If you need to find and connect to the active members of a StatefulSet, you should query the CNAME of the headless Service (nginx.default.svc.cluster.local). The SRV records associated with the CNAME will contain only the Pods in the StatefulSet that are Running and Ready.
If your application already implements connection logic that tests for liveness and readiness, you can use the SRV records of the Pods (web-0.nginx.default.svc.cluster.local, web-1.nginx.default.svc.cluster.local), as they are stable, and your application will be able to discover the Pods' addresses when they transition to Running and Ready.
Writing to Stable Storage
Get the PersistentVolumeClaims for web-0 and web-1:
kubectl get pvc -l app=nginx
The output is similar to:
NAME STATUS VOLUME CAPACITY ACCESSMODES AGE
www-web-0 Bound pvc-15c268c7-b507-11e6-932f-42010a800002 1Gi RWO 48s
www-web-1 Bound pvc-15c79307-b507-11e6-932f-42010a800002 1Gi RWO 48s
The StatefulSet controller created two PersistentVolumeClaims that are bound to two PersistentVolumes.
As the cluster used in this tutorial is configured to dynamically provision PersistentVolumes, the PersistentVolumes were created and bound automatically.
The NGINX webserver, by default, serves an index file from /usr/share/nginx/html/index.html. The volumeMounts field in the StatefulSet's spec ensures that the /usr/share/nginx/html directory is backed by a PersistentVolume.
Write the Pods' hostnames to their index.html files and verify that the NGINX webservers serve the hostnames:
for i in 0 1; do kubectl exec "web-$i" -- sh -c 'echo "$(hostname)" > /usr/share/nginx/html/index.html'; done
for i in 0 1; do kubectl exec -i -t "web-$i" -- curl http://localhost/; done
web-0
web-1
If you instead see 403 Forbidden responses for the above curl command, you will need to fix the permissions of the directory mounted by the volumeMounts (due to a bug when using hostPath volumes), by running:
for i in 0 1; do kubectl exec web-$i -- chmod 755 /usr/share/nginx/html; done
before retrying the curl command above.
In one terminal, watch the StatefulSet's Pods:
kubectl get pod -w -l app=nginx
In a second terminal, delete all of the StatefulSet's Pods:
kubectl delete pod -l app=nginx
pod "web-0" deleted
pod "web-1" deleted
Examine the output of the kubectl get command in the first terminal, and wait for all of the Pods to transition to Running and Ready.
kubectl get pod -w -l app=nginx
NAME READY STATUS RESTARTS AGE
web-0 0/1 ContainerCreating 0 0s
NAME READY STATUS RESTARTS AGE
web-0 1/1 Running 0 2s
web-1 0/1 Pending 0 0s
web-1 0/1 Pending 0 0s
web-1 0/1 ContainerCreating 0 0s
web-1 1/1 Running 0 34s
Verify the web servers continue to serve their hostnames:
for i in 0 1; do kubectl exec -i -t "web-$i" -- curl http://localhost/; done
web-0
web-1
Even though web-0 and web-1 were rescheduled, they continue to serve their hostnames because the PersistentVolumes associated with their PersistentVolumeClaims are remounted to their volumeMounts. No matter what node web-0 and web-1 are scheduled on, their PersistentVolumes will be mounted to the appropriate mount points.
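To see which PersistentVolume backs a given claim, you can read the claim's volumeName; an optional check:
kubectl get pvc www-web-0 -o jsonpath='{.spec.volumeName}'
This prints the name of the bound PersistentVolume (for example, pvc-15c268c7-b507-11e6-932f-42010a800002 above).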
Scaling a StatefulSet
Scaling a StatefulSet refers to increasing or decreasing the number of replicas.
This is accomplished by updating the replicas field. You can use either kubectl scale or kubectl patch to scale a StatefulSet.
Scaling Up
In one terminal window, watch the Pods in the StatefulSet:
kubectl get pods -w -l app=nginx
In another terminal window, use kubectl scale to scale the number of replicas to 5:
kubectl scale sts web --replicas=5
statefulset.apps/web scaled
Examine the output of the kubectl get command in the first terminal, and wait for the three additional Pods to transition to Running and Ready.
kubectl get pods -w -l app=nginx
NAME READY STATUS RESTARTS AGE
web-0 1/1 Running 0 2h
web-1 1/1 Running 0 2h
NAME READY STATUS RESTARTS AGE
web-2 0/1 Pending 0 0s
web-2 0/1 Pending 0 0s
web-2 0/1 ContainerCreating 0 0s
web-2 1/1 Running 0 19s
web-3 0/1 Pending 0 0s
web-3 0/1 Pending 0 0s
web-3 0/1 ContainerCreating 0 0s
web-3 1/1 Running 0 18s
web-4 0/1 Pending 0 0s
web-4 0/1 Pending 0 0s
web-4 0/1 ContainerCreating 0 0s
web-4 1/1 Running 0 19s
The StatefulSet controller scaled the number of replicas. As with StatefulSet creation, the StatefulSet controller created each Pod sequentially with respect to its ordinal index, and it waited for each Pod's predecessor to be Running and Ready before launching the subsequent Pod.
Scaling Down
In one terminal, watch the StatefulSet's Pods:
kubectl get pods -w -l app=nginx
In another terminal, use kubectl patch to scale the StatefulSet back down to three replicas:
kubectl patch sts web -p '{"spec":{"replicas":3}}'
statefulset.apps/web patched
Wait for web-4 and web-3 to transition to Terminating.
kubectl get pods -w -l app=nginx
NAME READY STATUS RESTARTS AGE
web-0 1/1 Running 0 3h
web-1 1/1 Running 0 3h
web-2 1/1 Running 0 55s
web-3 1/1 Running 0 36s
web-4 0/1 ContainerCreating 0 18s
NAME READY STATUS RESTARTS AGE
web-4 1/1 Running 0 19s
web-4 1/1 Terminating 0 24s
web-4 1/1 Terminating 0 24s
web-3 1/1 Terminating 0 42s
web-3 1/1 Terminating 0 42s
Ordered Pod Termination
The controller deleted one Pod at a time, in reverse order with respect to its ordinal index, and it waited for each to be completely shut down before deleting the next.
Get the StatefulSet's PersistentVolumeClaims:
kubectl get pvc -l app=nginx
NAME STATUS VOLUME CAPACITY ACCESSMODES AGE
www-web-0 Bound pvc-15c268c7-b507-11e6-932f-42010a800002 1Gi RWO 13h
www-web-1 Bound pvc-15c79307-b507-11e6-932f-42010a800002 1Gi RWO 13h
www-web-2 Bound pvc-e1125b27-b508-11e6-932f-42010a800002 1Gi RWO 13h
www-web-3 Bound pvc-e1176df6-b508-11e6-932f-42010a800002 1Gi RWO 13h
www-web-4 Bound pvc-e11bb5f8-b508-11e6-932f-42010a800002 1Gi RWO 13h
There are still five PersistentVolumeClaims and five PersistentVolumes. When exploring a Pod's stable storage, we saw that the PersistentVolumes mounted to the Pods of a StatefulSet are not deleted when the StatefulSet's Pods are deleted. This is still true when Pod deletion is caused by scaling the StatefulSet down.
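If you are sure you no longer need the data from the scaled-down Pods, you can remove the leftover claims by hand; note that, depending on the reclaim policy, this may delete the underlying volumes and their data. A sketch, not part of the tutorial's steps:
kubectl delete pvc www-web-3 www-web-4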
Updating StatefulSets
In Kubernetes 1.7 and later, the StatefulSet controller supports automated updates. The strategy used is determined by the spec.updateStrategy field of the StatefulSet API object. This feature can be used to upgrade the container images, resource requests and/or limits, labels, and annotations of the Pods in a StatefulSet. There are two valid update strategies, RollingUpdate and OnDelete. The RollingUpdate update strategy is the default for StatefulSets.
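You can read the strategy currently in effect straight from the object; an optional check:
kubectl get sts web -o jsonpath='{.spec.updateStrategy.type}'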
Rolling Update
The RollingUpdate update strategy will update all Pods in a StatefulSet, in reverse ordinal order, while respecting the StatefulSet guarantees.
Patch the web StatefulSet to apply the RollingUpdate update strategy:
kubectl patch statefulset web -p '{"spec":{"updateStrategy":{"type":"RollingUpdate"}}}'
statefulset.apps/web patched
In one terminal window, patch the web StatefulSet to change the container image again:
kubectl patch statefulset web --type='json' -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/image", "value":"gcr.io/google_containers/nginx-slim:0.8"}]'
statefulset.apps/web patched
In another terminal, watch the Pods in the StatefulSet:
kubectl get pod -l app=nginx -w
The output is similar to:
NAME READY STATUS RESTARTS AGE
web-0 1/1 Running 0 7m
web-1 1/1 Running 0 7m
web-2 1/1 Running 0 8m
web-2 1/1 Terminating 0 8m
web-2 1/1 Terminating 0 8m
web-2 0/1 Terminating 0 8m
web-2 0/1 Terminating 0 8m
web-2 0/1 Terminating 0 8m
web-2 0/1 Terminating 0 8m
web-2 0/1 Pending 0 0s
web-2 0/1 Pending 0 0s
web-2 0/1 ContainerCreating 0 0s
web-2 1/1 Running 0 19s
web-1 1/1 Terminating 0 8m
web-1 0/1 Terminating 0 8m
web-1 0/1 Terminating 0 8m
web-1 0/1 Terminating 0 8m
web-1 0/1 Pending 0 0s
web-1 0/1 Pending 0 0s
web-1 0/1 ContainerCreating 0 0s
web-1 1/1 Running 0 6s
web-0 1/1 Terminating 0 7m
web-0 1/1 Terminating 0 7m
web-0 0/1 Terminating 0 7m
web-0 0/1 Terminating 0 7m
web-0 0/1 Terminating 0 7m
web-0 0/1 Terminating 0 7m
web-0 0/1 Pending 0 0s
web-0 0/1 Pending 0 0s
web-0 0/1 ContainerCreating 0 0s
web-0 1/1 Running 0 10s
The Pods in the StatefulSet are updated in reverse ordinal order. The StatefulSet controller terminates each Pod, and waits for it to transition to Running and Ready prior to updating the next Pod. Note that, even though the StatefulSet controller will not proceed to update the next Pod until its ordinal successor is Running and Ready, it will restore any Pod that fails during the update to its current version.
Pods that have already received the update will be restored to the updated version, and Pods that have not yet received the update will be restored to the previous version. In this way, the controller attempts to continue to keep the application healthy and the update consistent in the presence of intermittent failures.
Get the Pods to view their container images:
for p in 0 1 2; do kubectl get pod "web-$p" --template '{{range $i, $c := .spec.containers}}{{$c.image}}{{end}}'; echo; done
k8s.gcr.io/nginx-slim:0.8
k8s.gcr.io/nginx-slim:0.8
k8s.gcr.io/nginx-slim:0.8
All the Pods in the StatefulSet are now running the previous container image.
Note: Use kubectl rollout status sts/<name> to view the status of a rolling update to a StatefulSet.
Staging an Update
You can stage an update to a StatefulSet by using the partition parameter of the RollingUpdate update strategy. A staged update will keep all of the Pods in the StatefulSet at the current version while allowing mutations to the StatefulSet's .spec.template.
Patch the web StatefulSet to add a partition to the updateStrategy field:
kubectl patch statefulset web -p '{"spec":{"updateStrategy":{"type":"RollingUpdate","rollingUpdate":{"partition":3}}}}'
statefulset.apps/web patched
Patch the StatefulSet again to change the container's image:
kubectl patch statefulset web --type='json' -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/image", "value":"k8s.gcr.io/nginx-slim:0.7"}]'
statefulset.apps/web patched
Delete a Pod in the StatefulSet:
kubectl delete pod web-2
pod "web-2" deleted
Wait for the Pod to be Running and Ready.
kubectl get pod -l app=nginx -w
NAME READY STATUS RESTARTS AGE
web-0 1/1 Running 0 4m
web-1 1/1 Running 0 4m
web-2 0/1 ContainerCreating 0 11s
web-2 1/1 Running 0 18s
Get the Pod's container image:
kubectl get pod web-2 --template '{{range $i, $c := .spec.containers}}{{$c.image}}{{end}}'
k8s.gcr.io/nginx-slim:0.8
Notice that, even though the update strategy is RollingUpdate, the StatefulSet restored the Pod with its original container. This is because the ordinal of the Pod is less than the partition specified by the updateStrategy.
Rolling Out a Canary
You can roll out a canary to test a modification by decrementing the partition you specified above.
Patch the StatefulSet to decrement the partition:
kubectl patch statefulset web -p '{"spec":{"updateStrategy":{"type":"RollingUpdate","rollingUpdate":{"partition":2}}}}'
statefulset.apps/web patched
Wait for web-2 to be Running and Ready.
kubectl get pod -l app=nginx -w
NAME READY STATUS RESTARTS AGE
web-0 1/1 Running 0 4m
web-1 1/1 Running 0 4m
web-2 0/1 ContainerCreating 0 11s
web-2 1/1 Running 0 18s
Get the Pod's container:
kubectl get pod web-2 --template '{{range $i, $c := .spec.containers}}{{$c.image}}{{end}}'
k8s.gcr.io/nginx-slim:0.7
When you changed the partition, the StatefulSet controller automatically updated the web-2 Pod because the Pod's ordinal was greater than or equal to the partition.
Delete the web-1 Pod:
kubectl delete pod web-1
pod "web-1" deleted
Wait for the web-1 Pod to be Running and Ready.
kubectl get pod -l app=nginx -w
The output is similar to:
NAME READY STATUS RESTARTS AGE
web-0 1/1 Running 0 6m
web-1 0/1 Terminating 0 6m
web-2 1/1 Running 0 2m
web-1 0/1 Terminating 0 6m
web-1 0/1 Terminating 0 6m
web-1 0/1 Terminating 0 6m
web-1 0/1 Pending 0 0s
web-1 0/1 Pending 0 0s
web-1 0/1 ContainerCreating 0 0s
web-1 1/1 Running 0 18s
Get the web-1 Pod's container image:
kubectl get pod web-1 --template '{{range $i, $c := .spec.containers}}{{$c.image}}{{end}}'
k8s.gcr.io/nginx-slim:0.8
web-1 was restored to its original configuration because the Pod's ordinal was less than the partition. When a partition is specified, all Pods with an ordinal that is greater than or equal to the partition will be updated when the StatefulSet's .spec.template is updated. If a Pod that has an ordinal less than the partition is deleted or otherwise terminated, it will be restored to its original configuration.
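You can read the current partition back from the object; an optional check:
kubectl get sts web -o jsonpath='{.spec.updateStrategy.rollingUpdate.partition}'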
Phased Roll Outs
You can perform a phased roll out (e.g. a linear, geometric, or exponential roll out) using a partitioned rolling update in a similar manner to how you rolled out a canary. To perform a phased roll out, set the partition to the ordinal at which you want the controller to pause the update.
The partition is currently set to 2. Set the partition to 0:
kubectl patch statefulset web -p '{"spec":{"updateStrategy":{"type":"RollingUpdate","rollingUpdate":{"partition":0}}}}'
statefulset.apps/web patched
Wait for all of the Pods in the StatefulSet to become Running and Ready.
kubectl get pod -l app=nginx -w
The output is similar to:
NAME READY STATUS RESTARTS AGE
web-0 1/1 Running 0 3m
web-1 0/1 ContainerCreating 0 11s
web-2 1/1 Running 0 2m
web-1 1/1 Running 0 18s
web-0 1/1 Terminating 0 3m
web-0 1/1 Terminating 0 3m
web-0 0/1 Terminating 0 3m
web-0 0/1 Terminating 0 3m
web-0 0/1 Terminating 0 3m
web-0 0/1 Terminating 0 3m
web-0 0/1 Pending 0 0s
web-0 0/1 Pending 0 0s
web-0 0/1 ContainerCreating 0 0s
web-0 1/1 Running 0 3s
Get the container image details for the Pods in the StatefulSet:
for p in 0 1 2; do kubectl get pod "web-$p" --template '{{range $i, $c := .spec.containers}}{{$c.image}}{{end}}'; echo; done
k8s.gcr.io/nginx-slim:0.7
k8s.gcr.io/nginx-slim:0.7
k8s.gcr.io/nginx-slim:0.7
By moving the partition to 0, you allowed the StatefulSet to continue the update process.
On Delete
The OnDelete update strategy implements the legacy (1.6 and prior) behavior. When you select this update strategy, the StatefulSet controller will not automatically update Pods when a modification is made to the StatefulSet's .spec.template field. This strategy can be selected by setting the .spec.updateStrategy.type to OnDelete.
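A sketch of selecting this strategy with kubectl patch (not part of the tutorial's steps); setting rollingUpdate to null clears the partition configuration, which is only valid for the RollingUpdate strategy:
kubectl patch statefulset web -p '{"spec":{"updateStrategy":{"type":"OnDelete","rollingUpdate":null}}}'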
Deleting StatefulSets
StatefulSet supports both Non-Cascading and Cascading deletion. In a Non-Cascading Delete, the StatefulSet's Pods are not deleted when the StatefulSet is deleted. In a Cascading Delete, both the StatefulSet and its Pods are deleted.
Non-Cascading Delete
In one terminal window, watch the Pods in the StatefulSet.
kubectl get pods -w -l app=nginx
Use kubectl delete to delete the StatefulSet. Make sure to supply the --cascade=orphan parameter to the command. This parameter tells Kubernetes to only delete the StatefulSet, and to not delete any of its Pods.
kubectl delete statefulset web --cascade=orphan
statefulset.apps "web" deleted
Get the Pods, to examine their status:
kubectl get pods -l app=nginx
NAME READY STATUS RESTARTS AGE
web-0 1/1 Running 0 6m
web-1 1/1 Running 0 7m
web-2 1/1 Running 0 5m
Even though web has been deleted, all of the Pods are still Running and Ready.
Delete web-0:
kubectl delete pod web-0
pod "web-0" deleted
Get the StatefulSet's Pods:
kubectl get pods -l app=nginx
NAME READY STATUS RESTARTS AGE
web-1 1/1 Running 0 10m
web-2 1/1 Running 0 7m
As the web StatefulSet has been deleted, web-0 has not been relaunched.
In one terminal, watch the StatefulSet's Pods.
kubectl get pods -w -l app=nginx
In a second terminal, recreate the StatefulSet. Note that, unless you deleted the nginx Service (which you should not have), you will see an error indicating that the Service already exists.
kubectl apply -f web.yaml
statefulset.apps/web created
service/nginx unchanged
Ignore the error. It only indicates that an attempt was made to create the nginx headless Service even though that Service already exists.
Examine the output of the kubectl get
command running in the first terminal.
kubectl get pods -w -l app=nginx
NAME READY STATUS RESTARTS AGE
web-1 1/1 Running 0 16m
web-2 1/1 Running 0 2m
NAME READY STATUS RESTARTS AGE
web-0 0/1 Pending 0 0s
web-0 0/1 Pending 0 0s
web-0 0/1 ContainerCreating 0 0s
web-0 1/1 Running 0 18s
web-2 1/1 Terminating 0 3m
web-2 0/1 Terminating 0 3m
web-2 0/1 Terminating 0 3m
web-2 0/1 Terminating 0 3m
When the web StatefulSet was recreated, it first relaunched web-0. Since web-1 was already Running and Ready, when web-0 transitioned to Running and Ready, it adopted this Pod. Since you recreated the StatefulSet with replicas equal to 2, once web-0 had been recreated, and once web-1 had been determined to already be Running and Ready, web-2 was terminated.
Let's take another look at the contents of the index.html file served by the Pods' webservers:
for i in 0 1; do kubectl exec -i -t "web-$i" -- curl http://localhost/; done
web-0
web-1
Even though you deleted both the StatefulSet and the web-0 Pod, it still serves the hostname originally entered into its index.html file. This is because the StatefulSet never deletes the PersistentVolumes associated with a Pod. When you recreated the StatefulSet and it relaunched web-0, its original PersistentVolume was remounted.
Cascading Delete
In one terminal window, watch the Pods in the StatefulSet.
kubectl get pods -w -l app=nginx
In another terminal, delete the StatefulSet again. This time, omit the --cascade=orphan parameter.
kubectl delete statefulset web
statefulset.apps "web" deleted
Examine the output of the kubectl get command running in the first terminal, and wait for all of the Pods to transition to Terminating.
kubectl get pods -w -l app=nginx
NAME READY STATUS RESTARTS AGE
web-0 1/1 Running 0 11m
web-1 1/1 Running 0 27m
NAME READY STATUS RESTARTS AGE
web-0 1/1 Terminating 0 12m
web-1 1/1 Terminating 0 29m
web-0 0/1 Terminating 0 12m
web-0 0/1 Terminating 0 12m
web-0 0/1 Terminating 0 12m
web-1 0/1 Terminating 0 29m
web-1 0/1 Terminating 0 29m
web-1 0/1 Terminating 0 29m
As you saw in the Scaling Down section, the Pods are terminated one at a time, with respect to the reverse order of their ordinal indices. Before terminating a Pod, the StatefulSet controller waits for the Pod's successor to be completely terminated.
Note: A cascading delete removes the StatefulSet and its Pods, but it does not delete the headless Service associated with the StatefulSet. You must delete the nginx Service manually.
kubectl delete service nginx
service "nginx" deleted
Recreate the StatefulSet and headless Service one more time:
kubectl apply -f web.yaml
service/nginx created
statefulset.apps/web created
When all of the StatefulSet's Pods transition to Running and Ready, retrieve the contents of their index.html files:
for i in 0 1; do kubectl exec -i -t "web-$i" -- curl http://localhost/; done
web-0
web-1
Even though you completely deleted the StatefulSet, and all of its Pods, the Pods are recreated with their PersistentVolumes mounted, and web-0 and web-1 continue to serve their hostnames.
Finally, delete the nginx Service...
kubectl delete service nginx
service "nginx" deleted
...and the web StatefulSet:
kubectl delete statefulset web
statefulset "web" deleted
Pod Management Policy
For some distributed systems, the StatefulSet ordering guarantees are unnecessary and/or undesirable. These systems require only uniqueness and identity. To address this, in Kubernetes 1.7, we introduced .spec.podManagementPolicy to the StatefulSet API object.
OrderedReady Pod Management
OrderedReady pod management is the default for StatefulSets. It tells the StatefulSet controller to respect the ordering guarantees demonstrated above.
Parallel Pod Management
Parallel pod management tells the StatefulSet controller to launch or terminate all Pods in parallel, and not to wait for Pods to become Running and Ready or completely terminated prior to launching or terminating another Pod. This option only affects the behavior for scaling operations. Updates are not affected.
apiVersion: v1
kind: Service
metadata:
name: nginx
labels:
app: nginx
spec:
ports:
- port: 80
name: web
clusterIP: None
selector:
app: nginx
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: web
spec:
serviceName: "nginx"
podManagementPolicy: "Parallel"
replicas: 2
selector:
matchLabels:
app: nginx
template:
metadata:
labels:
app: nginx
spec:
containers:
- name: nginx
image: k8s.gcr.io/nginx-slim:0.8
ports:
- containerPort: 80
name: web
volumeMounts:
- name: www
mountPath: /usr/share/nginx/html
volumeClaimTemplates:
- metadata:
name: www
spec:
accessModes: [ "ReadWriteOnce" ]
resources:
requests:
storage: 1Gi
Download the example above, and save it to a file named web-parallel.yaml.
This manifest is identical to the one you downloaded above, except that the .spec.podManagementPolicy of the web StatefulSet is set to Parallel.
In one terminal, watch the Pods in the StatefulSet.
kubectl get pod -l app=nginx -w
In another terminal, create the StatefulSet and Service in the manifest:
kubectl apply -f web-parallel.yaml
service/nginx created
statefulset.apps/web created
Examine the output of the kubectl get command that you executed in the first terminal.
kubectl get pod -l app=nginx -w
NAME READY STATUS RESTARTS AGE
web-0 0/1 Pending 0 0s
web-0 0/1 Pending 0 0s
web-1 0/1 Pending 0 0s
web-1 0/1 Pending 0 0s
web-0 0/1 ContainerCreating 0 0s
web-1 0/1 ContainerCreating 0 0s
web-0 1/1 Running 0 10s
web-1 1/1 Running 0 10s
The StatefulSet controller launched both web-0 and web-1 at the same time.
Keep the second terminal open, and, in another terminal window, scale the StatefulSet:
kubectl scale statefulset/web --replicas=4
statefulset.apps/web scaled
Examine the output of the terminal where the kubectl get command is running.
web-3 0/1 Pending 0 0s
web-3 0/1 Pending 0 0s
web-3 0/1 Pending 0 7s
web-3 0/1 ContainerCreating 0 7s
web-2 1/1 Running 0 10s
web-3 1/1 Running 0 26s
The StatefulSet launched two new Pods, and it did not wait for the first to become Running and Ready prior to launching the second.
Cleaning up
You should have two terminals open, ready for you to run kubectl commands as part of cleanup.
kubectl delete sts web
# sts is an abbreviation for statefulset
You can watch kubectl get to see those Pods being deleted.
kubectl get pod -l app=nginx -w
web-3 1/1 Terminating 0 9m
web-2 1/1 Terminating 0 9m
web-3 1/1 Terminating 0 9m
web-2 1/1 Terminating 0 9m
web-1 1/1 Terminating 0 44m
web-0 1/1 Terminating 0 44m
web-0 0/1 Terminating 0 44m
web-3 0/1 Terminating 0 9m
web-2 0/1 Terminating 0 9m
web-1 0/1 Terminating 0 44m
web-0 0/1 Terminating 0 44m
web-2 0/1 Terminating 0 9m
web-2 0/1 Terminating 0 9m
web-2 0/1 Terminating 0 9m
web-1 0/1 Terminating 0 44m
web-1 0/1 Terminating 0 44m
web-1 0/1 Terminating 0 44m
web-0 0/1 Terminating 0 44m
web-0 0/1 Terminating 0 44m
web-0 0/1 Terminating 0 44m
web-3 0/1 Terminating 0 9m
web-3 0/1 Terminating 0 9m
web-3 0/1 Terminating 0 9m
During deletion, a StatefulSet removes all Pods concurrently; it does not wait for a Pod's ordinal successor to terminate prior to deleting that Pod.
Close the terminal where the kubectl get command is running and delete the nginx Service:
kubectl delete svc nginx
You also need to delete the persistent storage media for the PersistentVolumes used in this tutorial.
Follow the necessary steps, based on your environment, storage configuration, and provisioning method, to ensure that all storage is reclaimed.
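If your StorageClass provisions volumes with the Delete reclaim policy, deleting the claims is usually enough to release the storage; a sketch, assuming the labels used throughout this tutorial:
kubectl delete pvc -l app=nginx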
5.6.2 - Example: Deploying WordPress and MySQL with Persistent Volumes
This tutorial shows you how to deploy a WordPress site and a MySQL database using Minikube. Both applications use PersistentVolumes and PersistentVolumeClaims to store data.
A PersistentVolume (PV) is a piece of storage in the cluster that has been manually provisioned by an administrator, or dynamically provisioned by Kubernetes using a StorageClass. A PersistentVolumeClaim (PVC) is a request for storage by a user that can be fulfilled by a PV. PersistentVolumes and PersistentVolumeClaims are independent from Pod lifecycles and preserve data through restarting, rescheduling, and even deleting Pods.
Objectives
- Create PersistentVolumeClaims and PersistentVolumes
- Create a kustomization.yaml with:
  - a Secret generator
  - MySQL resource configs
  - WordPress resource configs
- Apply the kustomization directory by kubectl apply -k ./
- Clean up
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
To check the version, enter kubectl version.
The example shown on this page works with kubectl 1.14 and above.
Download the following configuration files:
Create PersistentVolumeClaims and PersistentVolumes
MySQL and WordPress each require a PersistentVolume to store data. Their PersistentVolumeClaims will be created at the deployment step.
Many cluster environments have a default StorageClass installed. When a StorageClass is not specified in the PersistentVolumeClaim, the cluster's default StorageClass is used instead.
When a PersistentVolumeClaim is created, a PersistentVolume is dynamically provisioned based on the StorageClass configuration.
Warning: In local clusters, the default StorageClass uses the hostPath provisioner. hostPath volumes are only suitable for development and testing. With hostPath volumes, your data lives in /tmp on the node the Pod is scheduled onto and does not move between nodes. If a Pod dies and gets scheduled to another node in the cluster, or the node is rebooted, the data is lost.
Note: If you are bringing up a cluster that needs to use the hostPath provisioner, the --enable-hostpath-provisioner flag must be set in the controller-manager component.
Create a kustomization.yaml
Add a Secret generator
A Secret is an object that stores a piece of sensitive data like a password or key. Since 1.14, kubectl supports the management of Kubernetes objects using a kustomization file. You can create a Secret by generators in kustomization.yaml.
Add a Secret generator in kustomization.yaml from the following command. You will need to replace YOUR_PASSWORD with the password you want to use.
cat <<EOF >./kustomization.yaml
secretGenerator:
- name: mysql-pass
literals:
- password=YOUR_PASSWORD
EOF
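To preview the resources that the generator will produce without applying anything, you can render the kustomization; an optional check:
kubectl kustomize ./
The generated Secret name carries a content-based hash suffix (for example, mysql-pass-<hash>).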
Add resource configs for MySQL and WordPress
The following manifest describes a single-instance MySQL Deployment. The MySQL container mounts the PersistentVolume at /var/lib/mysql. The MYSQL_ROOT_PASSWORD environment variable sets the database password from the Secret.
apiVersion: v1
kind: Service
metadata:
name: wordpress-mysql
labels:
app: wordpress
spec:
ports:
- port: 3306
selector:
app: wordpress
tier: mysql
clusterIP: None
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: mysql-pv-claim
labels:
app: wordpress
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 20Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: wordpress-mysql
labels:
app: wordpress
spec:
selector:
matchLabels:
app: wordpress
tier: mysql
strategy:
type: Recreate
template:
metadata:
labels:
app: wordpress
tier: mysql
spec:
containers:
- image: mysql:5.6
name: mysql
env:
- name: MYSQL_ROOT_PASSWORD
valueFrom:
secretKeyRef:
name: mysql-pass
key: password
ports:
- containerPort: 3306
name: mysql
volumeMounts:
- name: mysql-persistent-storage
mountPath: /var/lib/mysql
volumes:
- name: mysql-persistent-storage
persistentVolumeClaim:
claimName: mysql-pv-claim
The following manifest describes a single-instance WordPress Deployment. The WordPress container mounts the PersistentVolume at /var/www/html for website data files. The WORDPRESS_DB_HOST environment variable sets the name of the MySQL Service defined above, and WordPress will access the database by the Service. The WORDPRESS_DB_PASSWORD environment variable sets the database password from the Secret kustomize generated.
apiVersion: v1
kind: Service
metadata:
name: wordpress
labels:
app: wordpress
spec:
ports:
- port: 80
selector:
app: wordpress
tier: frontend
type: LoadBalancer
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: wp-pv-claim
labels:
app: wordpress
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 20Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: wordpress
labels:
app: wordpress
spec:
selector:
matchLabels:
app: wordpress
tier: frontend
strategy:
type: Recreate
template:
metadata:
labels:
app: wordpress
tier: frontend
spec:
containers:
- image: wordpress:4.8-apache
name: wordpress
env:
- name: WORDPRESS_DB_HOST
value: wordpress-mysql
- name: WORDPRESS_DB_PASSWORD
valueFrom:
secretKeyRef:
name: mysql-pass
key: password
ports:
- containerPort: 80
name: wordpress
volumeMounts:
- name: wordpress-persistent-storage
mountPath: /var/www/html
volumes:
- name: wordpress-persistent-storage
persistentVolumeClaim:
claimName: wp-pv-claim
-
Download the MySQL deployment configuration file.
curl -LO https://k8s.io/examples/application/wordpress/mysql-deployment.yaml
-
Download the WordPress configuration file.
curl -LO https://k8s.io/examples/application/wordpress/wordpress-deployment.yaml
-
Add them to the kustomization.yaml file.
cat <<EOF >>./kustomization.yaml
resources:
- mysql-deployment.yaml
- wordpress-deployment.yaml
EOF
Apply and Verify
The kustomization.yaml contains all the resources for deploying a WordPress site and a MySQL database. You can apply the directory by running:
kubectl apply -k ./
Now you can verify that all objects exist.
-
Verify that the Secret exists by running the following command:
kubectl get secrets
The response should be like this:
NAME                    TYPE     DATA   AGE
mysql-pass-c57bb4t7mf   Opaque   1      9s
-
Verify that a PersistentVolume got dynamically provisioned.
kubectl get pvc
Note: It can take up to a few minutes for the PVs to be provisioned and bound.
The response should be like this:
NAME             STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
mysql-pv-claim   Bound    pvc-8cbd7b2e-4044-11e9-b2bb-42010a800002   20Gi       RWO            standard       77s
wp-pv-claim      Bound    pvc-8cd0df54-4044-11e9-b2bb-42010a800002   20Gi       RWO            standard       77s
-
Verify that the Pod is running by running the following command:
kubectl get pods
Note: It can take up to a few minutes for the Pod's Status to be RUNNING.
The response should be like this:
NAME                               READY   STATUS    RESTARTS   AGE
wordpress-mysql-1894417608-x5dzt   1/1     Running   0          40s
-
Verify that the Service is running by running the following command:
kubectl get services wordpress
The response should be like this:
NAME        TYPE           CLUSTER-IP   EXTERNAL-IP   PORT(S)        AGE
wordpress   LoadBalancer   10.0.0.89    <pending>     80:32406/TCP   4m
Note: Minikube can only expose Services through NodePort. The EXTERNAL-IP is always pending.
-
Run the following command to get the IP Address for the WordPress Service:
minikube service wordpress --url
The response should be like this:
http://1.2.3.4:32406
-
Copy the IP address, and load the page in your browser to view your site.
You should see the WordPress setup page similar to the following screenshot.
Either install WordPress by creating a username and password or delete your instance.
Cleaning up
-
Run the following command to delete your Secret, Deployments, Services and PersistentVolumeClaims:
kubectl delete -k ./
What's next
- Learn more about Introspection and Debugging
- Learn more about Jobs
- Learn more about Port Forwarding
- Learn how to Get a Shell to a Container
5.6.3 - Example: Deploying Cassandra with a StatefulSet
This tutorial shows you how to run Apache Cassandra on Kubernetes. Cassandra, a database, needs persistent storage to provide data durability (application state). In this example, a custom Cassandra seed provider lets the database discover new Cassandra instances as they join the Cassandra cluster.
StatefulSets make it easier to deploy stateful applications into your Kubernetes cluster. For more information on the features used in this tutorial, see StatefulSet.
Cassandra and Kubernetes both use the term node to mean a member of a cluster. In this tutorial, the Pods that belong to the StatefulSet are Cassandra nodes and are members of the Cassandra cluster (called a ring). When those Pods run in your Kubernetes cluster, the Kubernetes control plane schedules those Pods onto Kubernetes Nodes.
When a Cassandra node starts, it uses a seed list to bootstrap discovery of other nodes in the ring. This tutorial deploys a custom Cassandra seed provider that lets the database discover new Cassandra Pods as they appear inside your Kubernetes cluster.
Objectives
- Create and validate a Cassandra headless Service.
- Use a StatefulSet to create a Cassandra ring.
- Validate the StatefulSet.
- Modify the StatefulSet.
- Delete the StatefulSet and its Pods.
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds:
To complete this tutorial, you should already have a basic familiarity with Pods, Services, and StatefulSets.
Additional Minikube setup instructions
Minikube defaults to 2048MB of memory and 2 CPUs. Running Minikube with the default resource configuration results in insufficient resource errors during this tutorial. To avoid these errors, start Minikube with the following settings:
minikube start --memory 5120 --cpus=4
Creating a headless Service for Cassandra
In Kubernetes, a Service describes a set of Pods that perform the same task.
The following Service is used for DNS lookups between Cassandra Pods and clients within your cluster:
apiVersion: v1
kind: Service
metadata:
labels:
app: cassandra
name: cassandra
spec:
clusterIP: None
ports:
- port: 9042
selector:
app: cassandra
Create a Service to track all Cassandra StatefulSet members from the cassandra-service.yaml file:
kubectl apply -f https://k8s.io/examples/application/cassandra/cassandra-service.yaml
Validating (optional)
Get the Cassandra Service.
kubectl get svc cassandra
The response is:
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
cassandra ClusterIP None <none> 9042/TCP 45s
If you don't see a Service named cassandra, that means creation failed. Read Debug Services for help troubleshooting common issues.
Using a StatefulSet to create a Cassandra ring
The StatefulSet manifest, included below, creates a Cassandra ring that consists of three Pods.
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: cassandra
labels:
app: cassandra
spec:
serviceName: cassandra
replicas: 3
selector:
matchLabels:
app: cassandra
template:
metadata:
labels:
app: cassandra
spec:
terminationGracePeriodSeconds: 1800
containers:
- name: cassandra
image: gcr.io/google-samples/cassandra:v13
imagePullPolicy: Always
ports:
- containerPort: 7000
name: intra-node
- containerPort: 7001
name: tls-intra-node
- containerPort: 7199
name: jmx
- containerPort: 9042
name: cql
resources:
limits:
cpu: "500m"
memory: 1Gi
requests:
cpu: "500m"
memory: 1Gi
securityContext:
capabilities:
add:
- IPC_LOCK
lifecycle:
preStop:
exec:
command:
- /bin/sh
- -c
- nodetool drain
env:
- name: MAX_HEAP_SIZE
value: 512M
- name: HEAP_NEWSIZE
value: 100M
- name: CASSANDRA_SEEDS
value: "cassandra-0.cassandra.default.svc.cluster.local"
- name: CASSANDRA_CLUSTER_NAME
value: "K8Demo"
- name: CASSANDRA_DC
value: "DC1-K8Demo"
- name: CASSANDRA_RACK
value: "Rack1-K8Demo"
- name: POD_IP
valueFrom:
fieldRef:
fieldPath: status.podIP
readinessProbe:
exec:
command:
- /bin/bash
- -c
- /ready-probe.sh
initialDelaySeconds: 15
timeoutSeconds: 5
# These volume mounts are persistent. They are like inline claims,
# but not exactly because the names need to match exactly one of
# the stateful pod volumes.
volumeMounts:
- name: cassandra-data
mountPath: /cassandra_data
# These are converted to volume claims by the controller
# and mounted at the paths mentioned above.
# do not use these in production until ssd GCEPersistentDisk or other ssd pd
volumeClaimTemplates:
- metadata:
name: cassandra-data
spec:
accessModes: [ "ReadWriteOnce" ]
storageClassName: fast
resources:
requests:
storage: 1Gi
---
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
name: fast
provisioner: k8s.io/minikube-hostpath
parameters:
type: pd-ssd
Create the Cassandra StatefulSet from the cassandra-statefulset.yaml file:
# Use this if you are able to apply cassandra-statefulset.yaml unmodified
kubectl apply -f https://k8s.io/examples/application/cassandra/cassandra-statefulset.yaml
If you need to modify cassandra-statefulset.yaml to suit your cluster, download https://k8s.io/examples/application/cassandra/cassandra-statefulset.yaml and then apply that manifest, from the folder you saved the modified version into:
# Use this if you needed to modify cassandra-statefulset.yaml locally
kubectl apply -f cassandra-statefulset.yaml
Validating the Cassandra StatefulSet
-
Get the Cassandra StatefulSet:
kubectl get statefulset cassandra
The response should be similar to:
NAME        DESIRED   CURRENT   AGE
cassandra   3         0         13s
The StatefulSet resource deploys Pods sequentially.
-
Get the Pods to see the ordered creation status:
kubectl get pods -l="app=cassandra"
The response should be similar to:
NAME          READY   STATUS              RESTARTS   AGE
cassandra-0   1/1     Running             0          1m
cassandra-1   0/1     ContainerCreating   0          8s
It can take several minutes for all three Pods to deploy. Once they are deployed, the same command returns output similar to:
NAME          READY   STATUS    RESTARTS   AGE
cassandra-0   1/1     Running   0          10m
cassandra-1   1/1     Running   0          9m
cassandra-2   1/1     Running   0          8m
-
Run the Cassandra nodetool inside the first Pod, to display the status of the ring.
kubectl exec -it cassandra-0 -- nodetool status
The response should look something like:
Datacenter: DC1-K8Demo
======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load        Tokens  Owns (effective)  Host ID                               Rack
UN  172.17.0.5  83.57 KiB   32      74.0%             e2dd09e6-d9d3-477e-96c5-45094c08db0f  Rack1-K8Demo
UN  172.17.0.4  101.04 KiB  32      58.8%             f89d6835-3a42-4419-92b3-0e62cae1479c  Rack1-K8Demo
UN  172.17.0.6  84.74 KiB   32      67.1%             a6a1e8c2-3dc5-4417-b1a0-26507af2aaad  Rack1-K8Demo
Modifying the Cassandra StatefulSet
Use kubectl edit to modify the size of a Cassandra StatefulSet.
-
Run the following command:
kubectl edit statefulset cassandra
This command opens an editor in your terminal. The line you need to change is the replicas field. (An equivalent kubectl patch command is sketched after this list.) The following sample is an excerpt of the StatefulSet file:
# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: apps/v1
kind: StatefulSet
metadata:
  creationTimestamp: 2016-08-13T18:40:58Z
  generation: 1
  labels:
    app: cassandra
  name: cassandra
  namespace: default
  resourceVersion: "323"
  uid: 7a219483-6185-11e6-a910-42010a8a0fc0
spec:
  replicas: 3
- Change the number of replicas to 4, and then save the manifest.
The StatefulSet now scales to run with 4 Pods.
- Get the Cassandra StatefulSet to verify your change:
kubectl get statefulset cassandra
The response should be similar to:
NAME        DESIRED   CURRENT   AGE
cassandra   4         4         36m
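As an alternative to editing the object interactively, you can apply the same change with kubectl scale; this is an equivalent, non-interactive approach rather than a step from the tutorial:
kubectl scale statefulset cassandra --replicas=4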
Cleaning up
Deleting or scaling a StatefulSet down does not delete the volumes associated with the StatefulSet. This setting is for your safety because your data is more valuable than automatically purging all related StatefulSet resources.
- Run the following commands (chained together into a single command) to delete everything in the Cassandra StatefulSet:
grace=$(kubectl get pod cassandra-0 -o=jsonpath='{.spec.terminationGracePeriodSeconds}') \
  && kubectl delete statefulset -l app=cassandra \
  && echo "Sleeping ${grace} seconds" 1>&2 \
  && sleep $grace \
  && kubectl delete persistentvolumeclaim -l app=cassandra
- Run the following command to delete the Service you set up for Cassandra:
kubectl delete service -l app=cassandra
Cassandra container environment variables
The Pods in this tutorial use the gcr.io/google-samples/cassandra:v13 image from Google's container registry. The Docker image above is based on debian-base and includes OpenJDK 8. This image includes a standard Cassandra installation from the Apache Debian repo. By using environment variables you can change values that are inserted into cassandra.yaml.
Environment variable     | Default value
------------------------ | --------------
CASSANDRA_CLUSTER_NAME   | 'Test Cluster'
CASSANDRA_NUM_TOKENS     | 32
CASSANDRA_RPC_ADDRESS    | 0.0.0.0
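For example, the StatefulSet manifest above already overrides two of these in the container spec; overriding the cluster name follows the same pattern (the value K8Demo is the one used by this tutorial's manifest):
env:
  - name: CASSANDRA_CLUSTER_NAME
    value: "K8Demo"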
What's next
- Learn how to Scale a StatefulSet.
- Learn more about the KubernetesSeedProvider
- See more custom Seed Provider Configurations
5.6.4 - Running ZooKeeper, A Distributed System Coordinator
This tutorial demonstrates running Apache ZooKeeper on Kubernetes using StatefulSets, PodDisruptionBudgets, and PodAntiAffinity.
Before you begin
Before starting this tutorial, you should be familiar with the following Kubernetes concepts:
- Pods
- Cluster DNS
- Headless Services
- PersistentVolumes
- PersistentVolume Provisioning
- StatefulSets
- PodDisruptionBudgets
- PodAntiAffinity
- kubectl CLI
You must have a cluster with at least four nodes, and each node requires at least 2 CPUs and 4 GiB of memory. In this tutorial you will cordon and drain the cluster's nodes. This means that the cluster will terminate and evict all Pods on its nodes, and the nodes will temporarily become unschedulable. You should use a dedicated cluster for this tutorial, or you should ensure that the disruption you cause will not interfere with other tenants.
This tutorial assumes that you have configured your cluster to dynamically provision PersistentVolumes. If your cluster is not configured to do so, you will have to manually provision three 20 GiB volumes before starting this tutorial.
Objectives
After this tutorial, you will know the following.
- How to deploy a ZooKeeper ensemble using StatefulSet.
- How to consistently configure the ensemble.
- How to spread the deployment of ZooKeeper servers in the ensemble.
- How to use PodDisruptionBudgets to ensure service availability during planned maintenance.
ZooKeeper
Apache ZooKeeper is a distributed, open-source coordination service for distributed applications. ZooKeeper allows you to read, write, and observe updates to data. Data is organized in a file-system-like hierarchy and replicated to all ZooKeeper servers in the ensemble (a set of ZooKeeper servers). All operations on data are atomic and sequentially consistent. ZooKeeper ensures this by using the Zab consensus protocol to replicate a state machine across all servers in the ensemble.
The ensemble uses the Zab protocol to elect a leader, and the ensemble cannot write data until that election is complete. Once complete, the ensemble uses Zab to ensure that it replicates all writes to a quorum before it acknowledges and makes them visible to clients. Without respect to weighted quorums, a quorum is a majority component of the ensemble containing the current leader. For instance, if the ensemble has three servers, a component that contains the leader and one other server constitutes a quorum. If the ensemble can not achieve a quorum, the ensemble cannot write data.
ZooKeeper servers keep their entire state machine in memory, and write every mutation to a durable WAL (Write Ahead Log) on storage media. When a server crashes, it can recover its previous state by replaying the WAL. To prevent the WAL from growing without bound, ZooKeeper servers periodically snapshot their in-memory state to storage media. These snapshots can be loaded directly into memory, and all WAL entries that preceded the snapshot may be discarded.
Creating a ZooKeeper ensemble
The manifest below contains a Headless Service, a Service, a PodDisruptionBudget, and a StatefulSet.
apiVersion: v1
kind: Service
metadata:
  name: zk-hs
  labels:
    app: zk
spec:
  ports:
  - port: 2888
    name: server
  - port: 3888
    name: leader-election
  clusterIP: None
  selector:
    app: zk
---
apiVersion: v1
kind: Service
metadata:
  name: zk-cs
  labels:
    app: zk
spec:
  ports:
  - port: 2181
    name: client
  selector:
    app: zk
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: zk-pdb
spec:
  selector:
    matchLabels:
      app: zk
  maxUnavailable: 1
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: zk
spec:
  selector:
    matchLabels:
      app: zk
  serviceName: zk-hs
  replicas: 3
  updateStrategy:
    type: RollingUpdate
  podManagementPolicy: OrderedReady
  template:
    metadata:
      labels:
        app: zk
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: "app"
                operator: In
                values:
                - zk
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: kubernetes-zookeeper
        imagePullPolicy: Always
        image: "k8s.gcr.io/kubernetes-zookeeper:1.0-3.4.10"
        resources:
          requests:
            memory: "1Gi"
            cpu: "0.5"
        ports:
        - containerPort: 2181
          name: client
        - containerPort: 2888
          name: server
        - containerPort: 3888
          name: leader-election
        command:
        - sh
        - -c
        - "start-zookeeper \
          --servers=3 \
          --data_dir=/var/lib/zookeeper/data \
          --data_log_dir=/var/lib/zookeeper/data/log \
          --conf_dir=/opt/zookeeper/conf \
          --client_port=2181 \
          --election_port=3888 \
          --server_port=2888 \
          --tick_time=2000 \
          --init_limit=10 \
          --sync_limit=5 \
          --heap=512M \
          --max_client_cnxns=60 \
          --snap_retain_count=3 \
          --purge_interval=12 \
          --max_session_timeout=40000 \
          --min_session_timeout=4000 \
          --log_level=INFO"
        readinessProbe:
          exec:
            command:
            - sh
            - -c
            - "zookeeper-ready 2181"
          initialDelaySeconds: 10
          timeoutSeconds: 5
        livenessProbe:
          exec:
            command:
            - sh
            - -c
            - "zookeeper-ready 2181"
          initialDelaySeconds: 10
          timeoutSeconds: 5
        volumeMounts:
        - name: datadir
          mountPath: /var/lib/zookeeper
      securityContext:
        runAsUser: 1000
        fsGroup: 1000
  volumeClaimTemplates:
  - metadata:
      name: datadir
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 10Gi
Open a terminal, and use the kubectl apply command to create the manifest.
kubectl apply -f https://k8s.io/examples/application/zookeeper/zookeeper.yaml
This creates the zk-hs Headless Service, the zk-cs Service, the zk-pdb PodDisruptionBudget, and the zk StatefulSet.
service/zk-hs created
service/zk-cs created
poddisruptionbudget.policy/zk-pdb created
statefulset.apps/zk created
Use kubectl get to watch the StatefulSet controller create the StatefulSet's Pods.
kubectl get pods -w -l app=zk
Once the zk-2 Pod is Running and Ready, use CTRL-C to terminate kubectl.
NAME READY STATUS RESTARTS AGE
zk-0 0/1 Pending 0 0s
zk-0 0/1 Pending 0 0s
zk-0 0/1 ContainerCreating 0 0s
zk-0 0/1 Running 0 19s
zk-0 1/1 Running 0 40s
zk-1 0/1 Pending 0 0s
zk-1 0/1 Pending 0 0s
zk-1 0/1 ContainerCreating 0 0s
zk-1 0/1 Running 0 18s
zk-1 1/1 Running 0 40s
zk-2 0/1 Pending 0 0s
zk-2 0/1 Pending 0 0s
zk-2 0/1 ContainerCreating 0 0s
zk-2 0/1 Running 0 19s
zk-2 1/1 Running 0 40s
The StatefulSet controller creates three Pods, and each Pod has a container with a ZooKeeper server.
Facilitating leader election
Because there is no terminating algorithm for electing a leader in an anonymous network, Zab requires explicit membership configuration to perform leader election. Each server in the ensemble needs to have a unique identifier, all servers need to know the global set of identifiers, and each identifier needs to be associated with a network address.
Use kubectl exec
to get the hostnames
of the Pods in the zk
StatefulSet.
for i in 0 1 2; do kubectl exec zk-$i -- hostname; done
The StatefulSet controller provides each Pod with a unique hostname based on its ordinal index. The hostnames take the form of <statefulset name>-<ordinal index>. Because the replicas field of the zk StatefulSet is set to 3, the Set's controller creates three Pods with their hostnames set to zk-0, zk-1, and zk-2.
zk-0
zk-1
zk-2
The servers in a ZooKeeper ensemble use natural numbers as unique identifiers, and store each server's identifier in a file called myid
in the server's data directory.
To examine the contents of the myid
file for each server use the following command.
for i in 0 1 2; do echo "myid zk-$i";kubectl exec zk-$i -- cat /var/lib/zookeeper/data/myid; done
Because the identifiers are natural numbers and the ordinal indices are non-negative integers, you can generate an identifier by adding 1 to the ordinal.
myid zk-0
1
myid zk-1
2
myid zk-2
3
To get the Fully Qualified Domain Name (FQDN) of each Pod in the zk
StatefulSet use the following command.
for i in 0 1 2; do kubectl exec zk-$i -- hostname -f; done
The zk-hs Service creates a domain for all of the Pods: zk-hs.default.svc.cluster.local.
zk-0.zk-hs.default.svc.cluster.local
zk-1.zk-hs.default.svc.cluster.local
zk-2.zk-hs.default.svc.cluster.local
The A records in Kubernetes DNS resolve the FQDNs to the Pods' IP addresses. If Kubernetes reschedules the Pods, it will update the A records with the Pods' new IP addresses, but the A record names will not change.
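You can check this resolution yourself from inside the cluster; a quick sketch using a throwaway busybox Pod (the Pod name dns-test and the busybox:1.28 image are arbitrary choices, not part of this tutorial):
kubectl run -it --rm dns-test --image=busybox:1.28 --restart=Never -- nslookup zk-0.zk-hs.default.svc.cluster.local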
ZooKeeper stores its application configuration in a file named zoo.cfg. Use kubectl exec to view the contents of the zoo.cfg file in the zk-0 Pod.
kubectl exec zk-0 -- cat /opt/zookeeper/conf/zoo.cfg
In the server.1, server.2, and server.3 properties at the bottom of the file, the 1, 2, and 3 correspond to the identifiers in the ZooKeeper servers' myid files. They are set to the FQDNs for the Pods in the zk StatefulSet.
clientPort=2181
dataDir=/var/lib/zookeeper/data
dataLogDir=/var/lib/zookeeper/log
tickTime=2000
initLimit=10
syncLimit=2000
maxClientCnxns=60
minSessionTimeout= 4000
maxSessionTimeout= 40000
autopurge.snapRetainCount=3
autopurge.purgeInterval=0
server.1=zk-0.zk-hs.default.svc.cluster.local:2888:3888
server.2=zk-1.zk-hs.default.svc.cluster.local:2888:3888
server.3=zk-2.zk-hs.default.svc.cluster.local:2888:3888
Achieving consensus
Consensus protocols require that the identifiers of each participant be unique. No two participants in the Zab protocol should claim the same unique identifier. This is necessary to allow the processes in the system to agree on which processes have committed which data. If two Pods are launched with the same ordinal, two ZooKeeper servers would both identify themselves as the same server.
kubectl get pods -w -l app=zk
NAME READY STATUS RESTARTS AGE
zk-0 0/1 Pending 0 0s
zk-0 0/1 Pending 0 0s
zk-0 0/1 ContainerCreating 0 0s
zk-0 0/1 Running 0 19s
zk-0 1/1 Running 0 40s
zk-1 0/1 Pending 0 0s
zk-1 0/1 Pending 0 0s
zk-1 0/1 ContainerCreating 0 0s
zk-1 0/1 Running 0 18s
zk-1 1/1 Running 0 40s
zk-2 0/1 Pending 0 0s
zk-2 0/1 Pending 0 0s
zk-2 0/1 ContainerCreating 0 0s
zk-2 0/1 Running 0 19s
zk-2 1/1 Running 0 40s
The A records for each Pod are entered when the Pod becomes Ready. Therefore,
the FQDNs of the ZooKeeper servers will resolve to a single endpoint, and that
endpoint will be the unique ZooKeeper server claiming the identity configured
in its myid
file.
zk-0.zk-hs.default.svc.cluster.local
zk-1.zk-hs.default.svc.cluster.local
zk-2.zk-hs.default.svc.cluster.local
This ensures that the servers properties in the ZooKeepers' zoo.cfg files represent a correctly configured ensemble.
server.1=zk-0.zk-hs.default.svc.cluster.local:2888:3888
server.2=zk-1.zk-hs.default.svc.cluster.local:2888:3888
server.3=zk-2.zk-hs.default.svc.cluster.local:2888:3888
When the servers use the Zab protocol to attempt to commit a value, they will either achieve consensus and commit the value (if leader election has succeeded and at least two of the Pods are Running and Ready), or they will fail to do so (if either of the conditions are not met). No state will arise where one server acknowledges a write on behalf of another.
Sanity testing the ensemble
The most basic sanity test is to write data to one ZooKeeper server and to read the data from another.
The command below executes the zkCli.sh
script to write world
to the path /hello
on the zk-0
Pod in the ensemble.
kubectl exec zk-0 -- zkCli.sh create /hello world
WATCHER::
WatchedEvent state:SyncConnected type:None path:null
Created /hello
To get the data from the zk-1
Pod use the following command.
kubectl exec zk-1 -- zkCli.sh get /hello
The data that you created on zk-0
is available on all the servers in the
ensemble.
WATCHER::
WatchedEvent state:SyncConnected type:None path:null
world
cZxid = 0x100000002
ctime = Thu Dec 08 15:13:30 UTC 2016
mZxid = 0x100000002
mtime = Thu Dec 08 15:13:30 UTC 2016
pZxid = 0x100000002
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 5
numChildren = 0
Providing durable storage
As mentioned in the ZooKeeper Basics section, ZooKeeper commits all entries to a durable WAL, and periodically writes snapshots in memory state, to storage media. Using WALs to provide durability is a common technique for applications that use consensus protocols to achieve a replicated state machine.
Use the kubectl delete command to delete the zk StatefulSet.
kubectl delete statefulset zk
statefulset.apps "zk" deleted
Watch the termination of the Pods in the StatefulSet.
kubectl get pods -w -l app=zk
When zk-0 is fully terminated, use CTRL-C to terminate kubectl.
zk-2 1/1 Terminating 0 9m
zk-0 1/1 Terminating 0 11m
zk-1 1/1 Terminating 0 10m
zk-2 0/1 Terminating 0 9m
zk-2 0/1 Terminating 0 9m
zk-2 0/1 Terminating 0 9m
zk-1 0/1 Terminating 0 10m
zk-1 0/1 Terminating 0 10m
zk-1 0/1 Terminating 0 10m
zk-0 0/1 Terminating 0 11m
zk-0 0/1 Terminating 0 11m
zk-0 0/1 Terminating 0 11m
Reapply the manifest in zookeeper.yaml.
kubectl apply -f https://k8s.io/examples/application/zookeeper/zookeeper.yaml
This creates the zk
StatefulSet object, but the other API objects in the manifest are not modified because they already exist.
Watch the StatefulSet controller recreate the StatefulSet's Pods.
kubectl get pods -w -l app=zk
Once the zk-2 Pod is Running and Ready, use CTRL-C to terminate kubectl.
NAME READY STATUS RESTARTS AGE
zk-0 0/1 Pending 0 0s
zk-0 0/1 Pending 0 0s
zk-0 0/1 ContainerCreating 0 0s
zk-0 0/1 Running 0 19s
zk-0 1/1 Running 0 40s
zk-1 0/1 Pending 0 0s
zk-1 0/1 Pending 0 0s
zk-1 0/1 ContainerCreating 0 0s
zk-1 0/1 Running 0 18s
zk-1 1/1 Running 0 40s
zk-2 0/1 Pending 0 0s
zk-2 0/1 Pending 0 0s
zk-2 0/1 ContainerCreating 0 0s
zk-2 0/1 Running 0 19s
zk-2 1/1 Running 0 40s
Use the command below to get the value you entered during the sanity test,
from the zk-2
Pod.
kubectl exec zk-2 -- zkCli.sh get /hello
Even though you terminated and recreated all of the Pods in the zk
StatefulSet, the ensemble still serves the original value.
WATCHER::
WatchedEvent state:SyncConnected type:None path:null
world
cZxid = 0x100000002
ctime = Thu Dec 08 15:13:30 UTC 2016
mZxid = 0x100000002
mtime = Thu Dec 08 15:13:30 UTC 2016
pZxid = 0x100000002
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 5
numChildren = 0
The volumeClaimTemplates
field of the zk
StatefulSet's spec
specifies a PersistentVolume provisioned for each Pod.
volumeClaimTemplates:
- metadata:
    name: datadir
    annotations:
      volume.alpha.kubernetes.io/storage-class: anything
  spec:
    accessModes: [ "ReadWriteOnce" ]
    resources:
      requests:
        storage: 20Gi
The StatefulSet controller generates a PersistentVolumeClaim for each Pod in the StatefulSet. Use the following command to get the StatefulSet's PersistentVolumeClaims.
kubectl get pvc -l app=zk
When the StatefulSet recreated its Pods, it remounted the Pods' PersistentVolumes.
NAME STATUS VOLUME CAPACITY ACCESSMODES AGE
datadir-zk-0 Bound pvc-bed742cd-bcb1-11e6-994f-42010a800002 20Gi RWO 1h
datadir-zk-1 Bound pvc-bedd27d2-bcb1-11e6-994f-42010a800002 20Gi RWO 1h
datadir-zk-2 Bound pvc-bee0817e-bcb1-11e6-994f-42010a800002 20Gi RWO 1h
The volumeMounts section of the StatefulSet's container template mounts the PersistentVolumes in the ZooKeeper servers' data directories.
volumeMounts:
- name: datadir
  mountPath: /var/lib/zookeeper
When a Pod in the zk
StatefulSet
is (re)scheduled, it will always have the
same PersistentVolume
mounted to the ZooKeeper server's data directory.
Even when the Pods are rescheduled, all the writes made to the ZooKeeper
servers' WALs, and all their snapshots, remain durable.
Ensuring consistent configuration
As noted in the Facilitating Leader Election and Achieving Consensus sections, the servers in a ZooKeeper ensemble require consistent configuration to elect a leader and form a quorum. They also require consistent configuration of the Zab protocol in order for the protocol to work correctly over a network. In our example we achieve consistent configuration by embedding the configuration directly into the manifest.
Get the zk
StatefulSet.
kubectl get sts zk -o yaml
…
command:
- sh
- -c
- "start-zookeeper \
  --servers=3 \
  --data_dir=/var/lib/zookeeper/data \
  --data_log_dir=/var/lib/zookeeper/data/log \
  --conf_dir=/opt/zookeeper/conf \
  --client_port=2181 \
  --election_port=3888 \
  --server_port=2888 \
  --tick_time=2000 \
  --init_limit=10 \
  --sync_limit=5 \
  --heap=512M \
  --max_client_cnxns=60 \
  --snap_retain_count=3 \
  --purge_interval=12 \
  --max_session_timeout=40000 \
  --min_session_timeout=4000 \
  --log_level=INFO"
…
The command used to start the ZooKeeper servers passes the configuration as command line parameters. You can also use environment variables to pass configuration to the ensemble.
Configuring logging
One of the files generated by the zkGenConfig.sh
script controls ZooKeeper's logging.
ZooKeeper uses Log4j, and, by default,
it uses a time and size based rolling file appender for its logging configuration.
Use the command below to get the logging configuration from one of the Pods in the zk StatefulSet.
kubectl exec zk-0 -- cat /usr/etc/zookeeper/log4j.properties
The logging configuration below will cause the ZooKeeper process to write all of its logs to the standard output file stream.
zookeeper.root.logger=CONSOLE
zookeeper.console.threshold=INFO
log4j.rootLogger=${zookeeper.root.logger}
log4j.appender.CONSOLE=org.apache.log4j.ConsoleAppender
log4j.appender.CONSOLE.Threshold=${zookeeper.console.threshold}
log4j.appender.CONSOLE.layout=org.apache.log4j.PatternLayout
log4j.appender.CONSOLE.layout.ConversionPattern=%d{ISO8601} [myid:%X{myid}] - %-5p [%t:%C{1}@%L] - %m%n
This is the simplest possible way to safely log inside the container. Because the applications write logs to standard out, Kubernetes will handle log rotation for you. Kubernetes also implements a sane retention policy that ensures application logs written to standard out and standard error do not exhaust local storage media.
Use kubectl logs
to retrieve the last 20 log lines from one of the Pods.
kubectl logs zk-0 --tail 20
You can view application logs written to standard out or standard error using kubectl logs
and from the Kubernetes Dashboard.
2016-12-06 19:34:16,236 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] - Processing ruok command from /127.0.0.1:52740
2016-12-06 19:34:16,237 [myid:1] - INFO [Thread-1136:NIOServerCnxn@1008] - Closed socket connection for client /127.0.0.1:52740 (no session established for client)
2016-12-06 19:34:26,155 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /127.0.0.1:52749
2016-12-06 19:34:26,155 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] - Processing ruok command from /127.0.0.1:52749
2016-12-06 19:34:26,156 [myid:1] - INFO [Thread-1137:NIOServerCnxn@1008] - Closed socket connection for client /127.0.0.1:52749 (no session established for client)
2016-12-06 19:34:26,222 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /127.0.0.1:52750
2016-12-06 19:34:26,222 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] - Processing ruok command from /127.0.0.1:52750
2016-12-06 19:34:26,226 [myid:1] - INFO [Thread-1138:NIOServerCnxn@1008] - Closed socket connection for client /127.0.0.1:52750 (no session established for client)
2016-12-06 19:34:36,151 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /127.0.0.1:52760
2016-12-06 19:34:36,152 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] - Processing ruok command from /127.0.0.1:52760
2016-12-06 19:34:36,152 [myid:1] - INFO [Thread-1139:NIOServerCnxn@1008] - Closed socket connection for client /127.0.0.1:52760 (no session established for client)
2016-12-06 19:34:36,230 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /127.0.0.1:52761
2016-12-06 19:34:36,231 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] - Processing ruok command from /127.0.0.1:52761
2016-12-06 19:34:36,231 [myid:1] - INFO [Thread-1140:NIOServerCnxn@1008] - Closed socket connection for client /127.0.0.1:52761 (no session established for client)
2016-12-06 19:34:46,149 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /127.0.0.1:52767
2016-12-06 19:34:46,149 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] - Processing ruok command from /127.0.0.1:52767
2016-12-06 19:34:46,149 [myid:1] - INFO [Thread-1141:NIOServerCnxn@1008] - Closed socket connection for client /127.0.0.1:52767 (no session established for client)
2016-12-06 19:34:46,230 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /127.0.0.1:52768
2016-12-06 19:34:46,230 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] - Processing ruok command from /127.0.0.1:52768
2016-12-06 19:34:46,230 [myid:1] - INFO [Thread-1142:NIOServerCnxn@1008] - Closed socket connection for client /127.0.0.1:52768 (no session established for client)
Kubernetes integrates with many logging solutions. You can choose a logging solution that best fits your cluster and applications. For cluster-level logging and aggregation, consider deploying a sidecar container to rotate and ship your logs.
Configuring a non-privileged user
The best practices to allow an application to run as a privileged user inside of a container are a matter of debate. If your organization requires that applications run as a non-privileged user you can use a SecurityContext to control the user that the entry point runs as.
The zk StatefulSet's Pod template contains a SecurityContext.
securityContext:
  runAsUser: 1000
  fsGroup: 1000
In the Pods' containers, UID 1000 corresponds to the zookeeper user and GID 1000 corresponds to the zookeeper group.
Get the ZooKeeper process information from the zk-0
Pod.
kubectl exec zk-0 -- ps -elf
As the runAsUser
field of the securityContext
object is set to 1000,
instead of running as root, the ZooKeeper process runs as the zookeeper user.
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
4 S zookeep+ 1 0 0 80 0 - 1127 - 20:46 ? 00:00:00 sh -c zkGenConfig.sh && zkServer.sh start-foreground
0 S zookeep+ 27 1 0 80 0 - 1155556 - 20:46 ? 00:00:19 /usr/lib/jvm/java-8-openjdk-amd64/bin/java -Dzookeeper.log.dir=/var/log/zookeeper -Dzookeeper.root.logger=INFO,CONSOLE -cp /usr/bin/../build/classes:/usr/bin/../build/lib/*.jar:/usr/bin/../share/zookeeper/zookeeper-3.4.9.jar:/usr/bin/../share/zookeeper/slf4j-log4j12-1.6.1.jar:/usr/bin/../share/zookeeper/slf4j-api-1.6.1.jar:/usr/bin/../share/zookeeper/netty-3.10.5.Final.jar:/usr/bin/../share/zookeeper/log4j-1.2.16.jar:/usr/bin/../share/zookeeper/jline-0.9.94.jar:/usr/bin/../src/java/lib/*.jar:/usr/bin/../etc/zookeeper: -Xmx2G -Xms2G -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.local.only=false org.apache.zookeeper.server.quorum.QuorumPeerMain /usr/bin/../etc/zookeeper/zoo.cfg
By default, when the Pod's PersistentVolume is mounted to the ZooKeeper server's data directory, it is only accessible by the root user. This configuration would prevent the ZooKeeper process from writing to its WAL and storing its snapshots.
Use the command below to get the file permissions of the ZooKeeper data directory on the zk-0
Pod.
kubectl exec -ti zk-0 -- ls -ld /var/lib/zookeeper/data
Because the fsGroup
field of the securityContext
object is set to 1000, the ownership of the Pods' PersistentVolumes is set to the zookeeper group, and the ZooKeeper process is able to read and write its data.
drwxr-sr-x 3 zookeeper zookeeper 4096 Dec 5 20:45 /var/lib/zookeeper/data
Managing the ZooKeeper process
The ZooKeeper documentation mentions that "You will want to have a supervisory process that manages each of your ZooKeeper server processes (JVM)." Utilizing a watchdog (supervisory process) to restart failed processes in a distributed system is a common pattern. When deploying an application in Kubernetes, rather than using an external utility as a supervisory process, you should use Kubernetes as the watchdog for your application.
Updating the ensemble
The zk StatefulSet is configured to use the RollingUpdate update strategy.
You can use kubectl patch to update the number of CPUs allocated to the servers.
kubectl patch sts zk --type='json' -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/resources/requests/cpu", "value":"0.3"}]'
statefulset.apps/zk patched
Use kubectl rollout status
to watch the status of the update.
kubectl rollout status sts/zk
waiting for statefulset rolling update to complete 0 pods at revision zk-5db4499664...
Waiting for 1 pods to be ready...
Waiting for 1 pods to be ready...
waiting for statefulset rolling update to complete 1 pods at revision zk-5db4499664...
Waiting for 1 pods to be ready...
Waiting for 1 pods to be ready...
waiting for statefulset rolling update to complete 2 pods at revision zk-5db4499664...
Waiting for 1 pods to be ready...
Waiting for 1 pods to be ready...
statefulset rolling update complete 3 pods at revision zk-5db4499664...
This terminates the Pods, one at a time, in reverse ordinal order, and recreates them with the new configuration. This ensures that quorum is maintained during a rolling update.
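If you want to confirm which revision each Pod is running during or after the update, one way is to display the controller-revision-hash label that the StatefulSet controller applies to its Pods:
kubectl get pods -l app=zk -L controller-revision-hash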
Use the kubectl rollout history command to view a history of previous configurations.
kubectl rollout history sts/zk
The output is similar to this:
statefulsets "zk"
REVISION
1
2
Use the kubectl rollout undo
command to roll back the modification.
kubectl rollout undo sts/zk
The output is similar to this:
statefulset.apps/zk rolled back
Handling process failure
Restart Policies control how
Kubernetes handles process failures for the entry point of the container in a Pod.
For Pods in a StatefulSet, the only appropriate RestartPolicy is Always, and this is the default value. For stateful applications you should never override the default policy.
Use the following command to examine the process tree for the ZooKeeper server running in the zk-0
Pod.
kubectl exec zk-0 -- ps -ef
The command used as the container's entry point has PID 1, and the ZooKeeper process, a child of the entry point, has PID 27.
UID PID PPID C STIME TTY TIME CMD
zookeep+ 1 0 0 15:03 ? 00:00:00 sh -c zkGenConfig.sh && zkServer.sh start-foreground
zookeep+ 27 1 0 15:03 ? 00:00:03 /usr/lib/jvm/java-8-openjdk-amd64/bin/java -Dzookeeper.log.dir=/var/log/zookeeper -Dzookeeper.root.logger=INFO,CONSOLE -cp /usr/bin/../build/classes:/usr/bin/../build/lib/*.jar:/usr/bin/../share/zookeeper/zookeeper-3.4.9.jar:/usr/bin/../share/zookeeper/slf4j-log4j12-1.6.1.jar:/usr/bin/../share/zookeeper/slf4j-api-1.6.1.jar:/usr/bin/../share/zookeeper/netty-3.10.5.Final.jar:/usr/bin/../share/zookeeper/log4j-1.2.16.jar:/usr/bin/../share/zookeeper/jline-0.9.94.jar:/usr/bin/../src/java/lib/*.jar:/usr/bin/../etc/zookeeper: -Xmx2G -Xms2G -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.local.only=false org.apache.zookeeper.server.quorum.QuorumPeerMain /usr/bin/../etc/zookeeper/zoo.cfg
In another terminal watch the Pods in the zk
StatefulSet
with the following command.
kubectl get pod -w -l app=zk
In another terminal, terminate the ZooKeeper process in Pod zk-0
with the following command.
kubectl exec zk-0 -- pkill java
The termination of the ZooKeeper process caused its parent process to terminate. Because the RestartPolicy
of the container is Always, it restarted the parent process.
NAME READY STATUS RESTARTS AGE
zk-0 1/1 Running 0 21m
zk-1 1/1 Running 0 20m
zk-2 1/1 Running 0 19m
NAME READY STATUS RESTARTS AGE
zk-0 0/1 Error 0 29m
zk-0 0/1 Running 1 29m
zk-0 1/1 Running 1 29m
If your application uses a script (such as zkServer.sh
) to launch the process
that implements the application's business logic, the script must terminate with the
child process. This ensures that Kubernetes will restart the application's
container when the process implementing the application's business logic fails.
Testing for liveness
Configuring your application to restart failed processes is not enough to keep a distributed system healthy. There are scenarios where a system's processes can be both alive and unresponsive, or otherwise unhealthy. You should use liveness probes to notify Kubernetes that your application's processes are unhealthy and it should restart them.
The Pod template for the zk StatefulSet specifies a liveness probe.
livenessProbe:
  exec:
    command:
    - sh
    - -c
    - "zookeeper-ready 2181"
  initialDelaySeconds: 15
  timeoutSeconds: 5
The probe calls a bash script that uses the ZooKeeper ruok
four letter
word to test the server's health.
OK=$(echo ruok | nc 127.0.0.1 $1)
if [ "$OK" == "imok" ]; then
exit 0
else
exit 1
fi
In one terminal window, use the following command to watch the Pods in the zk
StatefulSet.
kubectl get pod -w -l app=zk
In another window, use the following command to delete the zookeeper-ready script from the file system of Pod zk-0.
kubectl exec zk-0 -- rm /opt/zookeeper/bin/zookeeper-ready
When the liveness probe for the ZooKeeper process fails, Kubernetes will automatically restart the process for you, ensuring that unhealthy processes in the ensemble are restarted.
kubectl get pod -w -l app=zk
NAME READY STATUS RESTARTS AGE
zk-0 1/1 Running 0 1h
zk-1 1/1 Running 0 1h
zk-2 1/1 Running 0 1h
NAME READY STATUS RESTARTS AGE
zk-0 0/1 Running 0 1h
zk-0 0/1 Running 1 1h
zk-0 1/1 Running 1 1h
Testing for readiness
Readiness is not the same as liveness. If a process is alive, it is scheduled and healthy. If a process is ready, it is able to process input. Liveness is a necessary, but not sufficient, condition for readiness. There are cases, particularly during initialization and termination, when a process can be alive but not ready.
If you specify a readiness probe, Kubernetes will ensure that your application's processes will not receive network traffic until their readiness checks pass.
For a ZooKeeper server, liveness implies readiness. Therefore, the readiness
probe from the zookeeper.yaml
manifest is identical to the liveness probe.
readinessProbe:
  exec:
    command:
    - sh
    - -c
    - "zookeeper-ready 2181"
  initialDelaySeconds: 15
  timeoutSeconds: 5
Even though the liveness and readiness probes are identical, it is important to specify both. This ensures that only healthy servers in the ZooKeeper ensemble receive network traffic.
Tolerating Node failure
ZooKeeper needs a quorum of servers to successfully commit mutations to data. For a three server ensemble, two servers must be healthy for writes to succeed. In quorum based systems, members are deployed across failure domains to ensure availability. To avoid an outage due to the loss of an individual machine, best practices preclude co-locating multiple instances of the application on the same machine.
By default, Kubernetes may co-locate Pods in a StatefulSet
on the same node.
For the three server ensemble you created, if two servers are on the same node, and that node fails,
the clients of your ZooKeeper service will experience an outage until at least one of the Pods can be rescheduled.
You should always provision additional capacity to allow the processes of critical
systems to be rescheduled in the event of node failures. If you do so, then the
outage will only last until the Kubernetes scheduler reschedules one of the ZooKeeper
servers. However, if you want your service to tolerate node failures with no downtime,
you should set podAntiAffinity.
Use the command below to get the nodes for Pods in the zk StatefulSet.
for i in 0 1 2; do kubectl get pod zk-$i --template {{.spec.nodeName}}; echo ""; done
All of the Pods in the zk
StatefulSet
are deployed on different nodes.
kubernetes-node-cxpk
kubernetes-node-a5aq
kubernetes-node-2g2d
This is because the Pods in the zk
StatefulSet
have a PodAntiAffinity
specified.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: "app"
          operator: In
          values:
          - zk
      topologyKey: "kubernetes.io/hostname"
The requiredDuringSchedulingIgnoredDuringExecution field tells the Kubernetes Scheduler that it should never co-locate two Pods which have the app label set to zk in the domain defined by the topologyKey. The topologyKey kubernetes.io/hostname indicates that the domain is an individual node. Using different rules, labels, and selectors, you can extend this technique to spread your ensemble across physical, network, and power failure domains.
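For example, to spread the ensemble across zones instead of individual nodes, a sketch of the same rule with a zone-level topologyKey (assuming your nodes carry the standard topology.kubernetes.io/zone label) might look like:
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: "app"
          operator: In
          values:
          - zk
      topologyKey: "topology.kubernetes.io/zone"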
Surviving maintenance
In this section you will cordon and drain nodes. If you are using this tutorial on a shared cluster, be sure that this will not adversely affect other tenants.
The previous section showed you how to spread your Pods across nodes to survive unplanned node failures, but you also need to plan for temporary node failures that occur due to planned maintenance.
Use this command to get the nodes in your cluster.
kubectl get nodes
This tutorial assumes a cluster with at least four nodes. If the cluster has more than four, use kubectl cordon
to cordon all but four nodes. Constraining to four nodes will ensure Kubernetes encounters affinity and PodDisruptionBudget constraints when scheduling zookeeper Pods in the following maintenance simulation.
kubectl cordon <node-name>
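If you have many surplus nodes, a loop like the following sketch can cordon all but the first four (which four nodes you keep schedulable is up to you; the selection here is purely illustrative):
# Cordon every node except the first four returned by "kubectl get nodes"
for node in $(kubectl get nodes -o name | tail -n +5); do
  kubectl cordon "$node"
done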
Use this command to get the zk-pdb PodDisruptionBudget.
kubectl get pdb zk-pdb
The max-unavailable field indicates to Kubernetes that at most one Pod from the zk StatefulSet can be unavailable at any time.
NAME MIN-AVAILABLE MAX-UNAVAILABLE ALLOWED-DISRUPTIONS AGE
zk-pdb N/A 1 1
In one terminal, use this command to watch the Pods in the zk StatefulSet.
kubectl get pods -w -l app=zk
In another terminal, use this command to get the nodes that the Pods are currently scheduled on.
for i in 0 1 2; do kubectl get pod zk-$i --template {{.spec.nodeName}}; echo ""; done
The output is similar to this:
kubernetes-node-pb41
kubernetes-node-ixsl
kubernetes-node-i4c4
Use kubectl drain
to cordon and
drain the node on which the zk-0
Pod is scheduled.
kubectl drain $(kubectl get pod zk-0 --template {{.spec.nodeName}}) --ignore-daemonsets --force --delete-emptydir-data
The output is similar to this:
node "kubernetes-node-pb41" cordoned
WARNING: Deleting pods not managed by ReplicationController, ReplicaSet, Job, or DaemonSet: fluentd-cloud-logging-kubernetes-node-pb41, kube-proxy-kubernetes-node-pb41; Ignoring DaemonSet-managed pods: node-problem-detector-v0.1-o5elz
pod "zk-0" deleted
node "kubernetes-node-pb41" drained
As there are four nodes in your cluster, kubectl drain succeeds and the zk-0 Pod is rescheduled to another node.
NAME READY STATUS RESTARTS AGE
zk-0 1/1 Running 2 1h
zk-1 1/1 Running 0 1h
zk-2 1/1 Running 0 1h
NAME READY STATUS RESTARTS AGE
zk-0 1/1 Terminating 2 2h
zk-0 0/1 Terminating 2 2h
zk-0 0/1 Terminating 2 2h
zk-0 0/1 Terminating 2 2h
zk-0 0/1 Pending 0 0s
zk-0 0/1 Pending 0 0s
zk-0 0/1 ContainerCreating 0 0s
zk-0 0/1 Running 0 51s
zk-0 1/1 Running 0 1m
Keep watching the StatefulSet's Pods in the first terminal and drain the node on which zk-1 is scheduled.
kubectl drain $(kubectl get pod zk-1 --template {{.spec.nodeName}}) --ignore-daemonsets --force --delete-emptydir-data
The output is similar to this:
"kubernetes-node-ixsl" cordoned
WARNING: Deleting pods not managed by ReplicationController, ReplicaSet, Job, or DaemonSet: fluentd-cloud-logging-kubernetes-node-ixsl, kube-proxy-kubernetes-node-ixsl; Ignoring DaemonSet-managed pods: node-problem-detector-v0.1-voc74
pod "zk-1" deleted
node "kubernetes-node-ixsl" drained
The zk-1
Pod cannot be scheduled because the zk
StatefulSet
contains a PodAntiAffinity
rule preventing
co-location of the Pods, and as only two nodes are schedulable, the Pod will remain in a Pending state.
kubectl get pods -w -l app=zk
The output is similar to this:
NAME READY STATUS RESTARTS AGE
zk-0 1/1 Running 2 1h
zk-1 1/1 Running 0 1h
zk-2 1/1 Running 0 1h
NAME READY STATUS RESTARTS AGE
zk-0 1/1 Terminating 2 2h
zk-0 0/1 Terminating 2 2h
zk-0 0/1 Terminating 2 2h
zk-0 0/1 Terminating 2 2h
zk-0 0/1 Pending 0 0s
zk-0 0/1 Pending 0 0s
zk-0 0/1 ContainerCreating 0 0s
zk-0 0/1 Running 0 51s
zk-0 1/1 Running 0 1m
zk-1 1/1 Terminating 0 2h
zk-1 0/1 Terminating 0 2h
zk-1 0/1 Terminating 0 2h
zk-1 0/1 Terminating 0 2h
zk-1 0/1 Pending 0 0s
zk-1 0/1 Pending 0 0s
Continue to watch the Pods of the StatefulSet, and drain the node on which
zk-2
is scheduled.
kubectl drain $(kubectl get pod zk-2 --template {{.spec.nodeName}}) --ignore-daemonsets --force --delete-emptydir-data
The output is similar to this:
node "kubernetes-node-i4c4" cordoned
WARNING: Deleting pods not managed by ReplicationController, ReplicaSet, Job, or DaemonSet: fluentd-cloud-logging-kubernetes-node-i4c4, kube-proxy-kubernetes-node-i4c4; Ignoring DaemonSet-managed pods: node-problem-detector-v0.1-dyrog
WARNING: Ignoring DaemonSet-managed pods: node-problem-detector-v0.1-dyrog; Deleting pods not managed by ReplicationController, ReplicaSet, Job, or DaemonSet: fluentd-cloud-logging-kubernetes-node-i4c4, kube-proxy-kubernetes-node-i4c4
There are pending pods when an error occurred: Cannot evict pod as it would violate the pod's disruption budget.
pod/zk-2
Use CTRL-C to terminate kubectl.
You cannot drain the third node because evicting zk-2 would violate the zk-pdb PodDisruptionBudget. However, the node will remain cordoned.
Use zkCli.sh to retrieve the value you entered during the sanity test from zk-0.
kubectl exec zk-0 -- zkCli.sh get /hello
The service is still available because its PodDisruptionBudget
is respected.
WatchedEvent state:SyncConnected type:None path:null
world
cZxid = 0x200000002
ctime = Wed Dec 07 00:08:59 UTC 2016
mZxid = 0x200000002
mtime = Wed Dec 07 00:08:59 UTC 2016
pZxid = 0x200000002
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 5
numChildren = 0
Use kubectl uncordon
to uncordon the first node.
kubectl uncordon kubernetes-node-pb41
The output is similar to this:
node "kubernetes-node-pb41" uncordoned
zk-1
is rescheduled on this node. Wait until zk-1
is Running and Ready.
kubectl get pods -w -l app=zk
The output is similar to this:
NAME READY STATUS RESTARTS AGE
zk-0 1/1 Running 2 1h
zk-1 1/1 Running 0 1h
zk-2 1/1 Running 0 1h
NAME READY STATUS RESTARTS AGE
zk-0 1/1 Terminating 2 2h
zk-0 0/1 Terminating 2 2h
zk-0 0/1 Terminating 2 2h
zk-0 0/1 Terminating 2 2h
zk-0 0/1 Pending 0 0s
zk-0 0/1 Pending 0 0s
zk-0 0/1 ContainerCreating 0 0s
zk-0 0/1 Running 0 51s
zk-0 1/1 Running 0 1m
zk-1 1/1 Terminating 0 2h
zk-1 0/1 Terminating 0 2h
zk-1 0/1 Terminating 0 2h
zk-1 0/1 Terminating 0 2h
zk-1 0/1 Pending 0 0s
zk-1 0/1 Pending 0 0s
zk-1 0/1 Pending 0 12m
zk-1 0/1 ContainerCreating 0 12m
zk-1 0/1 Running 0 13m
zk-1 1/1 Running 0 13m
Attempt to drain the node on which zk-2
is scheduled.
kubectl drain $(kubectl get pod zk-2 --template {{.spec.nodeName}}) --ignore-daemonsets --force --delete-emptydir-data
The output is similar to this:
node "kubernetes-node-i4c4" already cordoned
WARNING: Deleting pods not managed by ReplicationController, ReplicaSet, Job, or DaemonSet: fluentd-cloud-logging-kubernetes-node-i4c4, kube-proxy-kubernetes-node-i4c4; Ignoring DaemonSet-managed pods: node-problem-detector-v0.1-dyrog
pod "heapster-v1.2.0-2604621511-wht1r" deleted
pod "zk-2" deleted
node "kubernetes-node-i4c4" drained
This time kubectl drain
succeeds.
Uncordon the second node to allow zk-2
to be rescheduled.
kubectl uncordon kubernetes-node-ixsl
The output is similar to this:
node "kubernetes-node-ixsl" uncordoned
You can use kubectl drain
in conjunction with PodDisruptionBudgets
to ensure that your services remain available during maintenance.
If drain is used to cordon nodes and evict pods prior to taking the node offline for maintenance,
services that express a disruption budget will have that budget respected.
You should always allocate additional capacity for critical services so that their Pods can be immediately rescheduled.
Cleaning up
- Use kubectl uncordon to uncordon all the nodes in your cluster.
- You must delete the persistent storage media for the PersistentVolumes used in this tutorial. Follow the necessary steps, based on your environment, storage configuration, and provisioning method, to ensure that all storage is reclaimed.
5.7 - Services
5.7.1 - Using Source IP
Applications running in a Kubernetes cluster find and communicate with each other, and the outside world, through the Service abstraction. This document explains what happens to the source IP of packets sent to different types of Services, and how you can toggle this behavior according to your needs.
Before you begin
Terminology
This document makes use of the following terms:
- NAT: network address translation
- Source NAT: replacing the source IP on a packet; in this page, that usually means replacing with the IP address of a node.
- Destination NAT: replacing the destination IP on a packet; in this page, that usually means replacing with the IP address of a Pod.
- VIP: a virtual IP address, such as the one assigned to every Service in Kubernetes.
- kube-proxy: a network daemon that orchestrates Service VIP management on every node.
Prerequisites
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube, or you can use one of the hosted Kubernetes playgrounds.
The examples use a small nginx webserver that echoes back the source IP of requests it receives through an HTTP header. You can create it as follows:
kubectl create deployment source-ip-app --image=k8s.gcr.io/echoserver:1.4
The output is:
deployment.apps/source-ip-app created
Objectives
- Expose a simple application through various types of Services
- Understand how each Service type handles source IP NAT
- Understand the tradeoffs involved in preserving source IP
Source IP for Services with Type=ClusterIP
Packets sent to ClusterIP from within the cluster are never source NAT'd if you're running kube-proxy in iptables mode (the default). You can query the kube-proxy mode by fetching http://localhost:10249/proxyMode on the node where kube-proxy is running.
kubectl get nodes
The output is similar to this:
NAME STATUS ROLES AGE VERSION
kubernetes-node-6jst Ready <none> 2h v1.13.0
kubernetes-node-cx31 Ready <none> 2h v1.13.0
kubernetes-node-jj1t Ready <none> 2h v1.13.0
Get the proxy mode on one of the nodes (kube-proxy listens on port 10249):
# Run this in a shell on the node you want to query.
curl http://localhost:10249/proxyMode
The output is:
iptables
You can test source IP preservation by creating a Service over the source IP app:
kubectl expose deployment source-ip-app --name=clusterip --port=80 --target-port=8080
The output is:
service/clusterip exposed
kubectl get svc clusterip
The output is similar to:
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
clusterip ClusterIP 10.0.170.92 <none> 80/TCP 51s
And hitting the ClusterIP
from a pod in the same cluster:
kubectl run busybox -it --image=busybox:1.28 --restart=Never --rm
The output is similar to this:
Waiting for pod default/busybox to be running, status is Pending, pod ready: false
If you don't see a command prompt, try pressing enter.
You can then run a command inside that Pod:
# Run this inside the terminal from "kubectl run"
ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
3: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc noqueue
link/ether 0a:58:0a:f4:03:08 brd ff:ff:ff:ff:ff:ff
inet 10.244.3.8/24 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::188a:84ff:feb0:26a5/64 scope link
valid_lft forever preferred_lft forever
…then use wget
to query the local webserver
# Replace "10.0.170.92" with the IPv4 address of the Service named "clusterip"
wget -qO - 10.0.170.92
CLIENT VALUES:
client_address=10.244.3.8
command=GET
...
The client_address
is always the client pod's IP address, whether the client pod and server pod are in the same node or in different nodes.
Source IP for Services with Type=NodePort
Packets sent to Services with
Type=NodePort
are source NAT'd by default. You can test this by creating a NodePort
Service:
kubectl expose deployment source-ip-app --name=nodeport --port=80 --target-port=8080 --type=NodePort
The output is:
service/nodeport exposed
NODEPORT=$(kubectl get -o jsonpath="{.spec.ports[0].nodePort}" services nodeport)
NODES=$(kubectl get nodes -o jsonpath='{ $.items[*].status.addresses[?(@.type=="InternalIP")].address }')
If you're running on a cloud provider, you may need to open up a firewall-rule
for the nodes:nodeport
reported above.
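For example, on Google Compute Engine you might create a rule similar to the following (the rule name nodeport is arbitrary):
gcloud compute firewall-rules create nodeport --allow=tcp:$NODEPORT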
Now you can try reaching the Service from outside the cluster through the node
port allocated above.
for node in $NODES; do curl -s $node:$NODEPORT | grep -i client_address; done
The output is similar to:
client_address=10.180.1.1
client_address=10.240.0.5
client_address=10.240.0.3
Note that these are not the correct client IPs, they're cluster internal IPs. This is what happens:
- Client sends packet to node2:nodePort
- node2 replaces the source IP address (SNAT) in the packet with its own IP address
- node2 replaces the destination IP on the packet with the pod IP
- packet is routed to node 1, and then to the endpoint
- the pod's reply is routed back to node2
- the pod's reply is sent back to the client
Visually: [diagram: the client's packet is SNAT'd at node 2 and forwarded to the endpoint on node 1]
To avoid this, Kubernetes has a feature to preserve the client source IP. If you set service.spec.externalTrafficPolicy to the value Local, kube-proxy only proxies requests to local endpoints, and does not forward traffic to other nodes. This approach preserves the original source IP address. If there are no local endpoints, packets sent to the node are dropped, so you can rely on the correct source IP in any packet processing rules you might apply to a packet that makes it through to the endpoint.
Set the service.spec.externalTrafficPolicy
field as follows:
kubectl patch svc nodeport -p '{"spec":{"externalTrafficPolicy":"Local"}}'
The output is:
service/nodeport patched
Now, re-run the test:
for node in $NODES; do curl --connect-timeout 1 -s $node:$NODEPORT | grep -i client_address; done
The output is similar to:
client_address=198.51.100.79
Note that you only got one reply, with the right client IP, from the one node on which the endpoint pod is running.
This is what happens:
- client sends packet to node2:nodePort, which doesn't have any endpoints
- packet is dropped
- client sends packet to node1:nodePort, which does have endpoints
- node1 routes packet to endpoint with the correct source IP
Visually: [diagram: packets to a node without endpoints are dropped; packets to the node with the endpoint keep the client source IP]
Source IP for Services with Type=LoadBalancer
Packets sent to Services with Type=LoadBalancer are source NAT'd by default, because all schedulable Kubernetes nodes in the Ready state are eligible for load-balanced traffic. So if packets arrive at a node without an endpoint, the system proxies them to a node with an endpoint, replacing the source IP on the packet with the IP of the node (as described in the previous section).
You can test this by exposing the source-ip-app through a load balancer:
kubectl expose deployment source-ip-app --name=loadbalancer --port=80 --target-port=8080 --type=LoadBalancer
The output is:
service/loadbalancer exposed
Print out the IP addresses of the Service:
kubectl get svc loadbalancer
The output is similar to this:
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
loadbalancer LoadBalancer 10.0.65.118 203.0.113.140 80/TCP 5m
Next, send a request to this Service's external-ip:
curl 203.0.113.140
The output is similar to this:
CLIENT VALUES:
client_address=10.240.0.5
...
However, if you're running on Google Kubernetes Engine/GCE, setting the same service.spec.externalTrafficPolicy
field to Local
forces nodes without Service endpoints to remove
themselves from the list of nodes eligible for loadbalanced traffic by
deliberately failing health checks.
Visually: [diagram: only nodes with local endpoints pass the load balancer's health check and receive traffic]
You can test this by setting the externalTrafficPolicy field:
kubectl patch svc loadbalancer -p '{"spec":{"externalTrafficPolicy":"Local"}}'
You should immediately see the service.spec.healthCheckNodePort
field allocated
by Kubernetes:
kubectl get svc loadbalancer -o yaml | grep -i healthCheckNodePort
The output is similar to this:
healthCheckNodePort: 32122
The service.spec.healthCheckNodePort
field points to a port on every node
serving the health check at /healthz
. You can test this:
kubectl get pod -o wide -l run=source-ip-app
The output is similar to this:
NAME READY STATUS RESTARTS AGE IP NODE
source-ip-app-826191075-qehz4 1/1 Running 0 20h 10.180.1.136 kubernetes-node-6jst
Use curl
to fetch the /healthz
endpoint on various nodes:
# Run this locally on a node you choose
curl localhost:32122/healthz
1 Service Endpoints found
On a different node you might get a different result:
# Run this locally on a node you choose
curl localhost:32122/healthz
No Service Endpoints Found
A controller running on the
control plane is
responsible for allocating the cloud load balancer. The same controller also
allocates HTTP health checks pointing to this port/path on each node. Wait
about 10 seconds for the 2 nodes without endpoints to fail health checks,
then use curl
to query the IPv4 address of the load balancer:
curl 203.0.113.140
The output is similar to this:
CLIENT VALUES:
client_address=198.51.100.79
...
Cross-platform support
Only some cloud providers offer support for source IP preservation through
Services with Type=LoadBalancer.
The cloud provider you're running on might fulfill the request for a loadbalancer in a few different ways:
- With a proxy that terminates the client connection and opens a new connection to your nodes/endpoints. In such cases the source IP will always be that of the cloud LB, not that of the client.
- With a packet forwarder, such that requests from the client sent to the loadbalancer VIP end up at the node with the source IP of the client, not an intermediate proxy.
Load balancers in the first category must use an agreed upon
protocol between the loadbalancer and backend to communicate the true client IP
such as the HTTP Forwarded
or X-FORWARDED-FOR
headers, or the
proxy protocol.
Load balancers in the second category can leverage the feature described above
by creating an HTTP health check pointing at the port stored in
the service.spec.healthCheckNodePort
field on the Service.
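Rather than patching an existing Service, you can also declare the policy directly in a manifest; a minimal sketch, with the name, selector, and ports chosen to match this tutorial's app:
apiVersion: v1
kind: Service
metadata:
  name: loadbalancer
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  selector:
    app: source-ip-app
  ports:
  - port: 80
    targetPort: 8080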
Cleaning up
Delete the Services:
kubectl delete svc -l app=source-ip-app
Delete the Deployment, ReplicaSet and Pod:
kubectl delete deployment source-ip-app
What's next
- Learn more about connecting applications via services
- Read how to Create an External Load Balancer
6 - Reference
This section of the Kubernetes documentation contains references.
API Reference
- Glossary - a comprehensive, standardized list of Kubernetes terminology
- Using The Kubernetes API - overview of the API for Kubernetes.
- API access control - details on how Kubernetes controls API access
Officially supported client libraries
To call the Kubernetes API from a programming language, you can use client libraries. Officially supported client libraries:
- Kubernetes Go client library
- Kubernetes Python client library
- Kubernetes Java client library
- Kubernetes JavaScript client library
- Kubernetes C# client library
- Kubernetes Haskell client library
CLI
- kubectl - Main CLI tool for running commands and managing Kubernetes clusters.
- JSONPath - Syntax guide for using JSONPath expressions with kubectl.
- kubeadm - CLI tool to easily provision a secure Kubernetes cluster.
Components
- kubelet - The primary agent that runs on each node. The kubelet takes a set of PodSpecs and ensures that the described containers are running and healthy.
- kube-apiserver - REST API that validates and configures data for API objects such as pods, services, replication controllers.
- kube-controller-manager - Daemon that embeds the core control loops shipped with Kubernetes.
- kube-proxy - Can do simple TCP/UDP stream forwarding or round-robin TCP/UDP forwarding across a set of back-ends.
- kube-scheduler - Scheduler that manages availability, performance, and capacity.
- List of ports and protocols that should be open on control plane and worker nodes
Config APIs
This section hosts the documentation for "unpublished" APIs which are used to configure kubernetes components or tools. Most of these APIs are not exposed by the API server in a RESTful way though they are essential for a user or an operator to use or manage a cluster.
- kube-apiserver configuration (v1alpha1)
- kube-apiserver configuration (v1)
- kube-apiserver encryption (v1)
- kubelet configuration (v1alpha1) and kubelet configuration (v1beta1)
- kubelet credential providers (v1alpha1)
- kube-scheduler configuration (v1beta2) and kube-scheduler configuration (v1beta3)
- kube-proxy configuration (v1alpha1)
- audit.k8s.io/v1 API
- Client authentication API (v1beta1) and Client authentication API (v1)
- WebhookAdmission configuration (v1)
- Config API for kubeadm
Design Docs
An archive of the design docs for Kubernetes functionality. Good starting points are Kubernetes Architecture and Kubernetes Design Overview.
6.1 - Glossary
6.2 - API Overview
This section provides reference information for the Kubernetes API.
The REST API is the fundamental fabric of Kubernetes. All operations and communications between components, and external user commands are REST API calls that the API Server handles. Consequently, everything in the Kubernetes platform is treated as an API object and has a corresponding entry in the API.
The Kubernetes API reference lists the API for Kubernetes version v1.23.
For general background information, read The Kubernetes API. Controlling Access to the Kubernetes API describes how clients can authenticate to the Kubernetes API server, and how their requests are authorized.
API versioning
The JSON and Protobuf serialization schemas follow the same guidelines for schema changes. The following descriptions cover both formats.
The API versioning and software versioning are indirectly related. The API and release versioning proposal describes the relationship between API versioning and software versioning.
Different API versions indicate different levels of stability and support. You can find more information about the criteria for each level in the API Changes documentation.
Here's a summary of each level:
-
Alpha:
- The version names contain
alpha
(for example,v1alpha1
). - The software may contain bugs. Enabling a feature may expose bugs. A feature may be disabled by default.
- The support for a feature may be dropped at any time without notice.
- The API may change in incompatible ways in a later software release without notice.
- The software is recommended for use only in short-lived testing clusters, due to increased risk of bugs and lack of long-term support.
- The version names contain
- Beta:
  - The version names contain beta (for example, v2beta3).
  - The software is well tested. Enabling a feature is considered safe. Features are enabled by default.
  - The support for a feature will not be dropped, though the details may change.
  - The schema and/or semantics of objects may change in incompatible ways in a subsequent beta or stable release. When this happens, migration instructions are provided. Schema changes may require deleting, editing, and re-creating API objects. The editing process may not be straightforward. The migration may require downtime for applications that rely on the feature.
  - The software is not recommended for production uses. Subsequent releases may introduce incompatible changes. If you have multiple clusters which can be upgraded independently, you may be able to relax this restriction.
  Note: Please try beta features and provide feedback. After the features exit beta, it may not be practical to make more changes.
- Stable:
  - The version name is vX where X is an integer.
  - The stable versions of features appear in released software for many subsequent versions.
API groups
API groups make it easier to extend the Kubernetes API. The API group is specified in a REST path and in the apiVersion field of a serialized object.
There are several API groups in Kubernetes:
- The core (also called legacy) group is found at REST path /api/v1. The core group is not specified as part of the apiVersion field, for example, apiVersion: v1.
- The named groups are at REST path /apis/$GROUP_NAME/$VERSION and use apiVersion: $GROUP_NAME/$VERSION (for example, apiVersion: batch/v1). You can find the full list of supported API groups in the Kubernetes API reference.
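For instance, this is how the two conventions appear at the top of manifests; a minimal illustration, with all fields other than apiVersion and kind omitted:
apiVersion: v1       # core (legacy) group: no group name, only a version
kind: Pod
---
apiVersion: batch/v1 # named group and version
kind: Job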
Enabling or disabling API groups
Certain resources and API groups are enabled by default. You can enable or disable them by setting --runtime-config on the API server. The --runtime-config flag accepts comma separated <key>[=<value>] pairs describing the runtime configuration of the API server. If the =<value> part is omitted, it is treated as if =true were specified. For example:
- to disable batch/v1, set --runtime-config=batch/v1=false
- to enable batch/v2alpha1, set --runtime-config=batch/v2alpha1
Note: When you enable or disable groups or resources, you need to restart the API server and controller manager to pick up the --runtime-config changes.
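You can check which group/version combinations your API server currently serves (reflecting any --runtime-config settings) with:
kubectl api-versions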
Persistence
Kubernetes stores its serialized state in terms of the API resources by writing them into etcd.
What's next
- Learn more about API conventions
- Read the design documentation for aggregator
6.2.1 - Kubernetes API Concepts
The Kubernetes API is a resource-based (RESTful) programmatic interface provided via HTTP. It supports retrieving, creating, updating, and deleting primary resources via the standard HTTP verbs (POST, PUT, PATCH, DELETE, GET).
For some resources, the API includes additional subresources that allow fine grained authorization (such as separating viewing details for a Pod from retrieving its logs), and can accept and serve those resources in different representations for convenience or efficiency.
Kubernetes supports efficient change notifications on resources via watches. Kubernetes also provides consistent list operations so that API clients can effectively cache, track, and synchronize the state of resources.
You can view the API reference online, or read on to learn about the API in general.
Kubernetes API terminology
Kubernetes generally leverages common RESTful terminology to describe the API concepts:
- A resource type is the name used in the URL (pods, namespaces, services)
- All resource types have a concrete representation (their object schema) which is called a kind
- A list of instances of a resource is known as a collection
- A single instance of a resource type is called a resource, and also usually represents an object
- For some resource types, the API includes one or more sub-resources, which are represented as URI paths below the resource
Most Kubernetes API resource types are objects: they represent a concrete instance of a concept on the cluster, like a pod or namespace. A smaller number of API resource types are virtual in that they often represent operations on objects, rather than objects, such as a permission check (use a POST with a JSON-encoded body of SubjectAccessReview to the subjectaccessreviews resource), or the eviction sub-resource of a Pod (used to trigger API-initiated eviction).
Object names
All objects you can create via the API have a unique object name to allow idempotent creation and retrieval, except that virtual resource types may not have unique names if they are not retrievable, or do not rely on idempotency. Within a namespace, only one object of a given kind can have a given name at a time. However, if you delete the object, you can make a new object with the same name. Some objects are not namespaced (for example: Nodes), and so their names must be unique across the whole cluster.
API verbs
Almost all object resource types support the standard HTTP verbs - GET, POST, PUT, PATCH, and DELETE. Kubernetes also uses its own verbs, which are often written lowercase to distinguish them from HTTP verbs.
Kubernetes uses the term list to describe returning a collection of resources, to distinguish it from retrieving a single resource, which is usually called a get. If you send an HTTP GET request with the ?watch query parameter, Kubernetes calls this a watch and not a get (see Efficient detection of changes for more details).
For PUT requests, Kubernetes internally classifies these as either create or update based on the state of the existing object. An update is different from a patch; the HTTP verb for a patch is PATCH.
Resource URIs
All resource types are either scoped by the cluster (/apis/GROUP/VERSION/*) or to a namespace (/apis/GROUP/VERSION/namespaces/NAMESPACE/*). A namespace-scoped resource type will be deleted when its namespace is deleted, and access to that resource type is controlled by authorization checks on the namespace scope.
You can also access collections of resources (for example: listing all Nodes). The following paths are used to retrieve collections and resources:
- Cluster-scoped resources:
  - GET /apis/GROUP/VERSION/RESOURCETYPE - return the collection of resources of the resource type
  - GET /apis/GROUP/VERSION/RESOURCETYPE/NAME - return the resource with NAME under the resource type
- Namespace-scoped resources:
  - GET /apis/GROUP/VERSION/RESOURCETYPE - return the collection of all instances of the resource type across all namespaces
  - GET /apis/GROUP/VERSION/namespaces/NAMESPACE/RESOURCETYPE - return the collection of all instances of the resource type in NAMESPACE
  - GET /apis/GROUP/VERSION/namespaces/NAMESPACE/RESOURCETYPE/NAME - return the instance of the resource type with NAME in NAMESPACE
Since a namespace is a cluster-scoped resource type, you can retrieve the list ("collection") of all namespaces with GET /api/v1/namespaces and details about a particular namespace with GET /api/v1/namespaces/NAME.
- Cluster-scoped subresource: GET /apis/GROUP/VERSION/RESOURCETYPE/NAME/SUBRESOURCE
- Namespace-scoped subresource: GET /apis/GROUP/VERSION/namespaces/NAMESPACE/RESOURCETYPE/NAME/SUBRESOURCE
The verbs supported for each subresource will differ depending on the object - see the API reference for more information. It is not possible to access sub-resources across multiple resources - generally a new virtual resource type would be used if that becomes necessary.
Efficient detection of changes
The Kubernetes API allows clients to make an initial request for an object or a collection, and then to track changes since that initial request: a watch. Clients can send a list or a get and then make a follow-up watch request.
To make this change tracking possible, every Kubernetes object has a resourceVersion field representing the version of that resource as stored in the underlying persistence layer. When retrieving a collection of resources (either namespace or cluster scoped), the response from the API server contains a resourceVersion value. The client can use that resourceVersion to initiate a watch against the API server.
When you send a watch request, the API server responds with a stream of changes. These changes itemize the outcome of operations (such as create, delete, and update) that occurred after the resourceVersion you specified as a parameter to the watch request. The overall watch mechanism allows a client to fetch the current state and then subscribe to subsequent changes, without missing any events.
If a client watch is disconnected then that client can start a new watch from the last returned resourceVersion; the client could also perform a fresh get / list request and begin again. See Resource Version Semantics for more detail.
For example:
- List all of the pods in a given namespace.
GET /api/v1/namespaces/test/pods
---
200 OK
Content-Type: application/json
{
  "kind": "PodList",
  "apiVersion": "v1",
  "metadata": {"resourceVersion":"10245"},
  "items": [...]
}
- Starting from resource version 10245, receive notifications of any API operations (such as create, delete, apply or update) that affect Pods in the test namespace. Each change notification is a JSON document. The HTTP response body (served as application/json) consists of a series of JSON documents.
GET /api/v1/namespaces/test/pods?watch=1&resourceVersion=10245
---
200 OK
Transfer-Encoding: chunked
Content-Type: application/json
{
  "type": "ADDED",
  "object": {"kind": "Pod", "apiVersion": "v1", "metadata": {"resourceVersion": "10596", ...}, ...}
}
{
  "type": "MODIFIED",
  "object": {"kind": "Pod", "apiVersion": "v1", "metadata": {"resourceVersion": "11020", ...}, ...}
}
...
A given Kubernetes server will only preserve a historical record of changes for a limited time. Clusters using etcd 3 preserve changes in the last 5 minutes by default. When the requested watch operations fail because the historical version of that resource is not available, clients must handle the case by recognizing the status code 410 Gone, clearing their local cache, performing a new get or list operation, and starting the watch from the resourceVersion that was returned.
For subscribing to collections, Kubernetes client libraries typically offer some form of standard tool for this list-then-watch logic. (In the Go client library, this is called a Reflector and is located in the k8s.io/client-go/tools/cache package.)
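As a rough illustration only, here is a minimal Go sketch of that list-then-watch pattern using client-go directly, assuming an in-cluster configuration and a namespace named test; production code should normally use a Reflector or informer instead:
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	// 1. List to obtain the current state and a collection resourceVersion.
	pods, err := clientset.CoreV1().Pods("test").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	// 2. Watch for changes that occurred after that resourceVersion.
	w, err := clientset.CoreV1().Pods("test").Watch(context.TODO(), metav1.ListOptions{
		ResourceVersion: pods.ResourceVersion,
	})
	if err != nil {
		panic(err)
	}
	defer w.Stop()
	for event := range w.ResultChan() {
		// Each event corresponds to an ADDED/MODIFIED/DELETED notification.
		fmt.Printf("%s\n", event.Type)
	}
}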
Watch bookmarks
To mitigate the impact of the short history window, the Kubernetes API provides a watch event named BOOKMARK. It is a special kind of event to mark that all changes up to a given resourceVersion the client is requesting have already been sent. The document representing the BOOKMARK event is of the type requested by the request, but only includes a .metadata.resourceVersion field. For example:
GET /api/v1/namespaces/test/pods?watch=1&resourceVersion=10245&allowWatchBookmarks=true
---
200 OK
Transfer-Encoding: chunked
Content-Type: application/json
{
"type": "ADDED",
"object": {"kind": "Pod", "apiVersion": "v1", "metadata": {"resourceVersion": "10596", ...}, ...}
}
...
{
"type": "BOOKMARK",
"object": {"kind": "Pod", "apiVersion": "v1", "metadata": {"resourceVersion": "12746"} }
}
As a client, you can request BOOKMARK events by setting the allowWatchBookmarks=true query parameter to a watch request, but you shouldn't assume bookmarks are returned at any specific interval, nor can clients assume that the API server will send any BOOKMARK event even when requested.
Retrieving large results sets in chunks
Kubernetes v1.9 [beta]
On large clusters, retrieving the collection of some resource types may result in very large responses that can impact the server and client. For instance, a cluster may have tens of thousands of Pods, each of which is equivalent to roughly 2 KiB of encoded JSON. Retrieving all pods across all namespaces may result in a very large response (10-20MB) and consume a large amount of server resources.
Provided that you don't explicitly disable the APIListChunking feature gate, the Kubernetes API server supports the ability to break a single large collection request into many smaller chunks while preserving the consistency of the total request. Each chunk can be returned sequentially, which reduces the total size of the request and allows user-oriented clients to display results incrementally to improve responsiveness.
You can request that the API server handles a list by serving a single collection using pages (which Kubernetes calls chunks). To retrieve a single collection in chunks, two query parameters limit and continue are supported on requests against collections, and a response field continue is returned from all list operations in the collection's metadata field. A client should specify the maximum results they wish to receive in each chunk with limit, and the server will return up to limit resources in the result and include a continue value if there are more resources in the collection.
As an API client, you can then pass this continue value to the API server on the next request, to instruct the server to return the next page (chunk) of results. By continuing until the server returns an empty continue value, you can retrieve the entire collection.
Like a watch operation, a continue token will expire after a short amount of time (by default 5 minutes) and return a 410 Gone if more results cannot be returned. In this case, the client will need to start from the beginning or omit the limit parameter.
For example, if there are 1,253 pods on the cluster and you want to receive chunks of 500 pods at a time, request those chunks as follows:
- List all of the pods on a cluster, retrieving up to 500 pods each time.
GET /api/v1/pods?limit=500
---
200 OK
Content-Type: application/json
{
  "kind": "PodList",
  "apiVersion": "v1",
  "metadata": {
    "resourceVersion":"10245",
    "continue": "ENCODED_CONTINUE_TOKEN",
    ...
  },
  "items": [...] // returns pods 1-500
}
- Continue the previous call, retrieving the next set of 500 pods.
GET /api/v1/pods?limit=500&continue=ENCODED_CONTINUE_TOKEN
---
200 OK
Content-Type: application/json
{
  "kind": "PodList",
  "apiVersion": "v1",
  "metadata": {
    "resourceVersion":"10245",
    "continue": "ENCODED_CONTINUE_TOKEN_2",
    ...
  },
  "items": [...] // returns pods 501-1000
}
- Continue the previous call, retrieving the last 253 pods.
GET /api/v1/pods?limit=500&continue=ENCODED_CONTINUE_TOKEN_2
---
200 OK
Content-Type: application/json
{
  "kind": "PodList",
  "apiVersion": "v1",
  "metadata": {
    "resourceVersion":"10245",
    "continue": "", // continue token is empty because we have reached the end of the list
    ...
  },
  "items": [...] // returns pods 1001-1253
}
Notice that the resourceVersion of the collection remains constant across each request, indicating the server is showing you a consistent snapshot of the pods. Pods that are created, updated, or deleted after version 10245 would not be shown unless you make a separate list request without the continue token. This allows you to break large requests into smaller chunks and then perform a watch operation on the full set without missing any updates.
remainingItemCount is the number of subsequent items in the collection that are not included in this response. If the list request contained label or field selectors, then the number of remaining items is unknown and the API server does not include a remainingItemCount field in its response.
If the list is complete (either because it is not chunking, or because this is the last chunk), then there are no more remaining items and the API server does not include a remainingItemCount field in its response. The intended use of the remainingItemCount is estimating the size of a collection.
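As a rough sketch of the limit/continue loop described above (assuming a client-go clientset constructed as in the earlier example; not a definitive implementation):
package chunks

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// listAllPods retrieves every pod across all namespaces in chunks of 500,
// following the continue token until it comes back empty.
func listAllPods(ctx context.Context, clientset kubernetes.Interface) error {
	opts := metav1.ListOptions{Limit: 500}
	for {
		list, err := clientset.CoreV1().Pods("").List(ctx, opts)
		if err != nil {
			return err // a production client should also handle 410 Gone here
		}
		fmt.Printf("got %d pods in this chunk\n", len(list.Items))
		if list.Continue == "" {
			return nil // empty continue token: end of the collection
		}
		opts.Continue = list.Continue
	}
}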
Collections
In Kubernetes terminology, the response you get from a list is a collection. However, Kubernetes defines concrete kinds for collections of different types of resource. Collections have a kind named for the resource kind, with List appended.
When you query the API for a particular type, all items returned by that query are of that type. For example, when you list Services, the collection response has kind set to ServiceList; each item in that collection represents a single Service. For example:
GET /api/v1/services
{
  "kind": "ServiceList",
  "apiVersion": "v1",
  "metadata": {
    "resourceVersion": "2947301"
  },
  "items": [
    {
      "metadata": {
        "name": "kubernetes",
        "namespace": "default",
        ...
    },
    {
      "metadata": {
        "name": "kube-dns",
        "namespace": "kube-system",
        ...
There are dozens of collection types (such as PodList, ServiceList, and NodeList) defined in the Kubernetes API. You can get more information about each collection type from the Kubernetes API documentation.
Some tools, such as kubectl, represent the Kubernetes collection mechanism slightly differently from the Kubernetes API itself. Because the output of kubectl might include the response from multiple list operations at the API level, kubectl represents a list of items using kind: List. For example:
kubectl get services -A -o yaml
apiVersion: v1
kind: List
metadata:
resourceVersion: ""
selfLink: ""
items:
- apiVersion: v1
kind: Service
metadata:
creationTimestamp: "2021-06-03T14:54:12Z"
labels:
component: apiserver
provider: kubernetes
name: kubernetes
namespace: default
...
- apiVersion: v1
kind: Service
metadata:
annotations:
prometheus.io/port: "9153"
prometheus.io/scrape: "true"
creationTimestamp: "2021-06-03T14:54:14Z"
labels:
k8s-app: kube-dns
kubernetes.io/cluster-service: "true"
kubernetes.io/name: CoreDNS
name: kube-dns
namespace: kube-system
Keep in mind that the Kubernetes API does not have a kind named List. kind: List is a client-side, internal implementation detail for processing collections that might be of different kinds of object. Avoid depending on kind: List in automation or other code.
Receiving resources as Tables
When you run kubectl get, the default output format is a simple tabular representation of one or more instances of a particular resource type. In the past, clients were required to reproduce the tabular and describe output implemented in kubectl to perform simple lists of objects.
A few limitations of that approach include non-trivial logic when dealing with certain objects. Additionally, types provided by API aggregation or third party resources are not known at compile time. This means that generic implementations had to be in place for types unrecognized by a client.
In order to avoid potential limitations as described above, clients may request the Table representation of objects, delegating specific details of printing to the server. The Kubernetes API implements standard HTTP content type negotiation: passing an Accept header containing a value of application/json;as=Table;g=meta.k8s.io;v=v1 with a GET call will request that the server return objects in the Table content type.
For example, list all of the pods on a cluster in the Table format.
GET /api/v1/pods
Accept: application/json;as=Table;g=meta.k8s.io;v=v1
---
200 OK
Content-Type: application/json
{
"kind": "Table",
"apiVersion": "meta.k8s.io/v1",
...
"columnDefinitions": [
...
]
}
For API resource types that do not have a custom Table definition known to the control plane, the API server returns a default Table response that consists of the resource's name and creationTimestamp fields.
GET /apis/crd.example.com/v1alpha1/namespaces/default/resources
---
200 OK
Content-Type: application/json
...
{
"kind": "Table",
"apiVersion": "meta.k8s.io/v1",
...
"columnDefinitions": [
{
"name": "Name",
"type": "string",
...
},
{
"name": "Created At",
"type": "date",
...
}
]
}
Not all API resource types support a Table response; for example, a CustomResourceDefinition might not define field-to-table mappings, and an APIService that extends the core Kubernetes API might not serve Table responses at all. If you are implementing a client that uses the Table information and must work against all resource types, including extensions, you should make requests that specify multiple content types in the Accept header. For example:
Accept: application/json;as=Table;g=meta.k8s.io;v=v1, application/json
Alternate representations of resources
By default, Kubernetes returns objects serialized to JSON with content type application/json. This is the default serialization format for the API. However, clients may request the more efficient Protobuf representation of these objects for better performance at scale.
The Kubernetes API implements standard HTTP content type negotiation: passing an Accept header with a GET call will request that the server tries to return a response in your preferred media type, while sending an object in Protobuf to the server for a PUT or POST call means that you must set the Content-Type header appropriately.
The server will return a response with a Content-Type header if the requested format is supported, or the 406 Not acceptable error if none of the media types you requested are supported. All built-in resource types support the application/json media type.
See the Kubernetes API reference for a list of supported content types for each API.
For example:
- List all of the pods on a cluster in Protobuf format.
GET /api/v1/pods
Accept: application/vnd.kubernetes.protobuf
---
200 OK
Content-Type: application/vnd.kubernetes.protobuf

... binary encoded PodList object
- Create a pod by sending Protobuf encoded data to the server, but request a response in JSON.
POST /api/v1/namespaces/test/pods
Content-Type: application/vnd.kubernetes.protobuf
Accept: application/json

... binary encoded Pod object
---
200 OK
Content-Type: application/json
{
  "kind": "Pod",
  "apiVersion": "v1",
  ...
}
Not all API resource types support Protobuf; specifically, Protobuf isn't available for resources that are defined as CustomResourceDefinitions or are served via the aggregation layer.
As a client, if you might need to work with extension types you should specify multiple content types in the request Accept header to support fallback to JSON. For example:
Accept: application/vnd.kubernetes.protobuf, application/json
Kubernetes Protobuf encoding
Kubernetes uses an envelope wrapper to encode Protobuf responses. That wrapper starts with a 4 byte magic number to help identify content in disk or in etcd as Protobuf (as opposed to JSON), and then is followed by a Protobuf encoded wrapper message, which describes the encoding and type of the underlying object and then contains the object.
The wrapper format is:
A four byte magic number prefix:
Bytes 0-3: "k8s\x00" [0x6b, 0x38, 0x73, 0x00]
An encoded Protobuf message with the following IDL:
message Unknown {
// typeMeta should have the string values for "kind" and "apiVersion" as set on the JSON object
optional TypeMeta typeMeta = 1;
// raw will hold the complete serialized object in protobuf. See the protobuf definitions in the client libraries for a given kind.
optional bytes raw = 2;
// contentEncoding is encoding used for the raw data. Unspecified means no encoding.
optional string contentEncoding = 3;
// contentType is the serialization method used to serialize 'raw'. Unspecified means application/vnd.kubernetes.protobuf and is usually
// omitted.
optional string contentType = 4;
}
message TypeMeta {
// apiVersion is the group/version for this type
optional string apiVersion = 1;
// kind is the name of the object schema. A protobuf definition should exist for this object.
optional string kind = 2;
}
Note: Clients that receive a response in application/vnd.kubernetes.protobuf that does not match the expected prefix should reject the response, as future versions may need to alter the serialization format in an incompatible way and will do so by changing the prefix.
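A minimal Go sketch of that client-side prefix check (a hypothetical helper, shown only to make the rule concrete):
package protocheck

import "bytes"

// k8sProtobufPrefix is the 4-byte magic number described above: "k8s\x00".
var k8sProtobufPrefix = []byte{0x6b, 0x38, 0x73, 0x00}

// hasKubernetesProtobufPrefix reports whether a response body begins with the
// expected envelope prefix; callers should reject bodies where it doesn't.
func hasKubernetesProtobufPrefix(body []byte) bool {
	return len(body) >= len(k8sProtobufPrefix) &&
		bytes.Equal(body[:len(k8sProtobufPrefix)], k8sProtobufPrefix)
}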
Resource deletion
When you delete a resource this takes place in two phases:
- finalization
- removal
For example, here is an object that has finalizers set but has not yet been marked for deletion:
{
  "kind": "ConfigMap",
  "apiVersion": "v1",
  "metadata": {
    "finalizers": ["url.io/neat-finalization", "other-url.io/my-finalizer"],
    "deletionTimestamp": null
  }
}
When a client first sends a delete to request removal of a resource, the .metadata.deletionTimestamp is set to the current time. Once the .metadata.deletionTimestamp is set, external controllers that act on finalizers may start performing their cleanup work at any time, in any order. Order is not enforced between finalizers, because it would introduce significant risk of stuck .metadata.finalizers.
The .metadata.finalizers field is shared: any actor with permission can reorder it. If the finalizer list were processed in order, then this might lead to a situation in which the component responsible for the first finalizer in the list is waiting for some signal (field value, external system, or other) produced by a component responsible for a finalizer later in the list, resulting in a deadlock.
Without enforced ordering, finalizers are free to order amongst themselves and are not vulnerable to ordering changes in the list.
Once the last finalizer is removed, the resource is actually removed from etcd.
Single resource API
The Kubernetes API verbs get, create, apply, update, patch, delete and proxy support single resources only. These verbs with single resource support have no support for submitting multiple resources together in an ordered or unordered list or transaction.
When clients (including kubectl) act on a set of resources, the client makes a series of single-resource API requests, then aggregates the responses if needed.
By contrast, the Kubernetes API verbs list and watch allow getting multiple resources, and deletecollection allows deleting multiple resources.
Dry-run
Kubernetes v1.18 [stable]
When you use HTTP verbs that can modify resources (POST, PUT, PATCH, and DELETE), you can submit your request in a dry run mode. Dry run mode helps to evaluate a request through the typical request stages (admission chain, validation, merge conflicts) up until persisting objects to storage. The response body for the request is as close as possible to a non-dry-run response. Kubernetes guarantees that dry-run requests will not be persisted in storage or have any other side effects.
Make a dry-run request
Dry-run is triggered by setting the dryRun query parameter. This parameter is a string, working as an enum, and the only accepted values are:
- [no value set]: Allow side effects. You request this with a query string such as ?dryRun or ?dryRun&pretty=true. The response is the final object that would have been persisted, or an error if the request could not be fulfilled.
- All: Every stage runs as normal, except for the final storage stage where side effects are prevented.
When you set ?dryRun=All, any relevant admission controllers are run, validating admission controllers check the request post-mutation, merge is performed on PATCH, fields are defaulted, and schema validation occurs. The changes are not persisted to the underlying storage, but the final object which would have been persisted is still returned to the user, along with the normal status code.
If the non-dry-run version of a request would trigger an admission controller that has side effects, the request will be failed rather than risk an unwanted side effect. All built-in admission control plugins support dry-run. Additionally, admission webhooks can declare in their configuration object that they do not have side effects, by setting their sideEffects field to None.
Note: If a webhook actually does have side effects, then the sideEffects field should be set to "NoneOnDryRun". That change is appropriate provided that the webhook is also modified to understand the DryRun field in AdmissionReview, and to prevent side effects on any request marked as a dry run.
Here is an example dry-run request that uses ?dryRun=All:
POST /api/v1/namespaces/test/pods?dryRun=All
Content-Type: application/json
Accept: application/json
The response would look the same as for non-dry-run request, but the values of some generated fields may differ.
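With kubectl, you can request the same server-side dry run by adding the --dry-run=server flag; for example (pod.yaml is a placeholder for whatever manifest you want to evaluate):
kubectl apply -f pod.yaml --dry-run=server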
Generated values
Some values of an object are typically generated before the object is persisted. It is important not to rely upon the values of these fields set by a dry-run request, since these values will likely be different in dry-run mode from when the real request is made. Some of these fields are:
- name: if generateName is set, name will have a unique random name
- creationTimestamp / deletionTimestamp: records the time of creation/deletion
- UID: uniquely identifies the object and is randomly generated (non-deterministic)
- resourceVersion: tracks the persisted version of the object
- Any field set by a mutating admission controller
- For the Service resource: ports or IP addresses that the kube-apiserver assigns to Service objects
Dry-run authorization
Authorization for dry-run and non-dry-run requests is identical. Thus, to make a dry-run request, you must be authorized to make the non-dry-run request.
For example, to run a dry-run patch for a Deployment, you must be authorized to perform that patch. Here is an example of a rule for Kubernetes RBAC that allows patching Deployments:
rules:
- apiGroups: ["apps"]
resources: ["deployments"]
verbs: ["patch"]
Server Side Apply
Kubernetes' Server Side Apply feature allows the control plane to track managed fields for newly created objects. Server Side Apply provides a clear pattern for managing field conflicts, offers server-side Apply and Update operations, and replaces the client-side functionality of kubectl apply.
The API verb for Server-Side Apply is apply. See Server Side Apply for more details.
Resource versions
Resource versions are strings that identify the server's internal version of an object. Resource versions can be used by clients to determine when objects have changed, or to express data consistency requirements when getting, listing and watching resources. Resource versions must be treated as opaque by clients and passed unmodified back to the server.
You must not assume resource versions are numeric or collatable. API clients may only compare two resource versions for equality (this means that you must not compare resource versions for greater-than or less-than relationships).
resourceVersion fields in metadata
Clients find resource versions in resources, including the resources from the response stream for a watch, or when using list to enumerate resources.
v1.meta/ObjectMeta - The metadata.resourceVersion of a resource instance identifies the resource version the instance was last modified at.
v1.meta/ListMeta - The metadata.resourceVersion of a resource collection (the response to a list) identifies the resource version at which the collection was constructed.
resourceVersion parameters in query strings
The get, list, and watch operations support the resourceVersion parameter. From version v1.19, Kubernetes API servers also support the resourceVersionMatch parameter on list requests.
The API server interprets the resourceVersion parameter differently depending on the operation you request, and on the value of resourceVersion. If you set resourceVersionMatch then this also affects the way matching happens.
Semantics for get and list
For get and list, the semantics of resourceVersion are:
get:
resourceVersion unset | resourceVersion="0" | resourceVersion="{value other than 0}" |
---|---|---|
Most Recent | Any | Not older than |
list:
From version v1.19, Kubernetes API servers support the resourceVersionMatch parameter on list requests. If you set both resourceVersion and resourceVersionMatch, the resourceVersionMatch parameter determines how the API server interprets resourceVersion.
You should always set the resourceVersionMatch parameter when setting resourceVersion on a list request. However, be prepared to handle the case where the API server that responds is unaware of resourceVersionMatch and ignores it.
Unless you have strong consistency requirements, using resourceVersionMatch=NotOlderThan and a known resourceVersion is preferable, since it can achieve better performance and scalability of your cluster than leaving resourceVersion and resourceVersionMatch unset, which requires a quorum read to be served.
Setting the resourceVersionMatch parameter without setting resourceVersion is not valid.
This table explains the behavior of list requests with various combinations of resourceVersion and resourceVersionMatch:
resourceVersionMatch param | paging params | resourceVersion not set | resourceVersion="0" | resourceVersion="{value other than 0}" |
---|---|---|---|---|
unset | limit unset | Most Recent | Any | Not older than |
unset | limit=<n>, continue unset | Most Recent | Any | Exact |
unset | limit=<n>, continue=<token> | Continue Token, Exact | Invalid, treated as Continue Token, Exact | Invalid, HTTP 400 Bad Request |
resourceVersionMatch=Exact | limit unset | Invalid | Invalid | Exact |
resourceVersionMatch=Exact | limit=<n>, continue unset | Invalid | Invalid | Exact |
resourceVersionMatch=NotOlderThan | limit unset | Invalid | Any | Not older than |
resourceVersionMatch=NotOlderThan | limit=<n>, continue unset | Invalid | Any | Not older than |
Note: If your cluster's API server does not honor the resourceVersionMatch parameter, the behavior is the same as if you did not set it.
The meanings of the get and list semantics are:
- Any: Return data at any resource version. The newest available resource version is preferred, but strong consistency is not required; data at any resource version may be served. It is possible for the request to return data at a much older resource version than the client has previously observed, particularly in high availability configurations, due to partitions or stale caches. Clients that cannot tolerate this should not use this semantic.
- Most recent: Return data at the most recent resource version. The returned data must be consistent (in detail: served from etcd via a quorum read).
- Not older than: Return data at least as new as the provided resourceVersion. The newest available data is preferred, but any data not older than the provided resourceVersion may be served. For list requests to servers that honor the resourceVersionMatch parameter, this guarantees that the collection's .metadata.resourceVersion is not older than the requested resourceVersion, but does not make any guarantee about the .metadata.resourceVersion of any of the items in that collection.
- Exact: Return data at the exact resource version provided. If the provided resourceVersion is unavailable, the server responds with HTTP 410 "Gone". For list requests to servers that honor the resourceVersionMatch parameter, this guarantees that the collection's .metadata.resourceVersion is the same as the resourceVersion you requested in the query string. That guarantee does not apply to the .metadata.resourceVersion of any items within that collection.
- Continue Token, Exact: Return data at the resource version of the initial paginated list call. The returned continue tokens are responsible for keeping track of the initially provided resource version for all paginated list calls after the initial paginated list.
Note: An object's .metadata.resourceVersion tracks when that object was last updated, and not how up-to-date the object is when served.
When using resourceVersionMatch=NotOlderThan and limit is set, clients must handle HTTP 410 "Gone" responses. For example, the client might retry with a newer resourceVersion or fall back to resourceVersion="".
When using resourceVersionMatch=Exact and limit is unset, clients must verify that the collection's .metadata.resourceVersion matches the requested resourceVersion, and handle the case where it does not. For example, the client might fall back to a request with limit set.
Semantics for watch
For watch, the semantics of resource version are:
watch:
resourceVersion unset | resourceVersion="0" | resourceVersion="{value other than 0}" |
---|---|---|
Get State and Start at Most Recent | Get State and Start at Any | Start at Exact |
The meanings of those watch semantics are:
- Get State and Start at Any: Start a watch at any resource version; the most recent resource version available is preferred, but not required. Any starting resource version is allowed. It is possible for the watch to start at a much older resource version than the client has previously observed, particularly in high availability configurations, due to partitions or stale caches. Clients that cannot tolerate this apparent rewinding should not start a watch with this semantic. To establish initial state, the watch begins with synthetic "Added" events for all resource instances that exist at the starting resource version. All following watch events are for all changes that occurred after the resource version the watch started at. (Caution: watches initialized this way may return arbitrarily stale data. Please review this semantic before using it, and favor the other semantics where possible.)
- Get State and Start at Most Recent: Start a watch at the most recent resource version, which must be consistent (in detail: served from etcd via a quorum read). To establish initial state, the watch begins with synthetic "Added" events for all resource instances that exist at the starting resource version. All following watch events are for all changes that occurred after the resource version the watch started at.
- Start at Exact: Start a watch at an exact resource version. The watch events are for all changes after the provided resource version. Unlike "Get State and Start at Most Recent" and "Get State and Start at Any", the watch is not started with synthetic "Added" events for the provided resource version. The client is assumed to already have the initial state at the starting resource version since the client provided the resource version.
"410 Gone" responses
Servers are not required to serve all older resource versions and may return a HTTP 410 (Gone) status code if a client requests a resourceVersion older than the server has retained. Clients must be able to tolerate 410 (Gone) responses. See Efficient detection of changes for details on how to handle 410 (Gone) responses when watching resources.
If you request a resourceVersion outside the applicable limit then, depending on whether a request is served from cache or not, the API server may reply with a 410 Gone HTTP response.
Unavailable resource versions
Servers are not required to serve unrecognized resource versions. If you request list or get for a resource version that the API server does not recognize, then the API server may either:
- wait briefly for the resource version to become available, then time out with a 504 (Gateway Timeout) if the provided resource version does not become available in a reasonable amount of time;
- respond with a Retry-After response header indicating how many seconds a client should wait before retrying the request.
If you request a resource version that an API server does not recognize, the kube-apiserver additionally identifies its error responses with a "Too large resource version" message.
If you make a watch request for an unrecognized resource version, the API server may wait indefinitely (until the request timeout) for the resource version to become available.
6.2.2 - Server-Side Apply
Kubernetes v1.22 [stable]
Introduction
Server Side Apply helps users and controllers manage their resources through declarative configurations. Clients can create and modify their objects declaratively by sending their fully specified intent.
A fully specified intent is a partial object that only includes the fields and values for which the user has an opinion. That intent either creates a new object or is combined, by the server, with the existing object.
The system supports multiple appliers collaborating on a single object.
Changes to an object's fields are tracked through a "field management" mechanism. When a field's value changes, ownership moves from its current manager to the manager making the change. When trying to apply an object, fields that have a different value and are owned by another manager will result in a conflict. This is done in order to signal that the operation might undo another collaborator's changes. Conflicts can be forced, in which case the value will be overridden, and the ownership will be transferred.
If you remove a field from a configuration and apply the configuration, server side apply checks if there are any other field managers that also own the field. If the field is not owned by any other field managers, it is either deleted from the live object or reset to its default value, if it has one. The same rule applies to associative list or map items.
Server side apply is meant both as a replacement for the original kubectl apply and as a simpler mechanism for controllers to enact their changes.
If you have Server Side Apply enabled, the control plane tracks managed fields for all newly created objects.
Field Management
Compared to the last-applied annotation managed by kubectl, Server Side Apply uses a more declarative approach, which tracks a user's field management, rather than a user's last applied state. This means that as a side effect of using Server Side Apply, information about which field manager manages each field in an object also becomes available.
For a user to manage a field, in the Server Side Apply sense, means that the user relies on and expects the value of the field not to change. The user who last made an assertion about the value of a field will be recorded as the current field manager. This can be done either by changing the value with POST, PUT, or non-apply PATCH, or by including the field in a config sent to the Server Side Apply endpoint. When using Server-Side Apply, trying to change a field which is managed by someone else will result in a rejected request (if not forced, see Conflicts).
When two or more appliers set a field to the same value, they share ownership of that field. Any subsequent attempt to change the value of the shared field, by any of the appliers, results in a conflict. Shared field owners may give up ownership of a field by removing it from their configuration.
Field management is stored in a managedFields field that is part of an object's metadata.
A simple example of an object created by Server Side Apply could look like this:
apiVersion: v1
kind: ConfigMap
metadata:
name: test-cm
namespace: default
labels:
test-label: test
managedFields:
- manager: kubectl
operation: Apply
apiVersion: v1
time: "2010-10-10T0:00:00Z"
fieldsType: FieldsV1
fieldsV1:
f:metadata:
f:labels:
f:test-label: {}
f:data:
f:key: {}
data:
key: some value
The above object contains a single manager in metadata.managedFields. The manager consists of basic information about the managing entity itself, like operation type, API version, and the fields managed by it.
Nevertheless, it is possible to change metadata.managedFields through an Update operation. Doing so is highly discouraged, but might be a reasonable option to try if, for example, the managedFields get into an inconsistent state (which clearly should not happen).
The format of the managedFields is described in the API.
Conflicts
A conflict is a special status error that occurs when an Apply operation tries to change a field which another user also claims to manage. This prevents an applier from unintentionally overwriting the value set by another user. When this occurs, the applier has 3 options to resolve the conflicts:
- Overwrite value, become sole manager: If overwriting the value was intentional (or if the applier is an automated process like a controller) the applier should set the force query parameter to true and make the request again. This forces the operation to succeed, changes the value of the field, and removes the field from all other managers' entries in managedFields.
- Don't overwrite value, give up management claim: If the applier doesn't care about the value of the field anymore, they can remove it from their config and make the request again. This leaves the value unchanged, and causes the field to be removed from the applier's entry in managedFields.
- Don't overwrite value, become shared manager: If the applier still cares about the value of the field, but doesn't want to overwrite it, they can change the value of the field in their config to match the value of the object on the server, and make the request again. This leaves the value unchanged, and causes the field's management to be shared by the applier and all other field managers that already claimed to manage it.
Managers
Managers identify distinct workflows that are modifying the object (especially useful on conflicts!), and can be specified through the fieldManager query parameter as part of a modifying request. It is required for the apply endpoint, though kubectl will default it to kubectl. For other updates, its default is computed from the user-agent.
Apply and Update
The two operation types considered by this feature are Apply (PATCH with content type application/apply-patch+yaml) and Update (all other operations which modify the object). Both operations update the managedFields, but behave a little differently.
Note: Whether you are submitting JSON data or YAML data, use application/apply-patch+yaml as the Content-Type header value. All JSON documents are valid YAML.
For instance, only the apply operation fails on conflicts while update does not. Also, apply operations are required to identify themselves by providing a fieldManager query parameter, while the query parameter is optional for update operations. Finally, when using the apply operation you cannot have managedFields in the object that is being applied.
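As an illustrative sketch (the object and the example-manager field manager name are placeholders), an Apply request over HTTP could look like:
PATCH /api/v1/namespaces/default/configmaps/test-cm?fieldManager=example-manager
Content-Type: application/apply-patch+yaml
Accept: application/json

apiVersion: v1
kind: ConfigMap
metadata:
  name: test-cm
data:
  key: some value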
An example object with multiple managers could look like this:
apiVersion: v1
kind: ConfigMap
metadata:
name: test-cm
namespace: default
labels:
test-label: test
managedFields:
- manager: kubectl
operation: Apply
apiVersion: v1
fields:
f:metadata:
f:labels:
f:test-label: {}
- manager: kube-controller-manager
operation: Update
apiVersion: v1
time: '2019-03-30T16:00:00.000Z'
fields:
f:data:
f:key: {}
data:
key: new value
In this example, a second operation was run as an Update by the manager called kube-controller-manager. The update changed a value in the data field, which caused the field's management to change to the kube-controller-manager.
If this update had been an Apply operation, the operation would have failed due to conflicting ownership.
Merge strategy
The merging strategy, implemented with Server Side Apply, provides a generally more stable object lifecycle. Server Side Apply tries to merge fields based on the actor who manages them instead of overruling based on values. This way multiple actors can update the same object without causing unexpected interference.
When a user sends a "fully-specified intent" object to the Server Side Apply endpoint, the server merges it with the live object favoring the value in the applied config if it is specified in both places. If the set of items present in the applied config is not a superset of the items applied by the same user last time, each missing item not managed by any other appliers is removed. For more information about how an object's schema is used to make decisions when merging, see sigs.k8s.io/structured-merge-diff.
A number of markers were added in Kubernetes 1.16 and 1.17, to allow API developers to describe the merge strategy supported by lists, maps, and structs. These markers can be applied to objects of the respective type, in Go files or in the OpenAPI schema definition of the CRD:
Golang marker | OpenAPI extension | Accepted values | Description | Introduced in |
---|---|---|---|---|
//+listType | x-kubernetes-list-type | atomic/set/map | Applicable to lists. set applies to lists that include only scalar elements. These elements must be unique. map applies to lists of nested types only. The key values (see listMapKey) must be unique in the list. atomic can apply to any list. If configured as atomic, the entire list is replaced during merge. At any point in time, a single manager owns the list. If set or map, different managers can manage entries separately. | 1.16 |
//+listMapKey | x-kubernetes-list-map-keys | List of field names, e.g. ["port", "protocol"] | Only applicable when +listType=map. A list of field names whose values uniquely identify entries in the list. While there can be multiple keys, listMapKey is singular because keys need to be specified individually in the Go type. The key fields must be scalars. | 1.16 |
//+mapType | x-kubernetes-map-type | atomic/granular | Applicable to maps. atomic means that the map can only be entirely replaced by a single manager. granular means that the map supports separate managers updating individual fields. | 1.17 |
//+structType | x-kubernetes-map-type | atomic/granular | Applicable to structs; otherwise same usage and OpenAPI annotation as //+mapType. | 1.17 |
If listType is missing, the API server interprets a patchMergeStrategy=merge marker as a listType=map and the corresponding patchMergeKey marker as a listMapKey.
The atomic list type is recursive.
These markers are specified as comments and don't have to be repeated as field tags.
Compatibility across topology changes
On rare occurrences, a CRD or built-in type author may want to change the specific topology of a field in their resource without incrementing its version. Changing the topology of types, by upgrading the cluster or updating the CRD, has different consequences when updating existing objects. There are two categories of changes: when a field goes from map/set/granular to atomic, and the other way around.
When the listType, mapType, or structType changes from map/set/granular to atomic, the whole list, map, or struct of existing objects will end up being owned by actors who owned an element of these types. This means that any further change to these objects would cause a conflict.
When a list, map, or struct changes from atomic to map/set/granular, the API server won't be able to infer the new ownership of these fields. Because of that, no conflict will be produced when objects have these fields updated. For that reason, it is not recommended to change a type from atomic to map/set/granular.
Take, for example, the custom resource:
apiVersion: example.com/v1
kind: Foo
metadata:
name: foo-sample
managedFields:
- manager: manager-one
operation: Apply
apiVersion: example.com/v1
fields:
f:spec:
f:data: {}
spec:
data:
key1: val1
key2: val2
Before spec.data gets changed from atomic to granular, manager-one owns the field spec.data, and all the fields within it (key1 and key2). When the CRD gets changed to make spec.data granular, manager-one continues to own the top-level field spec.data (meaning no other managers can delete the map called data without a conflict), but it no longer owns key1 and key2, so another manager can then modify or delete those fields without conflict.
Custom Resources
By default, Server Side Apply treats custom resources as unstructured data. All keys are treated the same as struct fields, and all lists are considered atomic.
If the Custom Resource Definition defines a schema that contains annotations as defined in the previous "Merge Strategy" section, these annotations will be used when merging objects of this type.
Using Server-Side Apply in a controller
As a developer of a controller, you can use server-side apply as a way to simplify the update logic of your controller. The main differences with a read-modify-write and/or patch are the following:
- the applied object must contain all the fields that the controller cares about.
- there is no way to remove fields that haven't been applied by the controller before (controller can still send a PATCH/UPDATE for these use-cases).
- the object doesn't have to be read beforehand; resourceVersion doesn't have to be specified.
It is strongly recommended for controllers to always "force" conflicts, since they might not be able to resolve or act on these conflicts.
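A minimal sketch of that pattern using client-go's generated apply configurations; the deployment name, namespace, and the example-controller field manager are placeholders, and a real controller would build its intent from its own desired state:
package controller

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	appsv1ac "k8s.io/client-go/applyconfigurations/apps/v1"
	"k8s.io/client-go/kubernetes"
)

// applyDesiredReplicas sends the controller's fully specified intent:
// only the fields it cares about (here, spec.replicas), forcing
// conflicts as recommended above.
func applyDesiredReplicas(ctx context.Context, clientset kubernetes.Interface) error {
	intent := appsv1ac.Deployment("nginx-deployment", "default").
		WithSpec(appsv1ac.DeploymentSpec().WithReplicas(3))

	_, err := clientset.AppsV1().Deployments("default").Apply(ctx, intent,
		metav1.ApplyOptions{FieldManager: "example-controller", Force: true})
	return err
}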
Transferring Ownership
In addition to the concurrency controls provided by conflict resolution, Server Side Apply provides ways to perform coordinated field ownership transfers from users to controllers.
This is best explained by example. Let's look at how to safely transfer ownership of the replicas field from a user to a controller while enabling automatic horizontal scaling for a Deployment, using the HorizontalPodAutoscaler resource and its accompanying controller.
Say a user has defined a Deployment with replicas set to the desired value:
apiVersion: apps/v1
kind: Deployment
metadata:
name: nginx-deployment
labels:
app: nginx
spec:
replicas: 3
selector:
matchLabels:
app: nginx
template:
metadata:
labels:
app: nginx
spec:
containers:
- name: nginx
image: nginx:1.14.2
And the user has created the deployment using server side apply like so:
kubectl apply -f https://k8s.io/examples/application/ssa/nginx-deployment.yaml --server-side
Then later, HPA is enabled for the deployment, e.g.:
kubectl autoscale deployment nginx-deployment --cpu-percent=50 --min=1 --max=10
Now, the user would like to remove replicas from their configuration, so they don't accidentally fight with the HPA controller. However, there is a race: it might take some time before the HPA feels the need to adjust replicas, and if the user removes replicas before the HPA writes to the field and becomes its owner, then the apiserver will set replicas to 1, its default value. This is not what the user wants to happen, even temporarily.
There are two solutions:
- (basic) Leave replicas in the configuration; when the HPA eventually writes to that field, the system gives the user a conflict over it. At that point, it is safe to remove from the configuration.
- (more advanced) If, however, the user doesn't want to wait, for example because they want to keep the cluster legible to coworkers, then they can take the following steps to make it safe to remove replicas from their configuration:
First, the user defines a new configuration containing only the replicas field:
apiVersion: apps/v1
kind: Deployment
metadata:
name: nginx-deployment
spec:
replicas: 3
The user applies that configuration using the field manager name handover-to-hpa
:
kubectl apply -f https://k8s.io/examples/application/ssa/nginx-deployment-replicas-only.yaml \
--server-side --field-manager=handover-to-hpa \
--validate=false
If the apply results in a conflict with the HPA controller, then do nothing. The conflict indicates the controller has claimed the field earlier in the process than it sometimes does.
At this point the user may remove the replicas field from their configuration:
apiVersion: apps/v1
kind: Deployment
metadata:
name: nginx-deployment
labels:
app: nginx
spec:
selector:
matchLabels:
app: nginx
template:
metadata:
labels:
app: nginx
spec:
containers:
- name: nginx
image: nginx:1.14.2
Note that whenever the HPA controller sets the replicas field to a new value, the temporary field manager will no longer own any fields and will be automatically deleted. No clean up is required.
Transferring Ownership Between Users
Users can transfer ownership of a field between each other by setting the field to the same value in both of their applied configs, causing them to share ownership of the field. Once the users share ownership of the field, one of them can remove the field from their applied configuration to give up ownership and complete the transfer to the other user.
Comparison with Client Side Apply
A consequence of the conflict detection and resolution implemented by Server Side Apply is that an applier always has up to date field values in their local state. If they don't, they get a conflict the next time they apply. Any of the three options to resolve conflicts results in the applied configuration being an up to date subset of the object on the server's fields.
This is different from Client Side Apply, where outdated values which have been overwritten by other users are left in an applier's local config. These values only become accurate when the user updates that specific field, if ever, and an applier has no way of knowing whether their next apply will overwrite other users' changes.
Another difference is that an applier using Client Side Apply is unable to change the API version they are using, but Server Side Apply supports this use case.
Upgrading from client-side apply to server-side apply
Client-side apply users who manage a resource with kubectl apply can start using server-side apply with the following flag.
kubectl apply --server-side [--dry-run=server]
By default, field management of the object transfers from client-side apply to kubectl server-side apply without encountering conflicts.
Keep the last-applied-configuration annotation up to date. The annotation infers client-side apply's managed fields. Any fields not managed by client-side apply raise conflicts.
For example, if you used kubectl scale to update the replicas field after client-side apply, then this field is not owned by client-side apply and creates conflicts on kubectl apply --server-side.
This behavior applies to server-side apply with the kubectl field manager. As an exception, you can opt out of this behavior by specifying a different, non-default field manager, as seen in the following example. The default field manager for kubectl server-side apply is kubectl.
kubectl apply --server-side --field-manager=my-manager [--dry-run=server]
Downgrading from server-side apply to client-side apply
If you manage a resource with kubectl apply --server-side, you can downgrade to client-side apply directly with kubectl apply.
Downgrading works because kubectl server-side apply keeps the last-applied-configuration annotation up-to-date if you use kubectl apply.
This behavior applies to server-side apply with the kubectl field manager. As an exception, you can opt-out of this behavior by specifying a different, non-default field manager, as seen in the following example. The default field manager for kubectl server-side apply is kubectl.
kubectl apply --server-side --field-manager=my-manager [--dry-run=server]
API Endpoint
With the Server Side Apply feature enabled, the PATCH endpoint accepts the additional application/apply-patch+yaml content type. Users of Server Side Apply can send partially specified objects as YAML to this endpoint. When applying a configuration, one should always include all the fields that they have an opinion about.
Clearing ManagedFields
It is possible to strip all managedFields from an object by overwriting them using MergePatch, StrategicMergePatch, JSONPatch, or Update; that is, through any non-apply operation. This can be done by overwriting the managedFields field with an empty entry. Two examples are:
PATCH /api/v1/namespaces/default/configmaps/example-cm
Content-Type: application/merge-patch+json
Accept: application/json
Data: {"metadata":{"managedFields": [{}]}}
PATCH /api/v1/namespaces/default/configmaps/example-cm
Content-Type: application/json-patch+json
Accept: application/json
Data: [{"op": "replace", "path": "/metadata/managedFields", "value": [{}]}]
This will overwrite the managedFields with a list containing a single empty entry that then results in the managedFields being stripped entirely from the object. Note that setting the managedFields to an empty list will not reset the field. This is on purpose, so managedFields never get stripped by clients not aware of the field.
In cases where the reset operation is combined with changes to fields other than the managedFields, this will result in the managedFields being reset first and the other changes being processed afterwards. As a result, the applier takes ownership of any fields updated in the same request.
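If you prefer kubectl over raw HTTP requests, a sketch of the same merge-patch reset against the example ConfigMap above:
kubectl patch configmap example-cm --type=merge -p '{"metadata":{"managedFields":[{}]}}'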
6.2.3 - Client Libraries
This page contains an overview of the client libraries for using the Kubernetes API from various programming languages.
To write applications using the Kubernetes REST API, you do not need to implement the API calls and request/response types yourself. You can use a client library for the programming language you are using.
Client libraries often handle common tasks such as authentication for you. Most client libraries can discover and use the Kubernetes Service Account to authenticate if the API client is running inside the Kubernetes cluster, or can understand the kubeconfig file format to read the credentials and the API Server address.
Officially-supported Kubernetes client libraries
The following client libraries are officially maintained by Kubernetes SIG API Machinery.
Language | Client Library | Sample Programs |
---|---|---|
dotnet | github.com/kubernetes-client/csharp | browse |
Go | github.com/kubernetes/client-go/ | browse |
Haskell | github.com/kubernetes-client/haskell | browse |
Java | github.com/kubernetes-client/java | browse |
JavaScript | github.com/kubernetes-client/javascript | browse |
Python | github.com/kubernetes-client/python/ | browse |
Community-maintained client libraries
The following Kubernetes API client libraries are provided and maintained by their authors, not the Kubernetes team.
6.2.4 - Kubernetes Deprecation Policy
This document details the deprecation policy for various facets of the system.
Kubernetes is a large system with many components and many contributors. As with any such software, the feature set naturally evolves over time, and sometimes a feature may need to be removed. This could include an API, a flag, or even an entire feature. To avoid breaking existing users, Kubernetes follows a deprecation policy for aspects of the system that are slated to be removed.
Deprecating parts of the API
Since Kubernetes is an API-driven system, the API has evolved over time to reflect the evolving understanding of the problem space. The Kubernetes API is actually a set of APIs, called "API groups", and each API group is independently versioned. API versions fall into 3 main tracks, each of which has different policies for deprecation:
Example | Track |
---|---|
v1 | GA (generally available, stable) |
v1beta1 | Beta (pre-release) |
v1alpha1 | Alpha (experimental) |
A given release of Kubernetes can support any number of API groups and any number of versions of each.
The following rules govern the deprecation of elements of the API. This includes:
- REST resources (aka API objects)
- Fields of REST resources
- Annotations on REST resources, including "beta" annotations but not including "alpha" annotations.
- Enumerated or constant values
- Component config structures
These rules are enforced between official releases, not between arbitrary commits to master or release branches.
Rule #1: API elements may only be removed by incrementing the version of the API group.
Once an API element has been added to an API group at a particular version, it can not be removed from that version or have its behavior significantly changed, regardless of track.
Rule #2: API objects must be able to round-trip between API versions in a given release without information loss, with the exception of whole REST resources that do not exist in some versions.
For example, an object can be written as v1 and then read back as v2 and converted to v1, and the resulting v1 resource will be identical to the original. The representation in v2 might be different from v1, but the system knows how to convert between them in both directions. Additionally, any new field added in v2 must be able to round-trip to v1 and back, which means v1 might have to add an equivalent field or represent it as an annotation.
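As an illustrative sketch (the Widget kind and widgetMode field are hypothetical, not a real API), a field that exists only in v2 might survive a round-trip through v1 as an annotation:
# hypothetical v1 representation of an object whose v2 schema adds a widgetMode field
apiVersion: example.k8s.io/v1
kind: Widget
metadata:
  name: my-widget
  annotations:
    # carries the v2-only field so converting v2 -> v1 -> v2 loses no information
    widget.example.k8s.io/v2.widgetMode: "fast"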
Rule #3: An API version in a given track may not be deprecated in favor of a less stable API version.
- GA API versions can replace beta and alpha API versions.
- Beta API versions can replace earlier beta and alpha API versions, but may not replace GA API versions.
- Alpha API versions can replace earlier alpha API versions, but may not replace GA or beta API versions.
Rule #4a: minimum API lifetime is determined by the API stability level
- GA API versions may be marked as deprecated, but must not be removed within a major version of Kubernetes
- Beta API versions must be supported for 9 months or 3 releases (whichever is longer) after deprecation
- Alpha API versions may be removed in any release without prior deprecation notice
This ensures beta API support covers the maximum supported version skew of 2 releases.
Rule #4b: The "preferred" API version and the "storage version" for a given group may not advance until after a release has been made that supports both the new version and the previous version
Users must be able to upgrade to a new release of Kubernetes and then roll back to a previous release, without converting anything to the new API version or suffering breakages (unless they explicitly used features only available in the newer version). This is particularly evident in the stored representation of objects.
All of this is best illustrated by examples. Imagine a Kubernetes release, version X, which introduces a new API group. A new Kubernetes release is made approximately every 4 months (3 per year). The following table describes which API versions are supported in a series of subsequent releases.
Release | API Versions | Preferred/Storage Version | Notes |
---|---|---|---|
X | v1alpha1 | v1alpha1 | |
X+1 | v1alpha2 | v1alpha2 | |
X+2 | v1beta1 | v1beta1 | |
X+3 | v1beta2, v1beta1 (deprecated) | v1beta1 | |
X+4 | v1beta2, v1beta1 (deprecated) | v1beta2 | |
X+5 | v1, v1beta1 (deprecated), v1beta2 (deprecated) | v1beta2 | |
X+6 | v1, v1beta2 (deprecated) | v1 | |
X+7 | v1, v1beta2 (deprecated) | v1 | |
X+8 | v2alpha1, v1 | v1 | |
X+9 | v2alpha2, v1 | v1 | |
X+10 | v2beta1, v1 | v1 | |
X+11 | v2beta2, v2beta1 (deprecated), v1 | v1 | |
X+12 | v2, v2beta2 (deprecated), v2beta1 (deprecated), v1 (deprecated) | v1 | |
X+13 | v2, v2beta1 (deprecated), v2beta2 (deprecated), v1 (deprecated) | v2 | |
X+14 | v2, v2beta2 (deprecated), v1 (deprecated) | v2 | |
X+15 | v2, v1 (deprecated) | v2 | |
REST resources (aka API objects)
Consider a hypothetical REST resource named Widget, which was present in API v1 in the above timeline, and which needs to be deprecated. We document and announce the deprecation in sync with release X+1. The Widget resource still exists in API version v1 (deprecated) but not in v2alpha1. The Widget resource continues to exist and function in releases up to and including X+8. Only in release X+9, when API v1 has aged out, does the Widget resource cease to exist, and the behavior get removed.
Starting in Kubernetes v1.19, making an API request to a deprecated REST API endpoint:
- Returns a Warning header (as defined in RFC7234, Section 5.5) in the API response.
- Adds a "k8s.io/deprecated":"true" annotation to the audit event recorded for the request.
- Sets an apiserver_requested_deprecated_apis gauge metric to 1 in the kube-apiserver process. The metric has labels for group, version, resource, and subresource that can be joined to the apiserver_request_total metric, and a removed_release label that indicates the Kubernetes release in which the API will no longer be served. The following Prometheus query returns information about requests made to deprecated APIs which will be removed in v1.22:
apiserver_requested_deprecated_apis{removed_release="1.22"} * on(group,version,resource,subresource) group_right() apiserver_request_total
Fields of REST resources
As with whole REST resources, an individual field which was present in API v1 must exist and function until API v1 is removed. Unlike whole resources, the v2 APIs may choose a different representation for the field, as long as it can be round-tripped. For example, a v1 field named "magnitude" which was deprecated might be named "deprecatedMagnitude" in API v2. When v1 is eventually removed, the deprecated field can be removed from v2.
Enumerated or constant values
As with whole REST resources and fields thereof, a constant value which was supported in API v1 must exist and function until API v1 is removed.
Component config structures
Component configs are versioned and managed similarly to REST resources.
Future work
Over time, Kubernetes will introduce more fine-grained API versions, at which point these rules will be adjusted as needed.
Deprecating a flag or CLI
The Kubernetes system comprises several different programs cooperating. Sometimes, a Kubernetes release might remove flags or CLI commands (collectively "CLI elements") in these programs. The individual programs naturally sort into two main groups - user-facing and admin-facing programs, which vary slightly in their deprecation policies. Unless a flag is explicitly prefixed or documented as "alpha" or "beta", it is considered GA.
CLI elements are effectively part of the API to the system, but since they are not versioned in the same way as the REST API, the rules for deprecation are as follows:
Rule #5a: CLI elements of user-facing components (e.g. kubectl) must function after their announced deprecation for no less than:
- GA: 12 months or 2 releases (whichever is longer)
- Beta: 3 months or 1 release (whichever is longer)
- Alpha: 0 releases
Rule #5b: CLI elements of admin-facing components (e.g. kubelet) must function after their announced deprecation for no less than:
- GA: 6 months or 1 release (whichever is longer)
- Beta: 3 months or 1 release (whichever is longer)
- Alpha: 0 releases
Rule #6: Deprecated CLI elements must emit warnings (which can optionally be disabled) when used.
Deprecating a feature or behavior
Occasionally a Kubernetes release needs to deprecate some feature or behavior of the system that is not controlled by the API or CLI. In this case, the rules for deprecation are as follows:
Rule #7: Deprecated behaviors must function for no less than 1 year after their announced deprecation.
This does not imply that all changes to the system are governed by this policy. This applies only to significant, user-visible behaviors which impact the correctness of applications running on Kubernetes or that impact the administration of Kubernetes clusters, and which are being removed entirely.
An exception to the above rule is feature gates. Feature gates are key=value pairs that allow for users to enable/disable experimental features.
Feature gates are intended to cover the development life cycle of a feature - they are not intended to be long-term APIs. As such, they are expected to be deprecated and removed after a feature becomes GA or is dropped.
As a feature moves through the stages, the associated feature gate evolves. The feature life cycle matched to its corresponding feature gate is:
- Alpha: the feature gate is disabled by default and can be enabled by the user.
- Beta: the feature gate is enabled by default and can be disabled by the user.
- GA: the feature gate is deprecated (see "Deprecation") and becomes non-operational.
- GA, deprecation window complete: the feature gate is removed and calls to it are no longer accepted.
Deprecation
Features can be removed at any point in the life cycle prior to GA. When features are removed prior to GA, their associated feature gates are also deprecated.
When an invocation tries to disable a non-operational feature gate, the call fails in order to avoid unsupported scenarios that might otherwise run silently.
In some cases, removing pre-GA features requires considerable time. Feature gates can remain operational until their associated feature is fully removed, at which point the feature gate itself can be deprecated.
When removing a feature gate for a GA feature also requires considerable time, calls to feature gates may remain operational if the feature gate has no effect on the feature, and if the feature gate causes no errors.
Features intended to be disabled by users should include a mechanism for disabling the feature in the associated feature gate.
Versioning for feature gates is different from the previously discussed components, therefore the rules for deprecation are as follows:
Rule #8: Feature gates must be deprecated when the corresponding feature they control transitions a lifecycle stage as follows. Feature gates must function for no less than:
- Beta feature to GA: 6 months or 2 releases (whichever is longer)
- Beta feature to EOL: 3 months or 1 release (whichever is longer)
- Alpha feature to EOL: 0 releases
Rule #9: Deprecated feature gates must respond with a warning when used. When a feature gate is deprecated it must be documented in both the release notes and the corresponding CLI help. Both warnings and documentation must indicate whether a feature gate is non-operational.
Deprecating a metric
Each component of the Kubernetes control plane exposes metrics (usually via the /metrics endpoint), which are typically ingested by cluster administrators.
Not all metrics are the same: some metrics are commonly used as SLIs or used to determine SLOs, and these tend to have greater import. Other metrics are more experimental in nature or are used primarily in the Kubernetes development process.
Accordingly, metrics fall under two stability classes (ALPHA and STABLE); this impacts removal of a metric during a Kubernetes release. These classes are determined by the perceived importance of the metric. The rules for deprecating and removing a metric are as follows:
Rule #9a: Metrics, for the corresponding stability class, must function for no less than:
- STABLE: 4 releases or 12 months (whichever is longer)
- ALPHA: 0 releases
Rule #9b: Metrics, after their announced deprecation, must function for no less than:
- STABLE: 3 releases or 9 months (whichever is longer)
- ALPHA: 0 releases
Deprecated metrics will have their description text prefixed with a deprecation notice string '(Deprecated from x.y)' and a warning log will be emitted during metric registration. Like their stable undeprecated counterparts, deprecated metrics will be automatically registered to the metrics endpoint and therefore visible.
On a subsequent release (when the metric's deprecatedVersion is equal to current_kubernetes_version - 3), a deprecated metric will become a hidden metric.
Unlike their deprecated counterparts, hidden metrics will no longer be automatically registered to the metrics endpoint (hence hidden). However, they can be explicitly enabled through a command line flag on the binary (--show-hidden-metrics-for-version=). This provides cluster admins an escape hatch to properly migrate off of a deprecated metric, if they were not able to react to the earlier deprecation warnings. Hidden metrics should be deleted after one release.
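For example, an admin could temporarily re-surface metrics hidden in the current release with a flag like the following (the version value is illustrative):
kube-apiserver --show-hidden-metrics-for-version=1.16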
Exceptions
No policy can cover every possible situation. This policy is a living document, and will evolve over time. In practice, there will be situations that do not fit neatly into this policy, or for which this policy becomes a serious impediment. Such situations should be discussed with SIGs and project leaders to find the best solutions for those specific cases, always bearing in mind that Kubernetes is committed to being a stable system that, as much as possible, never breaks users. Exceptions will always be announced in all relevant release notes.
6.2.5 - Deprecated API Migration Guide
As the Kubernetes API evolves, APIs are periodically reorganized or upgraded. When APIs evolve, the old API is deprecated and eventually removed. This page contains information you need to know when migrating from deprecated API versions to newer and more stable API versions.
Removed APIs by release
v1.26
The v1.26 release will stop serving the following deprecated API versions:
Flow control resources
The flowcontrol.apiserver.k8s.io/v1beta1 API version of FlowSchema and PriorityLevelConfiguration will no longer be served in v1.26.
- Migrate manifests and API clients to use the flowcontrol.apiserver.k8s.io/v1beta2 API version, available since v1.23.
- All existing persisted objects are accessible via the new API
- No notable changes
HorizontalPodAutoscaler
The autoscaling/v2beta2 API version of HorizontalPodAutoscaler will no longer be served in v1.26.
- Migrate manifests and API clients to use the autoscaling/v2 API version, available since v1.23.
- All existing persisted objects are accessible via the new API
v1.25
The v1.25 release will stop serving the following deprecated API versions:
CronJob
The batch/v1beta1 API version of CronJob will no longer be served in v1.25.
- Migrate manifests and API clients to use the batch/v1 API version, available since v1.21.
- All existing persisted objects are accessible via the new API
- No notable changes
EndpointSlice
The discovery.k8s.io/v1beta1 API version of EndpointSlice will no longer be served in v1.25.
- Migrate manifests and API clients to use the discovery.k8s.io/v1 API version, available since v1.21.
- All existing persisted objects are accessible via the new API
- Notable changes in discovery.k8s.io/v1:
  - use per Endpoint nodeName field instead of deprecated topology["kubernetes.io/hostname"] field
  - use per Endpoint zone field instead of deprecated topology["topology.kubernetes.io/zone"] field
  - topology is replaced with the deprecatedTopology field which is not writable in v1
Event
The events.k8s.io/v1beta1 API version of Event will no longer be served in v1.25.
- Migrate manifests and API clients to use the events.k8s.io/v1 API version, available since v1.19.
- All existing persisted objects are accessible via the new API
- Notable changes in events.k8s.io/v1:
  - type is limited to Normal and Warning
  - involvedObject is renamed to regarding
  - action, reason, reportingController, and reportingInstance are required when creating new events.k8s.io/v1 Events
  - use eventTime instead of the deprecated firstTimestamp field (which is renamed to deprecatedFirstTimestamp and not permitted in new events.k8s.io/v1 Events)
  - use series.lastObservedTime instead of the deprecated lastTimestamp field (which is renamed to deprecatedLastTimestamp and not permitted in new events.k8s.io/v1 Events)
  - use series.count instead of the deprecated count field (which is renamed to deprecatedCount and not permitted in new events.k8s.io/v1 Events)
  - use reportingController instead of the deprecated source.component field (which is renamed to deprecatedSource.component and not permitted in new events.k8s.io/v1 Events)
  - use reportingInstance instead of the deprecated source.host field (which is renamed to deprecatedSource.host and not permitted in new events.k8s.io/v1 Events)
HorizontalPodAutoscaler
The autoscaling/v2beta1 API version of HorizontalPodAutoscaler will no longer be served in v1.25.
- Migrate manifests and API clients to use the autoscaling/v2 API version, available since v1.23.
- All existing persisted objects are accessible via the new API
PodDisruptionBudget
The policy/v1beta1 API version of PodDisruptionBudget will no longer be served in v1.25.
- Migrate manifests and API clients to use the policy/v1 API version, available since v1.21.
- All existing persisted objects are accessible via the new API
- Notable changes in policy/v1:
  - an empty spec.selector ({}) written to a policy/v1 PodDisruptionBudget selects all pods in the namespace (in policy/v1beta1 an empty spec.selector selected no pods). An unset spec.selector selects no pods in either API version.
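A minimal sketch of that policy/v1 behavior (the budget name is illustrative):
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-pdb
spec:
  maxUnavailable: 1
  selector: {}  # empty selector: in policy/v1 this selects every pod in the namespace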
PodSecurityPolicy
PodSecurityPolicy in the policy/v1beta1 API version will no longer be served in v1.25, and the PodSecurityPolicy admission controller will be removed.
PodSecurityPolicy replacements are still under discussion, but current use can be migrated to 3rd-party admission webhooks now.
RuntimeClass
RuntimeClass in the node.k8s.io/v1beta1 API version will no longer be served in v1.25.
- Migrate manifests and API clients to use the node.k8s.io/v1 API version, available since v1.20.
- All existing persisted objects are accessible via the new API
- No notable changes
v1.22
The v1.22 release stopped serving the following deprecated API versions:
Webhook resources
The admissionregistration.k8s.io/v1beta1 API version of MutatingWebhookConfiguration and ValidatingWebhookConfiguration is no longer served as of v1.22.
- Migrate manifests and API clients to use the admissionregistration.k8s.io/v1 API version, available since v1.16.
- All existing persisted objects are accessible via the new APIs
- Notable changes:
  - webhooks[*].failurePolicy default changed from Ignore to Fail for v1
  - webhooks[*].matchPolicy default changed from Exact to Equivalent for v1
  - webhooks[*].timeoutSeconds default changed from 30s to 10s for v1
  - webhooks[*].sideEffects default value is removed, the field is made required, and only None and NoneOnDryRun are permitted for v1
  - webhooks[*].admissionReviewVersions default value is removed and the field made required for v1 (supported versions for AdmissionReview are v1 and v1beta1)
  - webhooks[*].name must be unique in the list for objects created via admissionregistration.k8s.io/v1
CustomResourceDefinition
The apiextensions.k8s.io/v1beta1 API version of CustomResourceDefinition is no longer served as of v1.22.
- Migrate manifests and API clients to use the apiextensions.k8s.io/v1 API version, available since v1.16.
- All existing persisted objects are accessible via the new API
- Notable changes:
  - spec.scope is no longer defaulted to Namespaced and must be explicitly specified
  - spec.version is removed in v1; use spec.versions instead
  - spec.validation is removed in v1; use spec.versions[*].schema instead
  - spec.subresources is removed in v1; use spec.versions[*].subresources instead
  - spec.additionalPrinterColumns is removed in v1; use spec.versions[*].additionalPrinterColumns instead
  - spec.conversion.webhookClientConfig is moved to spec.conversion.webhook.clientConfig in v1
  - spec.conversion.conversionReviewVersions is moved to spec.conversion.webhook.conversionReviewVersions in v1
  - spec.versions[*].schema.openAPIV3Schema is now required when creating v1 CustomResourceDefinition objects, and must be a structural schema
  - spec.preserveUnknownFields: true is disallowed when creating v1 CustomResourceDefinition objects; it must be specified within schema definitions as x-kubernetes-preserve-unknown-fields: true
  - In additionalPrinterColumns items, the JSONPath field was renamed to jsonPath in v1 (fixes #66531)
APIService
The apiregistration.k8s.io/v1beta1 API version of APIService is no longer served as of v1.22.
- Migrate manifests and API clients to use the apiregistration.k8s.io/v1 API version, available since v1.10.
- All existing persisted objects are accessible via the new API
- No notable changes
TokenReview
The authentication.k8s.io/v1beta1 API version of TokenReview is no longer served as of v1.22.
- Migrate manifests and API clients to use the authentication.k8s.io/v1 API version, available since v1.6.
- No notable changes
SubjectAccessReview resources
The authorization.k8s.io/v1beta1 API version of LocalSubjectAccessReview, SelfSubjectAccessReview, and SubjectAccessReview is no longer served as of v1.22.
- Migrate manifests and API clients to use the authorization.k8s.io/v1 API version, available since v1.6.
- Notable changes:
  - spec.group was renamed to spec.groups in v1 (fixes #32709)
CertificateSigningRequest
The certificates.k8s.io/v1beta1 API version of CertificateSigningRequest is no longer served as of v1.22.
- Migrate manifests and API clients to use the certificates.k8s.io/v1 API version, available since v1.19.
- All existing persisted objects are accessible via the new API
- Notable changes in certificates.k8s.io/v1:
  - For API clients requesting certificates:
    - spec.signerName is now required (see known Kubernetes signers), and requests for kubernetes.io/legacy-unknown are not allowed to be created via the certificates.k8s.io/v1 API
    - spec.usages is now required, may not contain duplicate values, and must only contain known usages
  - For API clients approving or signing certificates:
    - status.conditions may not contain duplicate types
    - status.conditions[*].status is now required
    - status.certificate must be PEM-encoded, and contain only CERTIFICATE blocks
Lease
The coordination.k8s.io/v1beta1 API version of Lease is no longer served as of v1.22.
- Migrate manifests and API clients to use the coordination.k8s.io/v1 API version, available since v1.14.
- All existing persisted objects are accessible via the new API
- No notable changes
Ingress
The extensions/v1beta1 and networking.k8s.io/v1beta1 API versions of Ingress are no longer served as of v1.22.
- Migrate manifests and API clients to use the networking.k8s.io/v1 API version, available since v1.19.
- All existing persisted objects are accessible via the new API
- Notable changes:
  - spec.backend is renamed to spec.defaultBackend
  - The backend serviceName field is renamed to service.name
  - Numeric backend servicePort fields are renamed to service.port.number
  - String backend servicePort fields are renamed to service.port.name
  - pathType is now required for each specified path. Options are Prefix, Exact, and ImplementationSpecific. To match the undefined v1beta1 behavior, use ImplementationSpecific.
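Putting those renames together, a hedged sketch of a migrated networking.k8s.io/v1 Ingress (resource names and port are illustrative):
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-ingress
spec:
  defaultBackend:          # formerly spec.backend
    service:
      name: my-service     # formerly serviceName
      port:
        number: 80         # formerly a numeric servicePort
  rules:
  - http:
      paths:
      - path: /
        pathType: Prefix   # now required
        backend:
          service:
            name: my-service
            port:
              number: 80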
IngressClass
The networking.k8s.io/v1beta1 API version of IngressClass is no longer served as of v1.22.
- Migrate manifests and API clients to use the networking.k8s.io/v1 API version, available since v1.19.
- All existing persisted objects are accessible via the new API
- No notable changes
RBAC resources
The rbac.authorization.k8s.io/v1beta1 API version of ClusterRole, ClusterRoleBinding, Role, and RoleBinding is no longer served as of v1.22.
- Migrate manifests and API clients to use the rbac.authorization.k8s.io/v1 API version, available since v1.8.
- All existing persisted objects are accessible via the new APIs
- No notable changes
PriorityClass
The scheduling.k8s.io/v1beta1 API version of PriorityClass is no longer served as of v1.22.
- Migrate manifests and API clients to use the scheduling.k8s.io/v1 API version, available since v1.14.
- All existing persisted objects are accessible via the new API
- No notable changes
Storage resources
The storage.k8s.io/v1beta1 API version of CSIDriver, CSINode, StorageClass, and VolumeAttachment is no longer served as of v1.22.
- Migrate manifests and API clients to use the storage.k8s.io/v1 API version
  - CSIDriver is available in storage.k8s.io/v1 since v1.19
  - CSINode is available in storage.k8s.io/v1 since v1.17
  - StorageClass is available in storage.k8s.io/v1 since v1.6
  - VolumeAttachment is available in storage.k8s.io/v1 since v1.13
- All existing persisted objects are accessible via the new APIs
- No notable changes
v1.16
The v1.16 release stopped serving the following deprecated API versions:
NetworkPolicy
The extensions/v1beta1 API version of NetworkPolicy is no longer served as of v1.16.
- Migrate manifests and API clients to use the networking.k8s.io/v1 API version, available since v1.8.
- All existing persisted objects are accessible via the new API
DaemonSet
The extensions/v1beta1 and apps/v1beta2 API versions of DaemonSet are no longer served as of v1.16.
- Migrate manifests and API clients to use the apps/v1 API version, available since v1.9.
- All existing persisted objects are accessible via the new API
- Notable changes:
  - spec.templateGeneration is removed
  - spec.selector is now required and immutable after creation; use the existing template labels as the selector for seamless upgrades
  - spec.updateStrategy.type now defaults to RollingUpdate (the default in extensions/v1beta1 was OnDelete)
Deployment
The extensions/v1beta1, apps/v1beta1, and apps/v1beta2 API versions of Deployment are no longer served as of v1.16.
- Migrate manifests and API clients to use the apps/v1 API version, available since v1.9.
- All existing persisted objects are accessible via the new API
- Notable changes:
  - spec.rollbackTo is removed
  - spec.selector is now required and immutable after creation; use the existing template labels as the selector for seamless upgrades
  - spec.progressDeadlineSeconds now defaults to 600 seconds (the default in extensions/v1beta1 was no deadline)
  - spec.revisionHistoryLimit now defaults to 10 (the default in apps/v1beta1 was 2, the default in extensions/v1beta1 was to retain all)
  - maxSurge and maxUnavailable now default to 25% (the default in extensions/v1beta1 was 1)
StatefulSet
The apps/v1beta1 and apps/v1beta2 API versions of StatefulSet are no longer served as of v1.16.
- Migrate manifests and API clients to use the apps/v1 API version, available since v1.9.
- All existing persisted objects are accessible via the new API
- Notable changes:
  - spec.selector is now required and immutable after creation; use the existing template labels as the selector for seamless upgrades
  - spec.updateStrategy.type now defaults to RollingUpdate (the default in apps/v1beta1 was OnDelete)
ReplicaSet
The extensions/v1beta1, apps/v1beta1, and apps/v1beta2 API versions of ReplicaSet are no longer served as of v1.16.
- Migrate manifests and API clients to use the apps/v1 API version, available since v1.9.
- All existing persisted objects are accessible via the new API
- Notable changes:
  - spec.selector is now required and immutable after creation; use the existing template labels as the selector for seamless upgrades
PodSecurityPolicy
The extensions/v1beta1 API version of PodSecurityPolicy is no longer served as of v1.16.
- Migrate manifests and API clients to use the policy/v1beta1 API version, available since v1.10.
- Note that the policy/v1beta1 API version of PodSecurityPolicy will be removed in v1.25.
What to do
Test with deprecated APIs disabled
You can test your clusters by starting an API server with specific API versions disabled to simulate upcoming removals. Add the following flag to the API server startup arguments:
--runtime-config=<group>/<version>=false
For example:
--runtime-config=admissionregistration.k8s.io/v1beta1=false,apiextensions.k8s.io/v1beta1=false,...
Locate use of deprecated APIs
Use client warnings, metrics, and audit information available in 1.19+ to locate use of deprecated APIs.
Migrate to non-deprecated APIs
- Update custom integrations and controllers to call the non-deprecated APIs
- Change YAML files to reference the non-deprecated APIs
You can use the kubectl-convert command (kubectl convert prior to v1.20) to automatically convert an existing object:
kubectl-convert -f <file> --output-version <group>/<version>
For example, to convert an older Deployment to apps/v1, you can run:
kubectl-convert -f ./my-deployment.yaml --output-version apps/v1
Note that this may use non-ideal default values. To learn more about a specific resource, check the Kubernetes API reference.
6.2.6 - Kubernetes API health endpoints
The Kubernetes API server provides API endpoints to indicate the current status of the API server. This page describes these API endpoints and explains how you can use them.
API endpoints for health
The Kubernetes API server provides 3 API endpoints (healthz, livez and readyz) to indicate the current status of the API server.
The healthz endpoint is deprecated (since Kubernetes v1.16), and you should use the more specific livez and readyz endpoints instead.
The livez endpoint can be used with the --livez-grace-period flag to specify the startup duration. For a graceful shutdown you can specify the --shutdown-delay-duration flag with the /readyz endpoint.
Machines that check the healthz/livez/readyz of the API server should rely on the HTTP status code. A status code 200 indicates the API server is healthy/live/ready, depending on the called endpoint.
The more verbose options shown below are intended to be used by human operators to debug their cluster or understand the state of the API server.
The following examples will show how you can interact with the health API endpoints.
For all endpoints, you can use the verbose parameter to print out the checks and their status. This can be useful for a human operator to debug the current status of the API server; it is not intended to be consumed by a machine:
curl -k https://localhost:6443/livez?verbose
or from a remote host with authentication:
kubectl get --raw='/readyz?verbose'
The output will look like this:
[+]ping ok
[+]log ok
[+]etcd ok
[+]poststarthook/start-kube-apiserver-admission-initializer ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/start-apiextensions-informers ok
[+]poststarthook/start-apiextensions-controllers ok
[+]poststarthook/crd-informer-synced ok
[+]poststarthook/bootstrap-controller ok
[+]poststarthook/rbac/bootstrap-roles ok
[+]poststarthook/scheduling/bootstrap-system-priority-classes ok
[+]poststarthook/start-cluster-authentication-info-controller ok
[+]poststarthook/start-kube-aggregator-informers ok
[+]poststarthook/apiservice-registration-controller ok
[+]poststarthook/apiservice-status-available-controller ok
[+]poststarthook/kube-apiserver-autoregistration ok
[+]autoregister-completion ok
[+]poststarthook/apiservice-openapi-controller ok
healthz check passed
The Kubernetes API server also supports excluding specific checks. The query parameters can also be combined, as in this example:
curl -k 'https://localhost:6443/readyz?verbose&exclude=etcd'
The output shows that the etcd check is excluded:
[+]ping ok
[+]log ok
[+]etcd excluded: ok
[+]poststarthook/start-kube-apiserver-admission-initializer ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/start-apiextensions-informers ok
[+]poststarthook/start-apiextensions-controllers ok
[+]poststarthook/crd-informer-synced ok
[+]poststarthook/bootstrap-controller ok
[+]poststarthook/rbac/bootstrap-roles ok
[+]poststarthook/scheduling/bootstrap-system-priority-classes ok
[+]poststarthook/start-cluster-authentication-info-controller ok
[+]poststarthook/start-kube-aggregator-informers ok
[+]poststarthook/apiservice-registration-controller ok
[+]poststarthook/apiservice-status-available-controller ok
[+]poststarthook/kube-apiserver-autoregistration ok
[+]autoregister-completion ok
[+]poststarthook/apiservice-openapi-controller ok
[+]shutdown ok
healthz check passed
Individual health checks
Kubernetes v1.23 [alpha]
Each individual health check exposes an HTTP endpoint and can be checked individually.
The schema for the individual health checks is /livez/<healthcheck-name> (or /readyz/<healthcheck-name>), where livez and readyz can be used to indicate if you want to check the liveness or the readiness of the API server.
The <healthcheck-name> path can be discovered using the verbose flag from above and takes the path between [+] and ok.
These individual health checks should not be consumed by machines but can be helpful for a human operator to debug a system:
curl -k https://localhost:6443/livez/etcd
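The same kind of check can also be reached through kubectl from an authenticated host; for example (the check name is taken from the verbose output shown earlier):
kubectl get --raw='/readyz/poststarthook/rbac/bootstrap-roles'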
6.3 - API Access Control
For an introduction to how Kubernetes implements and controls API access, read Controlling Access to the Kubernetes API.
Reference documentation:
- Authenticating
- Admission Controllers
- Authorization
- Certificate Signing Requests
- including CSR approval and certificate signing
- Service accounts
6.3.1 - Authenticating
This page provides an overview of authenticating.
Users in Kubernetes
All Kubernetes clusters have two categories of users: service accounts managed by Kubernetes, and normal users.
It is assumed that a cluster-independent service manages normal users in the following ways:
- an administrator distributing private keys
- a user store like Keystone or Google Accounts
- a file with a list of usernames and passwords
In this regard, Kubernetes does not have objects which represent normal user accounts. Normal users cannot be added to a cluster through an API call.
Even though a normal user cannot be added via an API call, any user that presents a valid certificate signed by the cluster's certificate authority (CA) is considered authenticated. In this configuration, Kubernetes determines the username from the common name field in the 'subject' of the cert (e.g., "/CN=bob"). From there, the role based access control (RBAC) sub-system would determine whether the user is authorized to perform a specific operation on a resource. For more details, refer to the normal users topic in certificate request.
In contrast, service accounts are users managed by the Kubernetes API. They are bound to specific namespaces, and created automatically by the API server or manually through API calls. Service accounts are tied to a set of credentials stored as Secrets, which are mounted into pods allowing in-cluster processes to talk to the Kubernetes API.
API requests are tied to either a normal user or a service account, or are treated as anonymous requests. This means every process inside or outside the cluster, from a human user typing kubectl on a workstation, to kubelets on nodes, to members of the control plane, must authenticate when making requests to the API server, or be treated as an anonymous user.
Authentication strategies
Kubernetes uses client certificates, bearer tokens, or an authenticating proxy to authenticate API requests through authentication plugins. As HTTP requests are made to the API server, plugins attempt to associate the following attributes with the request:
- Username: a string which identifies the end user. Common values might be kube-admin or jane@example.com.
- UID: a string which identifies the end user and attempts to be more consistent and unique than username.
- Groups: a set of strings, each of which indicates the user's membership in a named logical collection of users. Common values might be system:masters or devops-team.
- Extra fields: a map of strings to list of strings which holds additional information authorizers may find useful.
All values are opaque to the authentication system and only hold significance when interpreted by an authorizer.
You can enable multiple authentication methods at once. You should usually use at least two methods:
- service account tokens for service accounts
- at least one other method for user authentication.
When multiple authenticator modules are enabled, the first module to successfully authenticate the request short-circuits evaluation. The API server does not guarantee the order authenticators run in.
The system:authenticated group is included in the list of groups for all authenticated users.
Integrations with other authentication protocols (LDAP, SAML, Kerberos, alternate x509 schemes, etc) can be accomplished using an authenticating proxy or the authentication webhook.
X509 Client Certs
Client certificate authentication is enabled by passing the --client-ca-file=SOMEFILE option to the API server. The referenced file must contain one or more certificate authorities to use to validate client certificates presented to the API server. If a client certificate is presented and verified, the common name of the subject is used as the user name for the request. As of Kubernetes 1.4, client certificates can also indicate a user's group memberships using the certificate's organization fields. To include multiple group memberships for a user, include multiple organization fields in the certificate.
For example, using the openssl command line tool to generate a certificate signing request:
openssl req -new -key jbeda.pem -out jbeda-csr.pem -subj "/CN=jbeda/O=app1/O=app2"
This would create a CSR for the username "jbeda", belonging to two groups, "app1" and "app2".
See Managing Certificates for how to generate a client cert.
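As a hedged end-to-end sketch (file names are illustrative; ca.crt and ca.key stand in for your cluster CA pair), the key generation and signing around that CSR might look like:
# generate a private key for the user
openssl genrsa -out jbeda.pem 2048
# create the CSR with username jbeda in groups app1 and app2
openssl req -new -key jbeda.pem -out jbeda-csr.pem -subj "/CN=jbeda/O=app1/O=app2"
# sign the CSR with the cluster CA
openssl x509 -req -in jbeda-csr.pem -CA ca.crt -CAkey ca.key -CAcreateserial -out jbeda.crt -days 365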
Static Token File
The API server reads bearer tokens from a file when given the --token-auth-file=SOMEFILE
option on the command line. Currently, tokens last indefinitely, and the token list cannot be
changed without restarting the API server.
The token file is a CSV file with a minimum of 3 columns: token, user name, user uid, followed by optional group names.
If you have more than one group, the column must be double quoted, e.g.:
token,user,uid,"group1,group2,group3"
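A hedged sketch of wiring such a file into the API server (the file path is illustrative):
kube-apiserver --token-auth-file=/etc/kubernetes/known_tokens.csv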
Putting a Bearer Token in a Request
When using bearer token authentication from an http client, the API server expects an Authorization header with a value of Bearer <token>. The bearer token must be a character sequence that can be put in an HTTP header value using no more than the encoding and quoting facilities of HTTP. For example: if the bearer token is 31ada4fd-adec-460c-809a-9e56ceb75269 then it would appear in an HTTP header as shown below.
Authorization: Bearer 31ada4fd-adec-460c-809a-9e56ceb75269
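For instance, a minimal sketch with curl (the server address is illustrative):
curl -k https://<apiserver>:6443/api --header "Authorization: Bearer 31ada4fd-adec-460c-809a-9e56ceb75269"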
Bootstrap Tokens
Kubernetes v1.18 [stable]
To allow for streamlined bootstrapping for new clusters, Kubernetes includes a dynamically-managed Bearer token type called a Bootstrap Token. These tokens are stored as Secrets in the kube-system namespace, where they can be dynamically managed and created. Controller Manager contains a TokenCleaner controller that deletes bootstrap tokens as they expire.
The tokens are of the form [a-z0-9]{6}.[a-z0-9]{16}. The first component is a Token ID and the second component is the Token Secret. You specify the token in an HTTP header as follows:
Authorization: Bearer 781292.db7bc3a58fc5f07e
You must enable the Bootstrap Token Authenticator with the --enable-bootstrap-token-auth flag on the API Server. You must enable the TokenCleaner controller via the --controllers flag on the Controller Manager. This is done with something like --controllers=*,tokencleaner. kubeadm will do this for you if you are using it to bootstrap a cluster.
The authenticator authenticates as system:bootstrap:<Token ID>. It is included in the system:bootstrappers group. The naming and groups are intentionally limited to discourage users from using these tokens past bootstrapping. The user names and group can be used (and are used by kubeadm) to craft the appropriate authorization policies to support bootstrapping a cluster.
Please see Bootstrap Tokens for in-depth documentation on the Bootstrap Token authenticator and controllers, along with how to manage these tokens with kubeadm.
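For instance, when using kubeadm, you can create and inspect bootstrap tokens directly:
kubeadm token create
kubeadm token list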
Service Account Tokens
A service account is an automatically enabled authenticator that uses signed bearer tokens to verify requests. The plugin takes two optional flags:
- --service-account-key-file: A file containing a PEM encoded key for signing bearer tokens. If unspecified, the API server's TLS private key will be used.
- --service-account-lookup: If enabled, tokens which are deleted from the API will be revoked.
Service accounts are usually created automatically by the API server and associated with pods running in the cluster through the ServiceAccount Admission Controller. Bearer tokens are mounted into pods at well-known locations, and allow in-cluster processes to talk to the API server. Accounts may be explicitly associated with pods using the serviceAccountName field of a PodSpec. serviceAccountName is usually omitted because this is done automatically.
apiVersion: apps/v1 # this apiVersion is relevant as of Kubernetes 1.9
kind: Deployment
metadata:
  name: nginx-deployment
  namespace: default
spec:
  replicas: 3
  template:
    metadata:
      # ...
    spec:
      serviceAccountName: bob-the-bot
      containers:
      - name: nginx
        image: nginx:1.14.2
Service account bearer tokens are perfectly valid to use outside the cluster and can be used to create identities for long standing jobs that wish to talk to the Kubernetes API. To manually create a service account, use the kubectl create serviceaccount (NAME) command. This creates a service account in the current namespace and an associated secret.
kubectl create serviceaccount jenkins
serviceaccount "jenkins" created
Check an associated secret:
kubectl get serviceaccounts jenkins -o yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  # ...
secrets:
- name: jenkins-token-1yvwg
The created secret holds the public CA of the API server and a signed JSON Web Token (JWT).
kubectl get secret jenkins-token-1yvwg -o yaml
apiVersion: v1
data:
  ca.crt: (APISERVER'S CA BASE64 ENCODED)
  namespace: ZGVmYXVsdA==
  token: (BEARER TOKEN BASE64 ENCODED)
kind: Secret
metadata:
  # ...
type: kubernetes.io/service-account-token
The signed JWT can be used as a bearer token to authenticate as the given service account. See above for how the token is included in a request. Normally these secrets are mounted into pods for in-cluster access to the API server, but can be used from outside the cluster as well.
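A minimal sketch of using that token from outside the cluster (the server address is illustrative):
# extract and decode the token from the secret shown above
TOKEN=$(kubectl get secret jenkins-token-1yvwg -o jsonpath='{.data.token}' | base64 --decode)
curl -k https://<apiserver>:6443/api --header "Authorization: Bearer $TOKEN"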
Service accounts authenticate with the username system:serviceaccount:(NAMESPACE):(SERVICEACCOUNT), and are assigned to the groups system:serviceaccounts and system:serviceaccounts:(NAMESPACE).
WARNING: Because service account tokens are stored in secrets, any user with read access to those secrets can authenticate as the service account. Be cautious when granting permissions to service accounts and read capabilities for secrets.
OpenID Connect Tokens
OpenID Connect is a flavor of OAuth2 supported by some OAuth2 providers, notably Azure Active Directory, Salesforce, and Google. The protocol's main extension of OAuth2 is an additional field returned with the access token called an ID Token. This token is a JSON Web Token (JWT) with well known fields, such as a user's email, signed by the server.
To identify the user, the authenticator uses the id_token (not the access_token) from the OAuth2 token response as a bearer token. See above for how the token is included in a request.
- Login to your identity provider
- Your identity provider will provide you with an access_token, id_token and a refresh_token
- When using kubectl, use your id_token with the --token flag or add it directly to your kubeconfig
- kubectl sends your id_token in a header called Authorization to the API server
- The API server will make sure the JWT signature is valid by checking against the certificate named in the configuration
- Check to make sure the id_token hasn't expired
- Make sure the user is authorized
- Once authorized the API server returns a response to kubectl
- kubectl provides feedback to the user
Since all of the data needed to validate who you are is in the id_token, Kubernetes doesn't need to "phone home" to the identity provider. In a model where every request is stateless, this provides a very scalable solution for authentication. It does offer a few challenges:
- Kubernetes has no "web interface" to trigger the authentication process. There is no browser or interface to collect credentials, which is why you need to authenticate to your identity provider first.
- The id_token can't be revoked; like a certificate, it should be short-lived (only a few minutes), so it can be very annoying to have to get a new token every few minutes.
- To authenticate to the Kubernetes dashboard, you must use the kubectl proxy command or a reverse proxy that injects the id_token.
Configuring the API Server
To enable the plugin, configure the following flags on the API server:
Parameter | Description | Example | Required |
---|---|---|---|
--oidc-issuer-url | URL of the provider which allows the API server to discover public signing keys. Only URLs which use the https:// scheme are accepted. This is typically the provider's discovery URL without a path, for example "https://accounts.google.com" or "https://login.salesforce.com". This URL should point to the level below .well-known/openid-configuration | If the discovery URL is https://accounts.google.com/.well-known/openid-configuration, the value should be https://accounts.google.com | Yes |
--oidc-client-id | A client id that all tokens must be issued for. | kubernetes | Yes |
--oidc-username-claim | JWT claim to use as the user name. By default sub, which is expected to be a unique identifier of the end user. Admins can choose other claims, such as email or name, depending on their provider. However, claims other than email will be prefixed with the issuer URL to prevent naming clashes with other plugins. | sub | No |
--oidc-username-prefix | Prefix prepended to username claims to prevent clashes with existing names (such as system: users). For example, the value oidc: will create usernames like oidc:jane.doe. If this flag isn't provided and --oidc-username-claim is a value other than email the prefix defaults to ( Issuer URL )# where ( Issuer URL ) is the value of --oidc-issuer-url. The value - can be used to disable all prefixing. | oidc: | No |
--oidc-groups-claim | JWT claim to use as the user's group. If the claim is present it must be an array of strings. | groups | No |
--oidc-groups-prefix | Prefix prepended to group claims to prevent clashes with existing names (such as system: groups). For example, the value oidc: will create group names like oidc:engineering and oidc:infra. | oidc: | No |
--oidc-required-claim | A key=value pair that describes a required claim in the ID Token. If set, the claim is verified to be present in the ID Token with a matching value. Repeat this flag to specify multiple claims. | claim=value | No |
--oidc-ca-file | The path to the certificate for the CA that signed your identity provider's web certificate. Defaults to the host's root CAs. | /etc/kubernetes/ssl/kc-ca.pem | No |
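Putting several of these flags together, a hedged sketch of an API server invocation (the values are illustrative):
kube-apiserver \
  --oidc-issuer-url=https://accounts.google.com \
  --oidc-client-id=kubernetes \
  --oidc-username-claim=email \
  --oidc-groups-claim=groups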
Importantly, the API server is not an OAuth2 client, rather it can only be configured to trust a single issuer. This allows the use of public providers, such as Google, without trusting credentials issued to third parties. Admins who wish to utilize multiple OAuth clients should explore providers which support the azp (authorized party) claim, a mechanism for allowing one client to issue tokens on behalf of another.
Kubernetes does not provide an OpenID Connect Identity Provider. You can use an existing public OpenID Connect Identity Provider (such as Google, or others). Or, you can run your own Identity Provider, such as dex, Keycloak, CloudFoundry UAA, or Tremolo Security's OpenUnison.
For an identity provider to work with Kubernetes it must:
- Support OpenID connect discovery; not all do.
- Run in TLS with non-obsolete ciphers
- Have a CA signed certificate (even if the CA is not a commercial CA or is self signed)
A note about requirement #3 above, requiring a CA signed certificate. If you deploy your own identity provider (as opposed to one of the cloud providers like Google or Microsoft) you MUST have your identity provider's web server certificate signed by a certificate with the CA flag set to TRUE, even if it is self signed. This is due to GoLang's TLS client implementation being very strict to the standards around certificate validation. If you don't have a CA handy, you can use this script from the Dex team to create a simple CA and a signed certificate and key pair.
Or you can use this similar script that generates SHA256 certs with a longer life and larger key size.
Setup instructions for specific systems:
Using kubectl
Option 1 - OIDC Authenticator
The first option is to use the kubectl oidc authenticator, which sets the id_token as a bearer token for all requests and refreshes the token once it expires. After you've logged into your provider, use kubectl to add your id_token, refresh_token, client_id, and client_secret to configure the plugin.
Providers that don't return an id_token as part of their refresh token response aren't supported by this plugin and should use "Option 2" below.
kubectl config set-credentials USER_NAME \
--auth-provider=oidc \
--auth-provider-arg=idp-issuer-url=( issuer url ) \
--auth-provider-arg=client-id=( your client id ) \
--auth-provider-arg=client-secret=( your client secret ) \
--auth-provider-arg=refresh-token=( your refresh token ) \
--auth-provider-arg=idp-certificate-authority=( path to your ca certificate ) \
--auth-provider-arg=id-token=( your id_token )
As an example, running the below command after authenticating to your identity provider:
kubectl config set-credentials mmosley \
--auth-provider=oidc \
--auth-provider-arg=idp-issuer-url=https://oidcidp.tremolo.lan:8443/auth/idp/OidcIdP \
--auth-provider-arg=client-id=kubernetes \
--auth-provider-arg=client-secret=1db158f6-177d-4d9c-8a8b-d36869918ec5 \
--auth-provider-arg=refresh-token=q1bKLFOyUiosTfawzA93TzZIDzH2TNa2SMm0zEiPKTUwME6BkEo6Sql5yUWVBSWpKUGphaWpxSVAfekBOZbBhaEW+VlFUeVRGcluyVF5JT4+haZmPsluFoFu5XkpXk5BXqHega4GAXlF+ma+vmYpFcHe5eZR+slBFpZKtQA= \
--auth-provider-arg=idp-certificate-authority=/root/ca.pem \
--auth-provider-arg=id-token=eyJraWQiOiJDTj1vaWRjaWRwLnRyZW1vbG8ubGFuLCBPVT1EZW1vLCBPPVRybWVvbG8gU2VjdXJpdHksIEw9QXJsaW5ndG9uLCBTVD1WaXJnaW5pYSwgQz1VUy1DTj1rdWJlLWNhLTEyMDIxNDc5MjEwMzYwNzMyMTUyIiwiYWxnIjoiUlMyNTYifQ.eyJpc3MiOiJodHRwczovL29pZGNpZHAudHJlbW9sby5sYW46ODQ0My9hdXRoL2lkcC9PaWRjSWRQIiwiYXVkIjoia3ViZXJuZXRlcyIsImV4cCI6MTQ4MzU0OTUxMSwianRpIjoiMm96US15TXdFcHV4WDlHZUhQdy1hZyIsImlhdCI6MTQ4MzU0OTQ1MSwibmJmIjoxNDgzNTQ5MzMxLCJzdWIiOiI0YWViMzdiYS1iNjQ1LTQ4ZmQtYWIzMC0xYTAxZWU0MWUyMTgifQ.w6p4J_6qQ1HzTG9nrEOrubxIMb9K5hzcMPxc9IxPx2K4xO9l-oFiUw93daH3m5pluP6K7eOE6txBuRVfEcpJSwlelsOsW8gb8VJcnzMS9EnZpeA0tW_p-mnkFc3VcfyXuhe5R3G7aa5d8uHv70yJ9Y3-UhjiN9EhpMdfPAoEB9fYKKkJRzF7utTTIPGrSaSU6d2pcpfYKaxIwePzEkT4DfcQthoZdy9ucNvvLoi1DIC-UocFD8HLs8LYKEqSxQvOcvnThbObJ9af71EwmuE21fO5KzMW20KtAeget1gnldOosPtz1G5EwvaQ401-RPQzPGMVBld0_zMCAwZttJ4knw
Which would produce the below configuration:
users:
- name: mmosley
  user:
    auth-provider:
      config:
        client-id: kubernetes
        client-secret: 1db158f6-177d-4d9c-8a8b-d36869918ec5
        id-token: eyJraWQiOiJDTj1vaWRjaWRwLnRyZW1vbG8ubGFuLCBPVT1EZW1vLCBPPVRybWVvbG8gU2VjdXJpdHksIEw9QXJsaW5ndG9uLCBTVD1WaXJnaW5pYSwgQz1VUy1DTj1rdWJlLWNhLTEyMDIxNDc5MjEwMzYwNzMyMTUyIiwiYWxnIjoiUlMyNTYifQ.eyJpc3MiOiJodHRwczovL29pZGNpZHAudHJlbW9sby5sYW46ODQ0My9hdXRoL2lkcC9PaWRjSWRQIiwiYXVkIjoia3ViZXJuZXRlcyIsImV4cCI6MTQ4MzU0OTUxMSwianRpIjoiMm96US15TXdFcHV4WDlHZUhQdy1hZyIsImlhdCI6MTQ4MzU0OTQ1MSwibmJmIjoxNDgzNTQ5MzMxLCJzdWIiOiI0YWViMzdiYS1iNjQ1LTQ4ZmQtYWIzMC0xYTAxZWU0MWUyMTgifQ.w6p4J_6qQ1HzTG9nrEOrubxIMb9K5hzcMPxc9IxPx2K4xO9l-oFiUw93daH3m5pluP6K7eOE6txBuRVfEcpJSwlelsOsW8gb8VJcnzMS9EnZpeA0tW_p-mnkFc3VcfyXuhe5R3G7aa5d8uHv70yJ9Y3-UhjiN9EhpMdfPAoEB9fYKKkJRzF7utTTIPGrSaSU6d2pcpfYKaxIwePzEkT4DfcQthoZdy9ucNvvLoi1DIC-UocFD8HLs8LYKEqSxQvOcvnThbObJ9af71EwmuE21fO5KzMW20KtAeget1gnldOosPtz1G5EwvaQ401-RPQzPGMVBld0_zMCAwZttJ4knw
        idp-certificate-authority: /root/ca.pem
        idp-issuer-url: https://oidcidp.tremolo.lan:8443/auth/idp/OidcIdP
        refresh-token: q1bKLFOyUiosTfawzA93TzZIDzH2TNa2SMm0zEiPKTUwME6BkEo6Sql5yUWVBSWpKUGphaWpxSVAfekBOZbBhaEW+VlFUeVRGcluyVF5JT4+haZmPsluFoFu5XkpXk5BXq
      name: oidc
Once your id_token expires, kubectl will attempt to refresh your id_token using your refresh_token and client_secret, storing the new values for the refresh_token and id_token in your .kube/config.
Option 2 - Use the --token Option
The kubectl command lets you pass in a token using the --token option. Copy and paste the id_token into this option:
kubectl --token=eyJhbGciOiJSUzI1NiJ9.eyJpc3MiOiJodHRwczovL21sYi50cmVtb2xvLmxhbjo4MDQzL2F1dGgvaWRwL29pZGMiLCJhdWQiOiJrdWJlcm5ldGVzIiwiZXhwIjoxNDc0NTk2NjY5LCJqdGkiOiI2RDUzNXoxUEpFNjJOR3QxaWVyYm9RIiwiaWF0IjoxNDc0NTk2MzY5LCJuYmYiOjE0NzQ1OTYyNDksInN1YiI6Im13aW5kdSIsInVzZXJfcm9sZSI6WyJ1c2VycyIsIm5ldy1uYW1lc3BhY2Utdmlld2VyIl0sImVtYWlsIjoibXdpbmR1QG5vbW9yZWplZGkuY29tIn0.f2As579n9VNoaKzoF-dOQGmXkFKf1FMyNV0-va_B63jn-_n9LGSCca_6IVMP8pO-Zb4KvRqGyTP0r3HkHxYy5c81AnIh8ijarruczl-TK_yF5akjSTHFZD-0gRzlevBDiH8Q79NAr-ky0P4iIXS8lY9Vnjch5MF74Zx0c3alKJHJUnnpjIACByfF2SCaYzbWFMUNat-K1PaUk5-ujMBG7yYnr95xD-63n8CO8teGUAAEMx6zRjzfhnhbzX-ajwZLGwGUBT4WqjMs70-6a7_8gZmLZb2az1cZynkFRj2BaCkVT3A2RrjeEwZEtGXlMqKJ1_I2ulrOVsYx01_yD35-rw get nodes
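If you prefer not to paste the token on every invocation, you can instead store it in your kubeconfig (a sketch; USER_NAME and the token placeholder are illustrative). Note that a token stored this way is not refreshed automatically; once the id_token expires you must replace it:
kubectl config set-credentials USER_NAME --token=( your id_token )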
Webhook Token Authentication
Webhook authentication is a hook for verifying bearer tokens.
- --authentication-token-webhook-config-file: a configuration file describing how to access the remote webhook service.
- --authentication-token-webhook-cache-ttl: how long to cache authentication decisions. Defaults to two minutes.
- --authentication-token-webhook-version: determines whether to use authentication.k8s.io/v1beta1 or authentication.k8s.io/v1 TokenReview objects to send/receive information from the webhook. Defaults to v1beta1.
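For example, an API server using webhook token authentication might be started with flags like these (a sketch; the file path is illustrative):
kube-apiserver \
  --authentication-token-webhook-config-file=/etc/kubernetes/authn-webhook.kubeconfig \
  --authentication-token-webhook-cache-ttl=2m \
  --authentication-token-webhook-version=v1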
The configuration file uses the kubeconfig file format. Within the file, clusters refers to the remote service and users refers to the API server webhook. An example would be:
# Kubernetes API version
apiVersion: v1
# kind of the API object
kind: Config
# clusters refers to the remote service.
clusters:
- name: name-of-remote-authn-service
cluster:
certificate-authority: /path/to/ca.pem # CA for verifying the remote service.
server: https://authn.example.com/authenticate # URL of remote service to query. 'https' recommended for production.
# users refers to the API server's webhook configuration.
users:
- name: name-of-api-server
user:
client-certificate: /path/to/cert.pem # cert for the webhook plugin to use
client-key: /path/to/key.pem # key matching the cert
# kubeconfig files require a context. Provide one for the API server.
current-context: webhook
contexts:
- context:
cluster: name-of-remote-authn-service
user: name-of-api-server
name: webhook
When a client attempts to authenticate with the API server using a bearer token as discussed above,
the authentication webhook POSTs a JSON-serialized TokenReview
object containing the token to the remote service.
Note that webhook API objects are subject to the same versioning compatibility rules as other Kubernetes API objects.
Implementers should check the apiVersion
field of the request to ensure correct deserialization,
and must respond with a TokenReview
object of the same version as the request.
By default, the API server sends authentication.k8s.io/v1beta1 token reviews for backwards compatibility.
To opt into receiving authentication.k8s.io/v1 token reviews, the API server must be started with --authentication-token-webhook-version=v1.
{
"apiVersion": "authentication.k8s.io/v1",
"kind": "TokenReview",
"spec": {
# Opaque bearer token sent to the API server
"token": "014fbff9a07c...",
# Optional list of the audience identifiers for the server the token was presented to.
# Audience-aware token authenticators (for example, OIDC token authenticators)
# should verify the token was intended for at least one of the audiences in this list,
# and return the intersection of this list and the valid audiences for the token in the response status.
# This ensures the token is valid to authenticate to the server it was presented to.
# If no audiences are provided, the token should be validated to authenticate to the Kubernetes API server.
"audiences": ["https://myserver.example.com", "https://myserver.internal.example.com"]
}
}
{
"apiVersion": "authentication.k8s.io/v1beta1",
"kind": "TokenReview",
"spec": {
# Opaque bearer token sent to the API server
"token": "014fbff9a07c...",
# Optional list of the audience identifiers for the server the token was presented to.
# Audience-aware token authenticators (for example, OIDC token authenticators)
# should verify the token was intended for at least one of the audiences in this list,
# and return the intersection of this list and the valid audiences for the token in the response status.
# This ensures the token is valid to authenticate to the server it was presented to.
# If no audiences are provided, the token should be validated to authenticate to the Kubernetes API server.
"audiences": ["https://myserver.example.com", "https://myserver.internal.example.com"]
}
}
The remote service is expected to fill the status
field of the request to indicate the success of the login.
The response body's spec
field is ignored and may be omitted.
The remote service must return a response using the same TokenReview
API version that it received.
A successful validation of the bearer token would return:
{
"apiVersion": "authentication.k8s.io/v1",
"kind": "TokenReview",
"status": {
"authenticated": true,
"user": {
# Required
"username": "janedoe@example.com",
# Optional
"uid": "42",
# Optional group memberships
"groups": ["developers", "qa"],
# Optional additional information provided by the authenticator.
# This should not contain confidential data, as it can be recorded in logs
# or API objects, and is made available to admission webhooks.
"extra": {
"extrafield1": [
"extravalue1",
"extravalue2"
]
}
},
# Optional list audience-aware token authenticators can return,
# containing the audiences from the `spec.audiences` list for which the provided token was valid.
# If this is omitted, the token is considered to be valid to authenticate to the Kubernetes API server.
"audiences": ["https://myserver.example.com"]
}
}
{
"apiVersion": "authentication.k8s.io/v1beta1",
"kind": "TokenReview",
"status": {
"authenticated": true,
"user": {
# Required
"username": "janedoe@example.com",
# Optional
"uid": "42",
# Optional group memberships
"groups": ["developers", "qa"],
# Optional additional information provided by the authenticator.
# This should not contain confidential data, as it can be recorded in logs
# or API objects, and is made available to admission webhooks.
"extra": {
"extrafield1": [
"extravalue1",
"extravalue2"
]
}
},
# Optional list audience-aware token authenticators can return,
# containing the audiences from the `spec.audiences` list for which the provided token was valid.
# If this is omitted, the token is considered to be valid to authenticate to the Kubernetes API server.
"audiences": ["https://myserver.example.com"]
}
}
An unsuccessful request would return:
{
"apiVersion": "authentication.k8s.io/v1",
"kind": "TokenReview",
"status": {
"authenticated": false,
# Optionally include details about why authentication failed.
# If no error is provided, the API will return a generic Unauthorized message.
# The error field is ignored when authenticated=true.
"error": "Credentials are expired"
}
}
{
"apiVersion": "authentication.k8s.io/v1beta1",
"kind": "TokenReview",
"status": {
"authenticated": false,
# Optionally include details about why authentication failed.
# If no error is provided, the API will return a generic Unauthorized message.
# The error field is ignored when authenticated=true.
"error": "Credentials are expired"
}
}
Authenticating Proxy
The API server can be configured to identify users from request header values, such as X-Remote-User.
It is designed for use in combination with an authenticating proxy, which sets the request header value.
- --requestheader-username-headers: Required, case-insensitive. Header names to check, in order, for the user identity. The first header containing a value is used as the username.
- --requestheader-group-headers: 1.6+. Optional, case-insensitive. "X-Remote-Group" is suggested. Header names to check, in order, for the user's groups. All values in all specified headers are used as group names.
- --requestheader-extra-headers-prefix: 1.6+. Optional, case-insensitive. "X-Remote-Extra-" is suggested. Header prefixes to look for to determine extra information about the user (typically used by the configured authorization plugin). Any headers beginning with any of the specified prefixes have the prefix removed. The remainder of the header name is lowercased and percent-decoded and becomes the extra key, and the header value is the extra value.
For example, with this configuration:
--requestheader-username-headers=X-Remote-User
--requestheader-group-headers=X-Remote-Group
--requestheader-extra-headers-prefix=X-Remote-Extra-
this request:
GET / HTTP/1.1
X-Remote-User: fido
X-Remote-Group: dogs
X-Remote-Group: dachshunds
X-Remote-Extra-Acme.com%2Fproject: some-project
X-Remote-Extra-Scopes: openid
X-Remote-Extra-Scopes: profile
would result in this user info:
name: fido
groups:
- dogs
- dachshunds
extra:
acme.com/project:
- some-project
scopes:
- openid
- profile
In order to prevent header spoofing, the authenticating proxy is required to present a valid client certificate to the API server for validation against the specified CA before the request headers are checked. WARNING: do not reuse a CA that is used in a different context unless you understand the risks and the mechanisms to protect the CA's usage.
- --requestheader-client-ca-file: Required. PEM-encoded certificate bundle. A valid client certificate must be presented and validated against the certificate authorities in the specified file before the request headers are checked for user names.
- --requestheader-allowed-names: Optional. List of Common Name values (CNs). If set, a valid client certificate with a CN in the specified list must be presented before the request headers are checked for user names. If empty, any CN is allowed.
Anonymous requests
When enabled, requests that are not rejected by other configured authentication methods are treated as anonymous requests, and given a username of system:anonymous and a group of system:unauthenticated.
For example, on a server with token authentication configured, and anonymous access enabled, a request providing an invalid bearer token would receive a 401 Unauthorized error. A request providing no bearer token would be treated as an anonymous request.
In 1.5.1-1.5.x, anonymous access is disabled by default, and can be enabled by
passing the --anonymous-auth=true
option to the API server.
In 1.6+, anonymous access is enabled by default if an authorization mode other than AlwaysAllow
is used, and can be disabled by passing the --anonymous-auth=false
option to the API server.
Starting in 1.6, the ABAC and RBAC authorizers require explicit authorization of the
system:anonymous
user or the system:unauthenticated
group, so legacy policy rules
that grant access to the *
user or *
group do not include anonymous users.
User impersonation
A user can act as another user through impersonation headers. These let requests manually override the user info a request authenticates as. For example, an admin could use this feature to debug an authorization policy by temporarily impersonating another user and seeing if a request was denied.
Impersonation requests first authenticate as the requesting user, then switch to the impersonated user info.
- A user makes an API call with their credentials and impersonation headers.
- API server authenticates the user.
- API server ensures the authenticated users have impersonation privileges.
- Request user info is replaced with impersonation values.
- Request is evaluated, authorization acts on impersonated user info.
The following HTTP headers can be used to perform an impersonation request:
- Impersonate-User: The username to act as.
- Impersonate-Group: A group name to act as. Can be provided multiple times to set multiple groups. Optional. Requires "Impersonate-User".
- Impersonate-Extra-( extra name ): A dynamic header used to associate extra fields with the user. Optional. Requires "Impersonate-User". In order to be preserved consistently, ( extra name ) must be lower-case, and any characters which aren't legal in HTTP header labels MUST be utf8 and percent-encoded.
- Impersonate-Uid: A unique identifier that represents the user being impersonated. Optional. Requires "Impersonate-User". Kubernetes does not impose any format requirements on this string.
In earlier Kubernetes releases, ( extra name ) could only contain characters which were legal in HTTP header labels.
Impersonate-Uid is only available in versions 1.22.0 and higher.
An example of the impersonation headers used when impersonating a user with groups:
Impersonate-User: jane.doe@example.com
Impersonate-Group: developers
Impersonate-Group: admins
An example of the impersonation headers used when impersonating a user with a UID and extra fields:
Impersonate-User: jane.doe@example.com
Impersonate-Extra-dn: cn=jane,ou=engineers,dc=example,dc=com
Impersonate-Extra-acme.com%2Fproject: some-project
Impersonate-Extra-scopes: view
Impersonate-Extra-scopes: development
Impersonate-Uid: 06f6ce97-e2c5-4ab8-7ba5-7654dd08d52b
When using kubectl, set the --as flag to configure the Impersonate-User header, and the --as-group flag to configure the Impersonate-Group header. For example:
kubectl drain mynode
Error from server (Forbidden): User "clark" cannot get nodes at the cluster scope. (get nodes mynode)
Set the --as and --as-group flags:
kubectl drain mynode --as=superman --as-group=system:masters
node/mynode cordoned
node/mynode drained
kubectl cannot impersonate extra fields or UIDs.
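To send impersonation headers for a UID or extra fields, you can issue the request directly, for example with curl (a sketch; the server address, CA path, and header values are illustrative, and $TOKEN is assumed to hold the impersonating user's credentials):
curl https://172.17.4.100:6443/api/v1/namespaces/default/pods \
  --cacert /etc/kubernetes/ca.pem \
  --header "Authorization: Bearer $TOKEN" \
  --header "Impersonate-User: jane.doe@example.com" \
  --header "Impersonate-Uid: 06f6ce97-e2c5-4ab8-7ba5-7654dd08d52b" \
  --header "Impersonate-Extra-scopes: view"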
To impersonate a user, group, user identifier (UID) or extra fields, the impersonating user must have the ability to perform the "impersonate" verb on the kind of attribute being impersonated ("user", "group", "uid", etc.). For clusters that enable the RBAC authorization plugin, the following ClusterRole encompasses the rules needed to set user and group impersonation headers:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: impersonator
rules:
- apiGroups: [""]
resources: ["users", "groups", "serviceaccounts"]
verbs: ["impersonate"]
For impersonation, extra fields and impersonated UIDs are both under the "authentication.k8s.io" apiGroup. Extra fields are evaluated as sub-resources of the resource "userextras". To allow a user to use impersonation headers for the extra field "scopes" and for UIDs, a user should be granted the following role:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: scopes-and-uid-impersonator
rules:
# Can set "Impersonate-Extra-scopes" header and the "Impersonate-Uid" header.
- apiGroups: ["authentication.k8s.io"]
resources: ["userextras/scopes", "uids"]
verbs: ["impersonate"]
The values of impersonation headers can also be restricted by limiting the set of resourceNames a resource can take.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: limited-impersonator
rules:
# Can impersonate the user "jane.doe@example.com"
- apiGroups: [""]
resources: ["users"]
verbs: ["impersonate"]
resourceNames: ["jane.doe@example.com"]
# Can impersonate the groups "developers" and "admins"
- apiGroups: [""]
resources: ["groups"]
verbs: ["impersonate"]
resourceNames: ["developers","admins"]
# Can impersonate the extras field "scopes" with the values "view" and "development"
- apiGroups: ["authentication.k8s.io"]
resources: ["userextras/scopes"]
verbs: ["impersonate"]
resourceNames: ["view", "development"]
# Can impersonate the uid "06f6ce97-e2c5-4ab8-7ba5-7654dd08d52b"
- apiGroups: ["authentication.k8s.io"]
resources: ["uids"]
verbs: ["impersonate"]
resourceNames: ["06f6ce97-e2c5-4ab8-7ba5-7654dd08d52b"]
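Granting such a ClusterRole to a user is a normal RBAC binding. For example (a sketch; the user name is illustrative):
kubectl create clusterrolebinding limited-impersonator-clark --clusterrole=limited-impersonator --user=clark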
client-go credential plugins
Kubernetes v1.22 [stable]
k8s.io/client-go
and tools using it such as kubectl
and kubelet
are able to execute an
external command to receive user credentials.
This feature is intended for client side integrations with authentication protocols not natively
supported by k8s.io/client-go
(LDAP, Kerberos, OAuth2, SAML, etc.). The plugin implements the
protocol specific logic, then returns opaque credentials to use. Almost all credential plugin
use cases require a server side component with support for the webhook token authenticator
to interpret the credential format produced by the client plugin.
Example use case
In a hypothetical use case, an organization would run an external service that exchanges LDAP credentials for user specific, signed tokens. The service would also be capable of responding to webhook token authenticator requests to validate the tokens. Users would be required to install a credential plugin on their workstation.
To authenticate against the API:
- The user issues a kubectl command.
- Credential plugin prompts the user for LDAP credentials, exchanges credentials with external service for a token.
- Credential plugin returns token to client-go, which uses it as a bearer token against the API server.
- API server uses the webhook token authenticator to submit a TokenReview to the external service.
- External service verifies the signature on the token and returns the user's username and groups.
Configuration
Credential plugins are configured through kubectl config files as part of the user fields.
apiVersion: v1
kind: Config
users:
- name: my-user
user:
exec:
# Command to execute. Required.
command: "example-client-go-exec-plugin"
# API version to use when decoding the ExecCredentials resource. Required.
#
# The API version returned by the plugin MUST match the version listed here.
#
# To integrate with tools that support multiple versions (such as client.authentication.k8s.io/v1alpha1),
# set an environment variable, pass an argument to the tool that indicates which version the exec plugin expects,
# or read the version from the ExecCredential object in the KUBERNETES_EXEC_INFO environment variable.
apiVersion: "client.authentication.k8s.io/v1"
# Environment variables to set when executing the plugin. Optional.
env:
- name: "FOO"
value: "bar"
# Arguments to pass when executing the plugin. Optional.
args:
- "arg1"
- "arg2"
# Text shown to the user when the executable doesn't seem to be present. Optional.
installHint: |
example-client-go-exec-plugin is required to authenticate
to the current cluster. It can be installed:
On macOS: brew install example-client-go-exec-plugin
On Ubuntu: apt-get install example-client-go-exec-plugin
On Fedora: dnf install example-client-go-exec-plugin
...
# Whether or not to provide cluster information, which could potentially contain
# very large CA data, to this exec plugin as a part of the KUBERNETES_EXEC_INFO
# environment variable.
provideClusterInfo: true
# The contract between the exec plugin and the standard input I/O stream. If the
# contract cannot be satisfied, this plugin will not be run and an error will be
# returned. Valid values are "Never" (this exec plugin never uses standard input),
# "IfAvailable" (this exec plugin wants to use standard input if it is available),
# or "Always" (this exec plugin requires standard input to function). Required.
interactiveMode: Never
clusters:
- name: my-cluster
cluster:
server: "https://172.17.4.100:6443"
certificate-authority: "/etc/kubernetes/ca.pem"
extensions:
- name: client.authentication.k8s.io/exec # reserved extension name for per cluster exec config
extension:
arbitrary: config
this: can be provided via the KUBERNETES_EXEC_INFO environment variable upon setting provideClusterInfo
you: ["can", "put", "anything", "here"]
contexts:
- name: my-cluster
context:
cluster: my-cluster
user: my-user
current-context: my-cluster
apiVersion: v1
kind: Config
users:
- name: my-user
user:
exec:
# Command to execute. Required.
command: "example-client-go-exec-plugin"
# API version to use when decoding the ExecCredentials resource. Required.
#
# The API version returned by the plugin MUST match the version listed here.
#
# To integrate with tools that support multiple versions (such as client.authentication.k8s.io/v1alpha1),
# set an environment variable, pass an argument to the tool that indicates which version the exec plugin expects,
# or read the version from the ExecCredential object in the KUBERNETES_EXEC_INFO environment variable.
apiVersion: "client.authentication.k8s.io/v1beta1"
# Environment variables to set when executing the plugin. Optional.
env:
- name: "FOO"
value: "bar"
# Arguments to pass when executing the plugin. Optional.
args:
- "arg1"
- "arg2"
# Text shown to the user when the executable doesn't seem to be present. Optional.
installHint: |
example-client-go-exec-plugin is required to authenticate
to the current cluster. It can be installed:
On macOS: brew install example-client-go-exec-plugin
On Ubuntu: apt-get install example-client-go-exec-plugin
On Fedora: dnf install example-client-go-exec-plugin
...
# Whether or not to provide cluster information, which could potentially contain
# very large CA data, to this exec plugin as a part of the KUBERNETES_EXEC_INFO
# environment variable.
provideClusterInfo: true
# The contract between the exec plugin and the standard input I/O stream. If the
# contract cannot be satisfied, this plugin will not be run and an error will be
# returned. Valid values are "Never" (this exec plugin never uses standard input),
# "IfAvailable" (this exec plugin wants to use standard input if it is available),
# or "Always" (this exec plugin requires standard input to function). Optional.
# Defaults to "IfAvailable".
interactiveMode: Never
clusters:
- name: my-cluster
cluster:
server: "https://172.17.4.100:6443"
certificate-authority: "/etc/kubernetes/ca.pem"
extensions:
- name: client.authentication.k8s.io/exec # reserved extension name for per cluster exec config
extension:
arbitrary: config
this: can be provided via the KUBERNETES_EXEC_INFO environment variable upon setting provideClusterInfo
you: ["can", "put", "anything", "here"]
contexts:
- name: my-cluster
context:
cluster: my-cluster
user: my-user
current-context: my-cluster
Relative command paths are interpreted as relative to the directory of the config file. If KUBECONFIG is set to /home/jane/kubeconfig and the exec command is ./bin/example-client-go-exec-plugin, the binary /home/jane/bin/example-client-go-exec-plugin is executed.
- name: my-user
user:
exec:
# Path relative to the directory of the kubeconfig
command: "./bin/example-client-go-exec-plugin"
apiVersion: "client.authentication.k8s.io/v1"
interactiveMode: Never
Input and output formats
The executed command prints an ExecCredential object to stdout. k8s.io/client-go authenticates against the Kubernetes API using the returned credentials in the status.
The executed command is passed an ExecCredential object as input via the KUBERNETES_EXEC_INFO environment variable. This input contains helpful information like the expected API version of the returned ExecCredential object and whether or not the plugin can use stdin to interact with the user.
When run from an interactive session (i.e., a terminal), stdin can be exposed directly to the plugin. Plugins should use the spec.interactive field of the input ExecCredential object from the KUBERNETES_EXEC_INFO environment variable in order to determine if stdin has been provided. A plugin's stdin requirements (i.e., whether stdin is optional, strictly required, or never used in order for the plugin to run successfully) are declared via the user.exec.interactiveMode field in the kubeconfig (see table below for valid values). The user.exec.interactiveMode field is optional in client.authentication.k8s.io/v1beta1 and required in client.authentication.k8s.io/v1.
| interactiveMode Value | Meaning |
|---|---|
| Never | This exec plugin never needs to use standard input, and therefore the exec plugin will be run regardless of whether standard input is available for user input. |
| IfAvailable | This exec plugin would like to use standard input if it is available, but can still operate if standard input is not available. Therefore, the exec plugin will be run regardless of whether stdin is available for user input. If standard input is available for user input, then it will be provided to this exec plugin. |
| Always | This exec plugin requires standard input in order to run, and therefore the exec plugin will only be run if standard input is available for user input. If standard input is not available for user input, then the exec plugin will not be run and an error will be returned by the exec plugin runner. |
To use bearer token credentials, the plugin returns a token in the status of the ExecCredential:
{
"apiVersion": "client.authentication.k8s.io/v1",
"kind": "ExecCredential",
"status": {
"token": "my-bearer-token"
}
}
{
"apiVersion": "client.authentication.k8s.io/v1beta1",
"kind": "ExecCredential",
"status": {
"token": "my-bearer-token"
}
}
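To make the output contract concrete, a minimal hypothetical exec plugin could be a shell script that emits such an object (the token is hard-coded purely for illustration; a real plugin would obtain it from an identity provider):
#!/usr/bin/env bash
# Hypothetical minimal exec plugin: print an ExecCredential to stdout.
cat <<EOF
{
  "apiVersion": "client.authentication.k8s.io/v1",
  "kind": "ExecCredential",
  "status": {
    "token": "my-bearer-token"
  }
}
EOF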
Alternatively, a PEM-encoded client certificate and key can be returned to use TLS client auth.
If the plugin returns a different certificate and key on a subsequent call, k8s.io/client-go
will close existing connections with the server to force a new TLS handshake.
If specified, clientKeyData and clientCertificateData must both be present.
clientCertificateData may contain additional intermediate certificates to send to the server.
{
"apiVersion": "client.authentication.k8s.io/v1",
"kind": "ExecCredential",
"status": {
"clientCertificateData": "-----BEGIN CERTIFICATE-----\n...\n-----END CERTIFICATE-----",
"clientKeyData": "-----BEGIN RSA PRIVATE KEY-----\n...\n-----END RSA PRIVATE KEY-----"
}
}
{
"apiVersion": "client.authentication.k8s.io/v1beta1",
"kind": "ExecCredential",
"status": {
"clientCertificateData": "-----BEGIN CERTIFICATE-----\n...\n-----END CERTIFICATE-----",
"clientKeyData": "-----BEGIN RSA PRIVATE KEY-----\n...\n-----END RSA PRIVATE KEY-----"
}
}
Optionally, the response can include the expiry of the credential formatted as an RFC 3339 timestamp.
Presence or absence of an expiry has the following impact:
- If an expiry is included, the bearer token and TLS credentials are cached until the expiry time is reached, the server responds with a 401 HTTP status code, or the process exits.
- If an expiry is omitted, the bearer token and TLS credentials are cached until the server responds with a 401 HTTP status code or until the process exits.
{
"apiVersion": "client.authentication.k8s.io/v1",
"kind": "ExecCredential",
"status": {
"token": "my-bearer-token",
"expirationTimestamp": "2018-03-05T17:30:20-08:00"
}
}
{
"apiVersion": "client.authentication.k8s.io/v1beta1",
"kind": "ExecCredential",
"status": {
"token": "my-bearer-token",
"expirationTimestamp": "2018-03-05T17:30:20-08:00"
}
}
To enable the exec plugin to obtain cluster-specific information, set provideClusterInfo on the user.exec field in the kubeconfig.
The plugin will then be supplied this cluster-specific information in the KUBERNETES_EXEC_INFO environment variable.
Information from this environment variable can be used to perform cluster-specific credential acquisition logic.
The following ExecCredential manifest describes a cluster information sample.
{
"apiVersion": "client.authentication.k8s.io/v1",
"kind": "ExecCredential",
"spec": {
"cluster": {
"server": "https://172.17.4.100:6443",
"certificate-authority-data": "LS0t...",
"config": {
"arbitrary": "config",
"this": "can be provided via the KUBERNETES_EXEC_INFO environment variable upon setting provideClusterInfo",
"you": ["can", "put", "anything", "here"]
}
},
"interactive": true
}
}
{
"apiVersion": "client.authentication.k8s.io/v1beta1",
"kind": "ExecCredential",
"spec": {
"cluster": {
"server": "https://172.17.4.100:6443",
"certificate-authority-data": "LS0t...",
"config": {
"arbitrary": "config",
"this": "can be provided via the KUBERNETES_EXEC_INFO environment variable upon setting provideClusterInfo",
"you": ["can", "put", "anything", "here"]
}
},
"interactive": true
}
}
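A plugin written as a shell script could, for example, extract the server address and the interactive flag from this input (a sketch; assumes the jq tool is available):
# Read fields from the ExecCredential passed in via KUBERNETES_EXEC_INFO.
server=$(echo "$KUBERNETES_EXEC_INFO" | jq -r '.spec.cluster.server')
interactive=$(echo "$KUBERNETES_EXEC_INFO" | jq -r '.spec.interactive')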
What's next
6.3.2 - Authenticating with Bootstrap Tokens
Kubernetes v1.18 [stable]
Bootstrap tokens are simple bearer tokens meant to be used when creating new clusters or joining new nodes to an existing cluster. They were built to support kubeadm, but can be used in other contexts for users that wish to start clusters without kubeadm. They are also built to work, via RBAC policy, with the Kubelet TLS Bootstrapping system.
Bootstrap Tokens Overview
Bootstrap Tokens are defined with a specific type
(bootstrap.kubernetes.io/token
) of secrets that lives in the kube-system
namespace. These Secrets are then read by the Bootstrap Authenticator in the
API Server. Expired tokens are removed with the TokenCleaner controller in the
Controller Manager. The tokens are also used to create a signature for a
specific ConfigMap used in a "discovery" process through a BootstrapSigner
controller.
Token Format
Bootstrap Tokens take the form of abcdef.0123456789abcdef
. More formally,
they must match the regular expression [a-z0-9]{6}\.[a-z0-9]{16}
.
The first part of the token is the "Token ID" and is considered public information. It is used when referring to a token without leaking the secret part used for authentication. The second part is the "Token Secret" and should only be shared with trusted parties.
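A matching token can be generated with standard tools, for example (a sketch using /dev/urandom; in practice kubeadm token create generates one for you):
token_id=$(LC_ALL=C tr -dc 'a-z0-9' < /dev/urandom | head -c 6)
token_secret=$(LC_ALL=C tr -dc 'a-z0-9' < /dev/urandom | head -c 16)
echo "${token_id}.${token_secret}"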
Enabling Bootstrap Token Authentication
The Bootstrap Token authenticator can be enabled using the following flag on the API server:
--enable-bootstrap-token-auth
When enabled, bootstrapping tokens can be used as bearer token credentials to authenticate requests against the API server.
Authorization: Bearer 07401b.f395accd246ae52d
Tokens authenticate as the username system:bootstrap:<token id>
and are members
of the group system:bootstrappers
. Additional groups may be specified in the
token's Secret.
Expired tokens can be deleted automatically by enabling the tokencleaner
controller on the controller manager.
--controllers=*,tokencleaner
Bootstrap Token Secret Format
Each valid token is backed by a secret in the kube-system
namespace. You can
find the full design doc
here.
Here is what the secret looks like.
apiVersion: v1
kind: Secret
metadata:
# Name MUST be of form "bootstrap-token-<token id>"
name: bootstrap-token-07401b
namespace: kube-system
# Type MUST be 'bootstrap.kubernetes.io/token'
type: bootstrap.kubernetes.io/token
stringData:
# Human readable description. Optional.
description: "The default bootstrap token generated by 'kubeadm init'."
# Token ID and secret. Required.
token-id: 07401b
token-secret: f395accd246ae52d
# Expiration. Optional.
expiration: 2017-03-10T03:22:11Z
# Allowed usages.
usage-bootstrap-authentication: "true"
usage-bootstrap-signing: "true"
# Extra groups to authenticate the token as. Must start with "system:bootstrappers:"
auth-extra-groups: system:bootstrappers:worker,system:bootstrappers:ingress
The type of the secret must be bootstrap.kubernetes.io/token
and the name must
be bootstrap-token-<token id>
. It must also exist in the kube-system
namespace.
The usage-bootstrap-* members indicate what this secret is intended to be used for. A value must be set to true to be enabled.
- usage-bootstrap-authentication indicates that the token can be used to authenticate to the API server as a bearer token.
- usage-bootstrap-signing indicates that the token may be used to sign the cluster-info ConfigMap as described below.
The expiration
field controls the expiry of the token. Expired tokens are
rejected when used for authentication and ignored during ConfigMap signing.
The expiry value is encoded as an absolute UTC time using RFC3339. Enable the
tokencleaner
controller to automatically delete expired tokens.
Token Management with kubeadm
You can use the kubeadm
tool to manage tokens on a running cluster. See the
kubeadm token docs for details.
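For example, to create a token with a 24-hour TTL and then list the cluster's tokens:
kubeadm token create --ttl 24h
kubeadm token list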
ConfigMap Signing
In addition to authentication, the tokens can be used to sign a ConfigMap. This is used early in a cluster bootstrap process before the client trusts the API server. The signed ConfigMap can be authenticated by the shared token.
Enable ConfigMap signing by enabling the bootstrapsigner
controller on the
Controller Manager.
--controllers=*,bootstrapsigner
The ConfigMap that is signed is cluster-info
in the kube-public
namespace.
The typical flow is that a client reads this ConfigMap while unauthenticated and
ignoring TLS errors. It then validates the payload of the ConfigMap by looking
at a signature embedded in the ConfigMap.
The ConfigMap may look like this:
apiVersion: v1
kind: ConfigMap
metadata:
name: cluster-info
namespace: kube-public
data:
jws-kubeconfig-07401b: eyJhbGciOiJIUzI1NiIsImtpZCI6IjA3NDAxYiJ9..tYEfbo6zDNo40MQE07aZcQX2m3EB2rO3NuXtxVMYm9U
kubeconfig: |
apiVersion: v1
clusters:
- cluster:
certificate-authority-data: <really long certificate data>
server: https://10.138.0.2:6443
name: ""
contexts: []
current-context: ""
kind: Config
preferences: {}
users: []
The kubeconfig
member of the ConfigMap is a config file with only the cluster
information filled out. The key thing being communicated here is the
certificate-authority-data
. This may be expanded in the future.
The signature is a JWS signature using the "detached" mode. To validate the
signature, the user should encode the kubeconfig
payload according to JWS
rules (base64 encoded while discarding any trailing =
). That encoded payload
is then used to form a whole JWS by inserting it between the 2 dots. You can
verify the JWS using the HS256
scheme (HMAC-SHA256) with the full token (e.g.
07401b.f395accd246ae52d
) as the shared secret. Users must verify that HS256
is used.
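The verification can be scripted. Here is a sketch, assuming JWS holds the jws-kubeconfig-<token-id> value, PAYLOAD the base64url-encoded kubeconfig payload (trailing = stripped), and TOKEN the full token:
header=${JWS%%..*}   # protected header: the part before the two dots
sig=${JWS##*..}      # signature: the part after the two dots
expected=$(printf '%s.%s' "$header" "$PAYLOAD" \
  | openssl dgst -sha256 -hmac "$TOKEN" -binary \
  | base64 | tr '+/' '-_' | tr -d '=\n')
[ "$sig" = "$expected" ] && echo "signature valid"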
Consult the kubeadm implementation details section for more information.
6.3.3 - Certificate Signing Requests
Kubernetes v1.19 [stable]
The Certificates API enables automation of X.509 credential provisioning by providing a programmatic interface for clients of the Kubernetes API to request and obtain X.509 certificates from a Certificate Authority (CA).
A CertificateSigningRequest (CSR) resource is used to request that a certificate be signed by a denoted signer, after which the request may be approved or denied before finally being signed.
Request signing process
The CertificateSigningRequest resource type allows a client to ask for an X.509 certificate to be issued, based on a signing request.
The CertificateSigningRequest object includes a PEM-encoded PKCS#10 signing request in
the spec.request
field. The CertificateSigningRequest denotes the signer (the
recipient that the request is being made to) using the spec.signerName
field.
Note that spec.signerName is a required field as of API version certificates.k8s.io/v1.
In Kubernetes v1.22 and later, clients may optionally set the spec.expirationSeconds
field to request a particular lifetime for the issued certificate. The minimum valid
value for this field is 600
, i.e. ten minutes.
Once created, a CertificateSigningRequest must be approved before it can be signed.
Depending on the signer selected, a CertificateSigningRequest may be automatically approved
by a controller.
Otherwise, a CertificateSigningRequest must be manually approved either via the REST API (or client-go)
or by running kubectl certificate approve
. Likewise, a CertificateSigningRequest may also be denied,
which tells the configured signer that it must not sign the request.
For certificates that have been approved, the next step is signing. The relevant signing controller
first validates that the signing conditions are met and then creates a certificate.
The signing controller then updates the CertificateSigningRequest, storing the new certificate into
the status.certificate
field of the existing CertificateSigningRequest object. The
status.certificate
field is either empty or contains a X.509 certificate, encoded in PEM format.
The CertificateSigningRequest status.certificate
field is empty until the signer does this.
Once the status.certificate
field has been populated, the request has been completed and clients can now
fetch the signed certificate PEM data from the CertificateSigningRequest resource.
The signers can instead deny certificate signing if the approval conditions are not met.
In order to reduce the number of old CertificateSigningRequest resources left in a cluster, a garbage collection controller runs periodically. The garbage collection removes CertificateSigningRequests that have not changed state for some duration:
- Approved requests: automatically deleted after 1 hour
- Denied requests: automatically deleted after 1 hour
- Failed requests: automatically deleted after 1 hour
- Pending requests: automatically deleted after 24 hours
- All requests: automatically deleted after the issued certificate has expired
Signers
Custom signerNames can also be specified. All signers should provide information about how they work so that clients can predict what will happen to their CSRs. This includes:
- Trust distribution: how trust (CA bundles) are distributed.
- Permitted subjects: any restrictions on and behavior when a disallowed subject is requested.
- Permitted x509 extensions: including IP subjectAltNames, DNS subjectAltNames, Email subjectAltNames, URI subjectAltNames etc, and behavior when a disallowed extension is requested.
- Permitted key usages / extended key usages: any restrictions on and behavior when usages different than the signer-determined usages are specified in the CSR.
- Expiration/certificate lifetime: whether it is fixed by the signer, configurable by the admin, determined by the CSR spec.expirationSeconds field, etc., and the behavior when the signer-determined expiration is different from the CSR spec.expirationSeconds field.
- CA bit allowed/disallowed: and behavior if a CSR contains a request for a CA certificate when the signer does not permit it.
Commonly, the status.certificate
field contains a single PEM-encoded X.509
certificate once the CSR is approved and the certificate is issued. Some
signers store multiple certificates into the status.certificate
field. In
that case, the documentation for the signer should specify the meaning of
additional certificates; for example, this might be the certificate plus
intermediates to be presented during TLS handshakes.
The PKCS#10 signing request format does not have a standard mechanism to specify a certificate expiration or lifetime. The expiration or lifetime therefore has to be set through the spec.expirationSeconds field of the CSR object. The built-in signers use the ClusterSigningDuration configuration option (the --cluster-signing-duration command-line flag of the kube-controller-manager), which defaults to 1 year, as the default when no spec.expirationSeconds is specified. When spec.expirationSeconds is specified, the minimum of spec.expirationSeconds and ClusterSigningDuration is used.
The spec.expirationSeconds field was added in Kubernetes v1.22. Earlier versions of Kubernetes do not honor this field.
Kubernetes API servers prior to v1.22 will silently drop this field when the object is created.
Kubernetes signers
Kubernetes provides built-in signers that each have a well-known signerName
:
- kubernetes.io/kube-apiserver-client: signs certificates that will be honored as client certificates by the API server. Never auto-approved by kube-controller-manager.
- Trust distribution: signed certificates must be honored as client certificates by the API server. The CA bundle is not distributed by any other means.
- Permitted subjects - no subject restrictions, but approvers and signers may choose not to approve or sign. Certain subjects like cluster-admin level users or groups vary between distributions and installations, but deserve additional scrutiny before approval and signing. The CertificateSubjectRestriction admission plugin is enabled by default to restrict system:masters, but it is often not the only cluster-admin subject in a cluster.
- Permitted x509 extensions - honors subjectAltName and key usage extensions and discards other extensions.
- Permitted key usages - must include ["client auth"]. Must not include key usages beyond ["digital signature", "key encipherment", "client auth"].
- Expiration/certificate lifetime - for the kube-controller-manager implementation of this signer, set to the minimum of the --cluster-signing-duration option or, if specified, the spec.expirationSeconds field of the CSR object.
- CA bit allowed/disallowed - not allowed.
- kubernetes.io/kube-apiserver-client-kubelet: signs client certificates that will be honored as client certificates by the API server. May be auto-approved by kube-controller-manager.
- Trust distribution: signed certificates must be honored as client certificates by the API server. The CA bundle is not distributed by any other means.
- Permitted subjects - organizations are exactly ["system:nodes"], common name starts with "system:node:".
- Permitted x509 extensions - honors key usage extensions, forbids subjectAltName extensions and drops other extensions.
- Permitted key usages - exactly ["key encipherment", "digital signature", "client auth"].
- Expiration/certificate lifetime - for the kube-controller-manager implementation of this signer, set to the minimum of the --cluster-signing-duration option or, if specified, the spec.expirationSeconds field of the CSR object.
- CA bit allowed/disallowed - not allowed.
- kubernetes.io/kubelet-serving: signs serving certificates that are honored as a valid kubelet serving certificate by the API server, but has no other guarantees. Never auto-approved by kube-controller-manager.
- Trust distribution: signed certificates must be honored by the API server as valid to terminate connections to a kubelet. The CA bundle is not distributed by any other means.
- Permitted subjects - organizations are exactly ["system:nodes"], common name starts with "system:node:".
- Permitted x509 extensions - honors key usage and DNSName/IPAddress subjectAltName extensions, forbids EmailAddress and URI subjectAltName extensions, drops other extensions. At least one DNS or IP subjectAltName must be present.
- Permitted key usages - exactly ["key encipherment", "digital signature", "server auth"].
- Expiration/certificate lifetime - for the kube-controller-manager implementation of this signer, set to the minimum of the --cluster-signing-duration option or, if specified, the spec.expirationSeconds field of the CSR object.
- CA bit allowed/disallowed - not allowed.
- kubernetes.io/legacy-unknown: has no guarantees for trust at all. Some third-party distributions of Kubernetes may honor client certificates signed by it. The stable CertificateSigningRequest API (version certificates.k8s.io/v1 and later) does not allow the signerName to be set to kubernetes.io/legacy-unknown. Never auto-approved by kube-controller-manager.
- Trust distribution: None. There is no standard trust or distribution for this signer in a Kubernetes cluster.
- Permitted subjects - any
- Permitted x509 extensions - honors subjectAltName and key usage extensions and discards other extensions.
- Permitted key usages - any
- Expiration/certificate lifetime - for the kube-controller-manager implementation of this signer, set to the minimum of the --cluster-signing-duration option or, if specified, the spec.expirationSeconds field of the CSR object.
- CA bit allowed/disallowed - not allowed.
The spec.expirationSeconds field was added in Kubernetes v1.22. Earlier versions of Kubernetes do not honor this field.
Kubernetes API servers prior to v1.22 will silently drop this field when the object is created.
Distribution of trust happens out of band for these signers. Any trust outside of those described above is strictly coincidental. For instance, some distributions may honor kubernetes.io/legacy-unknown as client certificates for the kube-apiserver, but this is not a standard.
None of these usages are related to ServiceAccount token secrets .data[ca.crt]
in any way. That CA bundle is only
guaranteed to verify a connection to the API server using the default service (kubernetes.default.svc
).
Authorization
To allow creating a CertificateSigningRequest and retrieving any CertificateSigningRequest:
- Verbs: create, get, list, watch, group: certificates.k8s.io, resource: certificatesigningrequests
For example:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: csr-creator
rules:
- apiGroups:
- certificates.k8s.io
resources:
- certificatesigningrequests
verbs:
- create
- get
- list
- watch
To allow approving a CertificateSigningRequest:
- Verbs: get, list, watch, group: certificates.k8s.io, resource: certificatesigningrequests
- Verbs: update, group: certificates.k8s.io, resource: certificatesigningrequests/approval
- Verbs: approve, group: certificates.k8s.io, resource: signers, resourceName: <signerNameDomain>/<signerNamePath> or <signerNameDomain>/*
For example:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: csr-approver
rules:
- apiGroups:
- certificates.k8s.io
resources:
- certificatesigningrequests
verbs:
- get
- list
- watch
- apiGroups:
- certificates.k8s.io
resources:
- certificatesigningrequests/approval
verbs:
- update
- apiGroups:
- certificates.k8s.io
resources:
- signers
resourceNames:
- example.com/my-signer-name # example.com/* can be used to authorize for all signers in the 'example.com' domain
verbs:
- approve
To allow signing a CertificateSigningRequest:
- Verbs: get, list, watch, group: certificates.k8s.io, resource: certificatesigningrequests
- Verbs: update, group: certificates.k8s.io, resource: certificatesigningrequests/status
- Verbs: sign, group: certificates.k8s.io, resource: signers, resourceName: <signerNameDomain>/<signerNamePath> or <signerNameDomain>/*
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: csr-signer
rules:
- apiGroups:
- certificates.k8s.io
resources:
- certificatesigningrequests
verbs:
- get
- list
- watch
- apiGroups:
- certificates.k8s.io
resources:
- certificatesigningrequests/status
verbs:
- update
- apiGroups:
- certificates.k8s.io
resources:
- signers
resourceNames:
- example.com/my-signer-name # example.com/* can be used to authorize for all signers in the 'example.com' domain
verbs:
- sign
Normal user
A few steps are required before a normal user can authenticate and invoke an API. First, this user must have a certificate issued by the Kubernetes cluster, and then present that certificate to the Kubernetes API.
Create private key
The following scripts show how to generate a PKI private key and CSR. It is important to set the CN and O attributes of the CSR. CN is the name of the user and O is the group that this user will belong to. You can refer to RBAC for standard groups.
openssl genrsa -out myuser.key 2048
openssl req -new -key myuser.key -out myuser.csr
Create CertificateSigningRequest
Create a CertificateSigningRequest and submit it to a Kubernetes Cluster via kubectl. Below is a script to generate the CertificateSigningRequest.
cat <<EOF | kubectl apply -f -
apiVersion: certificates.k8s.io/v1
kind: CertificateSigningRequest
metadata:
name: myuser
spec:
request: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURSBSRVFVRVNULS0tLS0KTUlJQ1ZqQ0NBVDRDQVFBd0VURVBNQTBHQTFVRUF3d0dZVzVuWld4aE1JSUJJakFOQmdrcWhraUc5dzBCQVFFRgpBQU9DQVE4QU1JSUJDZ0tDQVFFQTByczhJTHRHdTYxakx2dHhWTTJSVlRWMDNHWlJTWWw0dWluVWo4RElaWjBOCnR2MUZtRVFSd3VoaUZsOFEzcWl0Qm0wMUFSMkNJVXBGd2ZzSjZ4MXF3ckJzVkhZbGlBNVhwRVpZM3ExcGswSDQKM3Z3aGJlK1o2MVNrVHF5SVBYUUwrTWM5T1Nsbm0xb0R2N0NtSkZNMUlMRVI3QTVGZnZKOEdFRjJ6dHBoaUlFMwpub1dtdHNZb3JuT2wzc2lHQ2ZGZzR4Zmd4eW8ybmlneFNVekl1bXNnVm9PM2ttT0x1RVF6cXpkakJ3TFJXbWlECklmMXBMWnoyalVnald4UkhCM1gyWnVVV1d1T09PZnpXM01LaE8ybHEvZi9DdS8wYk83c0x0MCt3U2ZMSU91TFcKcW90blZtRmxMMytqTy82WDNDKzBERHk5aUtwbXJjVDBnWGZLemE1dHJRSURBUUFCb0FBd0RRWUpLb1pJaHZjTgpBUUVMQlFBRGdnRUJBR05WdmVIOGR4ZzNvK21VeVRkbmFjVmQ1N24zSkExdnZEU1JWREkyQTZ1eXN3ZFp1L1BVCkkwZXpZWFV0RVNnSk1IRmQycVVNMjNuNVJsSXJ3R0xuUXFISUh5VStWWHhsdnZsRnpNOVpEWllSTmU3QlJvYXgKQVlEdUI5STZXT3FYbkFvczFqRmxNUG5NbFpqdU5kSGxpT1BjTU1oNndLaTZzZFhpVStHYTJ2RUVLY01jSVUyRgpvU2djUWdMYTk0aEpacGk3ZnNMdm1OQUxoT045UHdNMGM1dVJVejV4T0dGMUtCbWRSeEgvbUNOS2JKYjFRQm1HCkkwYitEUEdaTktXTU0xMzhIQXdoV0tkNjVoVHdYOWl4V3ZHMkh4TG1WQzg0L1BHT0tWQW9FNkpsYWFHdTlQVmkKdjlOSjVaZlZrcXdCd0hKbzZXdk9xVlA3SVFjZmg3d0drWm89Ci0tLS0tRU5EIENFUlRJRklDQVRFIFJFUVVFU1QtLS0tLQo=
signerName: kubernetes.io/kube-apiserver-client
expirationSeconds: 86400 # one day
usages:
- client auth
EOF
Some points to note:
- usages has to be 'client auth'.
- expirationSeconds could be made longer (e.g. 864000 for ten days) or shorter (e.g. 3600 for one hour).
- request is the base64 encoded value of the CSR file content. You can get the content using this command:
cat myuser.csr | base64 | tr -d "\n"
Approve certificate signing request
Use kubectl to create a CSR and approve it.
Get the list of CSRs:
kubectl get csr
Approve the CSR:
kubectl certificate approve myuser
Get the certificate
Retrieve the certificate from the CSR:
kubectl get csr/myuser -o yaml
The certificate value is in Base64-encoded format under status.certificate
.
Export the issued certificate from the CertificateSigningRequest.
kubectl get csr myuser -o jsonpath='{.status.certificate}'| base64 -d > myuser.crt
Create Role and RoleBinding
With the certificate created it is time to define the Role and RoleBinding for this user to access Kubernetes cluster resources.
This is a sample command to create a Role for this new user:
kubectl create role developer --verb=create --verb=get --verb=list --verb=update --verb=delete --resource=pods
This is a sample command to create a RoleBinding for this new user:
kubectl create rolebinding developer-binding-myuser --role=developer --user=myuser
Add to kubeconfig
The last step is to add this user into the kubeconfig file.
First, you need to add new credentials:
kubectl config set-credentials myuser --client-key=myuser.key --client-certificate=myuser.crt --embed-certs=true
Then, you need to add the context:
kubectl config set-context myuser --cluster=kubernetes --user=myuser
To test it, change the context to myuser
:
kubectl config use-context myuser
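To verify that the new context works, try an operation permitted by the developer role created above:
kubectl get pods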
Approval or rejection
Control plane automated approval
The kube-controller-manager ships with a built-in approver for certificates with
a signerName of kubernetes.io/kube-apiserver-client-kubelet
that delegates various
permissions on CSRs for node credentials to authorization.
The kube-controller-manager POSTs SubjectAccessReview resources to the API server
in order to check authorization for certificate approval.
Approval or rejection using kubectl
A Kubernetes administrator (with appropriate permissions) can manually approve
(or deny) CertificateSigningRequests by using the kubectl certificate approve
and kubectl certificate deny
commands.
To approve a CSR with kubectl:
kubectl certificate approve <certificate-signing-request-name>
Likewise, to deny a CSR:
kubectl certificate deny <certificate-signing-request-name>
Approval or rejection using the Kubernetes API
Users of the REST API can approve CSRs by submitting an UPDATE request to the approval
subresource of the CSR to be approved. For example, you could write an
operator that watches for a particular
kind of CSR and then sends an UPDATE to approve them.
When you make an approval or rejection request, set either the Approved
or Denied
status condition based on the state you determine:
For Approved
CSRs:
apiVersion: certificates.k8s.io/v1
kind: CertificateSigningRequest
...
status:
conditions:
- lastUpdateTime: "2020-02-08T11:37:35Z"
lastTransitionTime: "2020-02-08T11:37:35Z"
message: Approved by my custom approver controller
reason: ApprovedByMyPolicy # You can set this to any string
type: Approved
For Denied
CSRs:
apiVersion: certificates.k8s.io/v1
kind: CertificateSigningRequest
...
status:
conditions:
- lastUpdateTime: "2020-02-08T11:37:35Z"
lastTransitionTime: "2020-02-08T11:37:35Z"
message: Denied by my custom approver controller
reason: DeniedByMyPolicy # You can set this to any string
type: Denied
It's usual to set status.conditions.reason
to a machine-friendly reason
code using TitleCase; this is a convention but you can set it to anything
you like. If you want to add a note for human consumption, use the
status.conditions.message
field.
Signing
Control plane signer
The Kubernetes control plane implements each of the Kubernetes signers, as part of the kube-controller-manager.
The spec.expirationSeconds field was added in Kubernetes v1.22. Earlier versions of Kubernetes do not honor this field.
Kubernetes API servers prior to v1.22 will silently drop this field when the object is created.
API-based signers
Users of the REST API can sign CSRs by submitting an UPDATE request to the status
subresource of the CSR to be signed.
As part of this request, the status.certificate
field should be set to contain the
signed certificate. This field contains one or more PEM-encoded certificates.
All PEM blocks must have the "CERTIFICATE" label, contain no headers, and the encoded data must be a BER-encoded ASN.1 Certificate structure as described in section 4 of RFC5280.
Example certificate content:
-----BEGIN CERTIFICATE-----
MIIDgjCCAmqgAwIBAgIUC1N1EJ4Qnsd322BhDPRwmg3b/oAwDQYJKoZIhvcNAQEL
BQAwXDELMAkGA1UEBhMCeHgxCjAIBgNVBAgMAXgxCjAIBgNVBAcMAXgxCjAIBgNV
BAoMAXgxCjAIBgNVBAsMAXgxCzAJBgNVBAMMAmNhMRAwDgYJKoZIhvcNAQkBFgF4
MB4XDTIwMDcwNjIyMDcwMFoXDTI1MDcwNTIyMDcwMFowNzEVMBMGA1UEChMMc3lz
dGVtOm5vZGVzMR4wHAYDVQQDExVzeXN0ZW06bm9kZToxMjcuMC4wLjEwggEiMA0G
CSqGSIb3DQEBAQUAA4IBDwAwggEKAoIBAQDne5X2eQ1JcLZkKvhzCR4Hxl9+ZmU3
+e1zfOywLdoQxrPi+o4hVsUH3q0y52BMa7u1yehHDRSaq9u62cmi5ekgXhXHzGmm
kmW5n0itRECv3SFsSm2DSghRKf0mm6iTYHWDHzUXKdm9lPPWoSOxoR5oqOsm3JEh
Q7Et13wrvTJqBMJo1GTwQuF+HYOku0NF/DLqbZIcpI08yQKyrBgYz2uO51/oNp8a
sTCsV4OUfyHhx2BBLUo4g4SptHFySTBwlpRWBnSjZPOhmN74JcpTLB4J5f4iEeA7
2QytZfADckG4wVkhH3C2EJUmRtFIBVirwDn39GXkSGlnvnMgF3uLZ6zNAgMBAAGj
YTBfMA4GA1UdDwEB/wQEAwIFoDATBgNVHSUEDDAKBggrBgEFBQcDAjAMBgNVHRMB
Af8EAjAAMB0GA1UdDgQWBBTREl2hW54lkQBDeVCcd2f2VSlB1DALBgNVHREEBDAC
ggAwDQYJKoZIhvcNAQELBQADggEBABpZjuIKTq8pCaX8dMEGPWtAykgLsTcD2jYr
L0/TCrqmuaaliUa42jQTt2OVsVP/L8ofFunj/KjpQU0bvKJPLMRKtmxbhXuQCQi1
qCRkp8o93mHvEz3mTUN+D1cfQ2fpsBENLnpS0F4G/JyY2Vrh19/X8+mImMEK5eOy
o0BMby7byUj98WmcUvNCiXbC6F45QTmkwEhMqWns0JZQY+/XeDhEcg+lJvz9Eyo2
aGgPsye1o3DpyXnyfJWAWMhOz7cikS5X2adesbgI86PhEHBXPIJ1v13ZdfCExmdd
M1fLPhLyR54fGaY+7/X8P9AZzPefAkwizeXwe9ii6/a08vWoiE4=
-----END CERTIFICATE-----
Non-PEM content may appear before or after the CERTIFICATE PEM blocks and is unvalidated, to allow for explanatory text as described in section 5.2 of RFC7468.
When encoded in JSON or YAML, this field is base-64 encoded. A CertificateSigningRequest containing the example certificate above would look like this:
apiVersion: certificates.k8s.io/v1
kind: CertificateSigningRequest
...
status:
certificate: "LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JS..."
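The base64 value can be produced from a PEM file with standard tools, for example (a sketch; cert.pem is an illustrative file name):
cat cert.pem | base64 | tr -d "\n"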
What's next
- Read Manage TLS Certificates in a Cluster
- View the source code for the kube-controller-manager built in signer
- View the source code for the kube-controller-manager built in approver
- For details of X.509 itself, refer to RFC 5280 section 3.1
- For information on the syntax of PKCS#10 certificate signing requests, refer to RFC 2986
6.3.4 - Using Admission Controllers
This page provides an overview of Admission Controllers.
What are they?
An admission controller is a piece of code that intercepts requests to the
Kubernetes API server prior to persistence of the object, but after the request
is authenticated and authorized. The controllers consist of the
list below, are compiled into the
kube-apiserver
binary, and may only be configured by the cluster
administrator. In that list, there are two special controllers:
MutatingAdmissionWebhook and ValidatingAdmissionWebhook. These execute the
mutating and validating (respectively)
admission control webhooks
which are configured in the API.
Admission controllers may be "validating", "mutating", or both. Mutating controllers may modify objects related to the requests they admit; validating controllers may not.
Admission controllers limit requests to create, delete, modify objects or connect to proxy. They do not limit requests to read objects.
The admission control process proceeds in two phases. In the first phase, mutating admission controllers are run. In the second phase, validating admission controllers are run. Note again that some of the controllers are both.
If any of the controllers in either phase reject the request, the entire request is rejected immediately and an error is returned to the end-user.
Finally, in addition to sometimes mutating the object in question, admission controllers may sometimes have side effects, that is, mutate related resources as part of request processing. Incrementing quota usage is the canonical example of why this is necessary. Any such side-effect needs a corresponding reclamation or reconciliation process, as a given admission controller does not know for sure that a given request will pass all of the other admission controllers.
Why do I need them?
Many advanced features in Kubernetes require an admission controller to be enabled in order to properly support the feature. As a result, a Kubernetes API server that is not properly configured with the right set of admission controllers is an incomplete server and will not support all the features you expect.
How do I turn on an admission controller?
The Kubernetes API server flag enable-admission-plugins takes a comma-delimited list of admission control plugins to invoke prior to modifying objects in the cluster. For example, the following command line enables the NamespaceLifecycle and the LimitRanger admission control plugins:
kube-apiserver --enable-admission-plugins=NamespaceLifecycle,LimitRanger ...
How do I turn off an admission controller?
The Kubernetes API server flag disable-admission-plugins
takes a comma-delimited list of admission control plugins to be disabled, even if they are in the list of plugins enabled by default.
kube-apiserver --disable-admission-plugins=PodNodeSelector,AlwaysDeny ...
Which plugins are enabled by default?
To see which admission plugins are enabled:
kube-apiserver -h | grep enable-admission-plugins
In the current version, the default ones are:
CertificateApproval, CertificateSigning, CertificateSubjectRestriction, DefaultIngressClass, DefaultStorageClass, DefaultTolerationSeconds, LimitRanger, MutatingAdmissionWebhook, NamespaceLifecycle, PersistentVolumeClaimResize, Priority, ResourceQuota, RuntimeClass, ServiceAccount, StorageObjectInUseProtection, TaintNodesByCondition, ValidatingAdmissionWebhook
What does each admission controller do?
AlwaysAdmit
Kubernetes v1.13 [deprecated]
This admission controller allows all pods into the cluster. It is deprecated because its behavior is the same as if there were no admission controller at all.
AlwaysDeny
Kubernetes v1.13 [deprecated]
Rejects all requests. AlwaysDeny is DEPRECATED as it has no real meaning.
AlwaysPullImages
This admission controller modifies every new Pod to force the image pull policy to Always. This is useful in a multitenant cluster so that users can be assured that their private images can only be used by those who have the credentials to pull them. Without this admission controller, once an image has been pulled to a node, any pod from any user can use it by knowing the image's name (assuming the Pod is scheduled onto the right node), without any authorization check against the image. When this admission controller is enabled, images are always pulled prior to starting containers, which means valid credentials are required.
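As a sketch (the Pod name and image below are hypothetical), a Pod submitted with a different pull policy is persisted with the policy rewritten:
# Sketch only: what AlwaysPullImages changes on an incoming Pod.
apiVersion: v1
kind: Pod
metadata:
  name: private-app   # hypothetical name
spec:
  containers:
  - name: app
    image: registry.example.com/team/app:v1   # hypothetical private image
    imagePullPolicy: IfNotPresent   # as submitted by the user
    # After admission, the stored object instead has:
    # imagePullPolicy: Always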
CertificateApproval
This admission controller observes requests to 'approve' CertificateSigningRequest resources and performs additional authorization checks to ensure the approving user has permission to approve certificate requests with the spec.signerName requested on the CertificateSigningRequest resource.
See Certificate Signing Requests for more information on the permissions required to perform different actions on CertificateSigningRequest resources.
CertificateSigning
This admission controller observes updates to the status.certificate field of CertificateSigningRequest resources and performs additional authorization checks to ensure the signing user has permission to sign certificate requests with the spec.signerName requested on the CertificateSigningRequest resource.
See Certificate Signing Requests for more information on the permissions required to perform different actions on CertificateSigningRequest resources.
CertificateSubjectRestriction
This admission controller observes creation of CertificateSigningRequest resources that have a spec.signerName of kubernetes.io/kube-apiserver-client. It rejects any request that specifies a 'group' (or 'organization attribute') of system:masters.
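For illustration, a hedged sketch (the name and request payload are placeholders): a CSR like the following is rejected if the encoded PKCS#10 request carries an organization of system:masters:
apiVersion: certificates.k8s.io/v1
kind: CertificateSigningRequest
metadata:
  name: example-csr   # hypothetical name
spec:
  signerName: kubernetes.io/kube-apiserver-client
  # Rejected if the base64-encoded PKCS#10 request below includes O=system:masters
  request: <base64-encoded PKCS#10 certificate request>
  usages:
  - client auth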
DefaultIngressClass
This admission controller observes creation of Ingress objects that do not request any specific ingress class and automatically adds a default ingress class to them. This way, users that do not request any special ingress class do not need to care about them at all and they will get the default one.
This admission controller does not do anything when no default ingress class is configured. When more than one ingress class is marked as default, it rejects any creation of Ingress with an error and an administrator must revisit their IngressClass objects and mark only one as default (with the annotation "ingressclass.kubernetes.io/is-default-class"). This admission controller ignores any Ingress updates; it acts only on creation.
See the ingress documentation for more about ingress classes and how to mark one as default.
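As a sketch (names are hypothetical), an administrator marks one IngressClass as the default with the annotation mentioned above:
apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
  name: example-default   # hypothetical name
  annotations:
    ingressclass.kubernetes.io/is-default-class: "true"
spec:
  controller: example.com/ingress-controller   # hypothetical controller name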
DefaultStorageClass
This admission controller observes creation of PersistentVolumeClaim objects that do not request any specific storage class and automatically adds a default storage class to them. This way, users that do not request any special storage class do not need to care about them at all and they will get the default one.
This admission controller does not do anything when no default storage class is configured. When more than one storage class is marked as default, it rejects any creation of PersistentVolumeClaim with an error and an administrator must revisit their StorageClass objects and mark only one as default. This admission controller ignores any PersistentVolumeClaim updates; it acts only on creation.
See persistent volume documentation about persistent volume claims and storage classes and how to mark a storage class as default.
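As a sketch (the name and provisioner are hypothetical), a StorageClass is marked as the cluster default with the storageclass.kubernetes.io/is-default-class annotation:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard   # hypothetical name
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: kubernetes.io/no-provisioner   # substitute your provisioner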
DefaultTolerationSeconds
This admission controller sets the default forgiveness toleration for pods to tolerate the taints notready:NoExecute and unreachable:NoExecute, based on the kube-apiserver input parameters default-not-ready-toleration-seconds and default-unreachable-toleration-seconds, if the pods don't already have a toleration for the taints node.kubernetes.io/not-ready:NoExecute or node.kubernetes.io/unreachable:NoExecute. The default value for default-not-ready-toleration-seconds and default-unreachable-toleration-seconds is 5 minutes.
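A sketch of the tolerations this controller injects into a Pod that lacks them, using the 5 minute defaults described above:
tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300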
DenyEscalatingExec
Kubernetes v1.13 [deprecated]
This admission controller will deny exec and attach commands to pods that run with escalated privileges that allow host access. This includes pods that run as privileged, have access to the host IPC namespace, and have access to the host PID namespace.
The DenyEscalatingExec admission plugin is deprecated.
Use of a policy-based admission plugin (like PodSecurityPolicy or a custom admission plugin) which can be targeted at specific users or Namespaces and also protects against creation of overly privileged Pods is recommended instead.
DenyExecOnPrivileged
Kubernetes v1.13 [deprecated]
This admission controller will intercept all requests to exec a command in a pod if that pod has a privileged container.
This functionality has been merged into DenyEscalatingExec. The DenyExecOnPrivileged admission plugin is deprecated.
Use of a policy-based admission plugin (like PodSecurityPolicy or a custom admission plugin) which can be targeted at specific users or Namespaces and also protects against creation of overly privileged Pods is recommended instead.
DenyServiceExternalIPs
This admission controller rejects all net-new usage of the Service field externalIPs. This feature is very powerful (it allows network traffic interception) and is not well controlled by policy. When enabled, users of the cluster may not create new Services which use externalIPs and may not add new values to externalIPs on existing Service objects. Existing uses of externalIPs are not affected, and users may remove values from externalIPs on existing Service objects.
Most users do not need this feature at all, and cluster admins should consider disabling it. Clusters that do need to use this feature should consider using some custom policy to manage usage of it.
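As a sketch (the names are hypothetical and the address is a documentation-range IP), a Service like this would be rejected when the controller is enabled:
apiVersion: v1
kind: Service
metadata:
  name: legacy-svc   # hypothetical name
spec:
  selector:
    app: legacy
  ports:
  - port: 80
  externalIPs:
  - 203.0.113.10   # setting externalIPs triggers the rejection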
EventRateLimit
Kubernetes v1.13 [alpha]
This admission controller mitigates the problem where the API server gets flooded by event requests. The cluster admin can specify event rate limits by:
- Enabling the EventRateLimit admission controller;
- Referencing an EventRateLimit configuration file from the file provided to the API server's command line flag --admission-control-config-file:
apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
- name: EventRateLimit
path: eventconfig.yaml
...
# Deprecated in v1.17 in favor of apiserver.config.k8s.io/v1
apiVersion: apiserver.k8s.io/v1alpha1
kind: AdmissionConfiguration
plugins:
- name: EventRateLimit
path: eventconfig.yaml
...
There are four types of limits that can be specified in the configuration:
- Server: All event requests received by the API server share a single bucket.
- Namespace: Each namespace has a dedicated bucket.
- User: Each user is allocated a bucket.
- SourceAndObject: A bucket is assigned by each combination of source and involved object of the event.
Below is a sample eventconfig.yaml for such a configuration:
apiVersion: eventratelimit.admission.k8s.io/v1alpha1
kind: Configuration
limits:
- type: Namespace
qps: 50
burst: 100
cacheSize: 2000
- type: User
qps: 10
burst: 50
See the EventRateLimit proposal for more details.
ExtendedResourceToleration
This plug-in facilitates creation of dedicated nodes with extended resources. If operators want to create dedicated nodes with extended resources (like GPUs, FPGAs etc.), they are expected to taint the node with the extended resource name as the key. This admission controller, if enabled, automatically adds tolerations for such taints to pods requesting extended resources, so users don't have to manually add these tolerations.
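A sketch under these assumptions (extended resource name nvidia.com/gpu; pod name and image hypothetical): a Pod requesting the extended resource has a matching toleration added for the operator-applied taint:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-job   # hypothetical name
spec:
  containers:
  - name: cuda
    image: example.com/cuda-app:v1   # hypothetical image
    resources:
      limits:
        nvidia.com/gpu: 1
  # Added by ExtendedResourceToleration (sketch) for a node tainted with
  # key nvidia.com/gpu:
  # tolerations:
  # - key: nvidia.com/gpu
  #   operator: Exists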
ImagePolicyWebhook
The ImagePolicyWebhook admission controller allows a backend webhook to make admission decisions.
Configuration File Format
ImagePolicyWebhook uses a configuration file to set options for the behavior of the backend. This file may be json or yaml and has the following format:
imagePolicy:
kubeConfigFile: /path/to/kubeconfig/for/backend
# time in s to cache approval
allowTTL: 50
# time in s to cache denial
denyTTL: 50
# time in ms to wait between retries
retryBackoff: 500
# determines behavior if the webhook backend fails
defaultAllow: true
Reference the ImagePolicyWebhook configuration file from the file provided to the API server's command line flag --admission-control-config-file:
apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
- name: ImagePolicyWebhook
path: imagepolicyconfig.yaml
...
# Deprecated in v1.17 in favor of apiserver.config.k8s.io/v1
apiVersion: apiserver.k8s.io/v1alpha1
kind: AdmissionConfiguration
plugins:
- name: ImagePolicyWebhook
path: imagepolicyconfig.yaml
...
Alternatively, you can embed the configuration directly in the file:
apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
- name: ImagePolicyWebhook
configuration:
imagePolicy:
kubeConfigFile: <path-to-kubeconfig-file>
allowTTL: 50
denyTTL: 50
retryBackoff: 500
defaultAllow: true
# Deprecated in v1.17 in favor of apiserver.config.k8s.io/v1
apiVersion: apiserver.k8s.io/v1alpha1
kind: AdmissionConfiguration
plugins:
- name: ImagePolicyWebhook
configuration:
imagePolicy:
kubeConfigFile: <path-to-kubeconfig-file>
allowTTL: 50
denyTTL: 50
retryBackoff: 500
defaultAllow: true
The ImagePolicyWebhook config file must reference a kubeconfig formatted file which sets up the connection to the backend. It is required that the backend communicate over TLS.
The kubeconfig file's cluster field must point to the remote service, and the user field must contain the returned authorizer.
# clusters refers to the remote service.
clusters:
- name: name-of-remote-imagepolicy-service
cluster:
certificate-authority: /path/to/ca.pem # CA for verifying the remote service.
server: https://images.example.com/policy # URL of remote service to query. Must use 'https'.
# users refers to the API server's webhook configuration.
users:
- name: name-of-api-server
user:
client-certificate: /path/to/cert.pem # cert for the webhook admission controller to use
client-key: /path/to/key.pem # key matching the cert
For additional HTTP configuration, refer to the kubeconfig documentation.
Request Payloads
When faced with an admission decision, the API Server POSTs a JSON serialized imagepolicy.k8s.io/v1alpha1 ImageReview object describing the action. This object contains fields describing the containers being admitted, as well as any pod annotations that match *.image-policy.k8s.io/*.
Note that webhook API objects are subject to the same versioning compatibility rules as other Kubernetes API objects. Implementers should be aware of looser compatibility promises for alpha objects and check the "apiVersion" field of the request to ensure correct deserialization. Additionally, the API Server must enable the imagepolicy.k8s.io/v1alpha1 API extensions group (--runtime-config=imagepolicy.k8s.io/v1alpha1=true).
An example request body:
{
"apiVersion":"imagepolicy.k8s.io/v1alpha1",
"kind":"ImageReview",
"spec":{
"containers":[
{
"image":"myrepo/myimage:v1"
},
{
"image":"myrepo/myimage@sha256:beb6bd6a68f114c1dc2ea4b28db81bdf91de202a9014972bec5e4d9171d90ed"
}
],
"annotations":{
"mycluster.image-policy.k8s.io/ticket-1234": "break-glass"
},
"namespace":"mynamespace"
}
}
The remote service is expected to fill the ImageReviewStatus field of the request and respond to either allow or disallow access. The response body's "spec" field is ignored and may be omitted. A permissive response would return:
{
"apiVersion": "imagepolicy.k8s.io/v1alpha1",
"kind": "ImageReview",
"status": {
"allowed": true
}
}
To disallow access, the service would return:
{
"apiVersion": "imagepolicy.k8s.io/v1alpha1",
"kind": "ImageReview",
"status": {
"allowed": false,
"reason": "image currently blacklisted"
}
}
For further documentation refer to the imagepolicy.v1alpha1 API objects and plugin/pkg/admission/imagepolicy/admission.go.
Extending with Annotations
All annotations on a Pod that match *.image-policy.k8s.io/* are sent to the webhook. Sending annotations allows users who are aware of the image policy backend to send extra information to it, and for different backend implementations to accept different information.
Examples of information you might put here are:
- a request to "break glass" to override a policy, in case of emergency
- a ticket number from a ticket system that documents the break-glass request
- a hint to the policy server as to the imageID of the image being provided, to save it a lookup
In any case, the annotations are provided by the user and are not validated by Kubernetes in any way. In the future, if an annotation is determined to be widely useful, it may be promoted to a named field of ImageReviewSpec.
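A sketch (annotation keys follow the *.image-policy.k8s.io/* pattern shown in the request example above; names are hypothetical) of a Pod carrying such annotations:
apiVersion: v1
kind: Pod
metadata:
  name: emergency-fix   # hypothetical name
  annotations:
    mycluster.image-policy.k8s.io/break-glass: "true"
    mycluster.image-policy.k8s.io/ticket-1234: "break-glass"
spec:
  containers:
  - name: app
    image: myrepo/myimage:v1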
LimitPodHardAntiAffinityTopology
This admission controller denies any pod that defines an AntiAffinity topology key other than kubernetes.io/hostname in requiredDuringSchedulingRequiredDuringExecution.
LimitRanger
This admission controller will observe the incoming request and ensure that it does not violate any of the constraints enumerated in the LimitRange object in a Namespace. If you are using LimitRange objects in your Kubernetes deployment, you MUST use this admission controller to enforce those constraints. LimitRanger can also be used to apply default resource requests to Pods that don't specify any; currently, the default LimitRanger applies a 0.1 CPU requirement to all Pods in the default namespace.
See the limitRange design doc and the example of Limit Range for more details.
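For example, a minimal LimitRange sketch (name and values hypothetical) that LimitRanger would enforce, supplying default CPU requests and limits for containers in the namespace:
apiVersion: v1
kind: LimitRange
metadata:
  name: cpu-defaults   # hypothetical name
  namespace: default
spec:
  limits:
  - type: Container
    defaultRequest:
      cpu: 100m   # applied to containers that specify no request
    default:
      cpu: 500m   # applied to containers that specify no limit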
MutatingAdmissionWebhook
This admission controller calls any mutating webhooks which match the request. Matching webhooks are called in serial; each one may modify the object if it desires.
This admission controller (as implied by the name) only runs in the mutating phase.
If a webhook called by this has side effects (for example, decrementing quota) it must have a reconciliation system, as it is not guaranteed that subsequent webhooks or validating admission controllers will permit the request to finish.
If you disable the MutatingAdmissionWebhook, you must also disable the MutatingWebhookConfiguration object in the admissionregistration.k8s.io/v1 group/version via the --runtime-config flag (both are on by default in versions 1.9 and later).
Use caution when authoring and installing mutating webhooks
- Users may be confused when the objects they try to create are different from what they get back.
- Built-in control loops may break when the objects they try to create are different when read back.
- Setting originally unset fields is less likely to cause problems than overwriting fields set in the original request. Avoid doing the latter.
- Future changes to control loops for built-in resources or third-party resources may break webhooks that work well today. Even when the webhook installation API is finalized, not all possible webhook behaviors will be guaranteed to be supported indefinitely.
NamespaceAutoProvision
This admission controller examines all incoming requests on namespaced resources and checks if the referenced namespace exists. It creates a namespace if it cannot be found. This admission controller is useful in deployments that do not want to restrict creation of a namespace prior to its usage.
NamespaceExists
This admission controller checks all requests on namespaced resources other than Namespace itself. If the namespace referenced from a request doesn't exist, the request is rejected.
NamespaceLifecycle
This admission controller enforces that a Namespace that is undergoing termination cannot have new objects created in it, and ensures that requests in a non-existent Namespace are rejected. This admission controller also prevents deletion of three system reserved namespaces: default, kube-system, and kube-public.
A Namespace deletion kicks off a sequence of operations that remove all objects (pods, services, etc.) in that namespace. In order to enforce integrity of that process, we strongly recommend running this admission controller.
NodeRestriction
This admission controller limits the Node and Pod objects a kubelet can modify. In order to be limited by this admission controller, kubelets must use credentials in the system:nodes group, with a username in the form system:node:<nodeName>. Such kubelets will only be allowed to modify their own Node API object, and only modify Pod API objects that are bound to their node.
In Kubernetes 1.11+, kubelets are not allowed to update or remove taints from their Node API object.
In Kubernetes 1.13+, the NodeRestriction admission plugin prevents kubelets from deleting their Node API object, and enforces kubelet modification of labels under the kubernetes.io/ or k8s.io/ prefixes as follows:
- Prevents kubelets from adding/removing/updating labels with a node-restriction.kubernetes.io/ prefix. This label prefix is reserved for administrators to label their Node objects for workload isolation purposes, and kubelets will not be allowed to modify labels with that prefix.
- Allows kubelets to add/remove/update these labels and label prefixes:
  - kubernetes.io/hostname
  - kubernetes.io/arch
  - kubernetes.io/os
  - beta.kubernetes.io/instance-type
  - node.kubernetes.io/instance-type
  - failure-domain.beta.kubernetes.io/region (deprecated)
  - failure-domain.beta.kubernetes.io/zone (deprecated)
  - topology.kubernetes.io/region
  - topology.kubernetes.io/zone
  - kubelet.kubernetes.io/-prefixed labels
  - node.kubernetes.io/-prefixed labels
Use of any other labels under the kubernetes.io or k8s.io prefixes by kubelets is reserved, and may be disallowed or allowed by the NodeRestriction admission plugin in the future.
Future versions may add additional restrictions to ensure kubelets have the minimal set of permissions required to operate correctly.
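A sketch (the node name, label key suffix, and value are hypothetical) of an administrator-applied isolation label that kubelets cannot add, modify, or remove under NodeRestriction:
apiVersion: v1
kind: Node
metadata:
  name: worker-1   # hypothetical node name
  labels:
    node-restriction.kubernetes.io/team: payments   # hypothetical isolation label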
OwnerReferencesPermissionEnforcement
This admission controller protects the access to the metadata.ownerReferences of an object so that only users with "delete" permission to the object can change it. This admission controller also protects the access to metadata.ownerReferences[x].blockOwnerDeletion of an object, so that only users with "update" permission to the finalizers subresource of the referenced owner can change it.
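A sketch of the guarded fields (names and UID are placeholders): changing the ownerReferences below requires "delete" permission on this object, and setting blockOwnerDeletion requires "update" permission on the owner's finalizers subresource:
metadata:
  name: example-pod   # hypothetical name
  ownerReferences:
  - apiVersion: apps/v1
    kind: ReplicaSet
    name: example-replicaset   # hypothetical owner
    uid: <uid-of-owner>
    controller: true
    blockOwnerDeletion: true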
PersistentVolumeClaimResize
This admission controller implements additional validations for checking incoming PersistentVolumeClaim resize requests. The ExpandPersistentVolumes feature gate must be set to true to enable resizing.
After enabling the ExpandPersistentVolumes feature gate, enabling the PersistentVolumeClaimResize admission controller is recommended, too. This admission controller prevents resizing of all claims by default unless a claim's StorageClass explicitly enables resizing by setting allowVolumeExpansion to true.
For example: all PersistentVolumeClaims created from the following StorageClass support volume expansion:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: gluster-vol-default
provisioner: kubernetes.io/glusterfs
parameters:
resturl: "http://192.168.10.100:8080"
restuser: ""
secretNamespace: ""
secretName: ""
allowVolumeExpansion: true
For more information about persistent volume claims, see PersistentVolumeClaims.
PersistentVolumeLabel
Kubernetes v1.13 [deprecated]
This admission controller automatically attaches region or zone labels to PersistentVolumes as defined by the cloud provider (for example, GCE or AWS). It helps ensure the Pods and the PersistentVolumes mounted are in the same region and/or zone. If the admission controller doesn't support automatic labelling of your PersistentVolumes, you may need to add the labels manually to prevent pods from mounting volumes from a different zone. PersistentVolumeLabel is DEPRECATED and labeling persistent volumes has been taken over by the cloud-controller-manager. Starting from 1.11, this admission controller is disabled by default.
PodNodeSelector
Kubernetes v1.5 [alpha]
This admission controller defaults and limits what node selectors may be used within a namespace by reading a namespace annotation and a global configuration.
Configuration File Format
PodNodeSelector uses a configuration file to set options for the behavior of the backend. Note that the configuration file format will move to a versioned file in a future release. This file may be json or yaml and has the following format:
podNodeSelectorPluginConfig:
clusterDefaultNodeSelector: name-of-node-selector
namespace1: name-of-node-selector
namespace2: name-of-node-selector
Reference the PodNodeSelector configuration file from the file provided to the API server's command line flag --admission-control-config-file:
apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
- name: PodNodeSelector
path: podnodeselector.yaml
...
# Deprecated in v1.17 in favor of apiserver.config.k8s.io/v1
apiVersion: apiserver.k8s.io/v1alpha1
kind: AdmissionConfiguration
plugins:
- name: PodNodeSelector
path: podnodeselector.yaml
...
Configuration Annotation Format
PodNodeSelector uses the annotation key scheduler.alpha.kubernetes.io/node-selector to assign node selectors to namespaces.
apiVersion: v1
kind: Namespace
metadata:
annotations:
scheduler.alpha.kubernetes.io/node-selector: name-of-node-selector
name: namespace3
Internal Behavior
This admission controller has the following behavior:
- If the Namespace has an annotation with the key scheduler.alpha.kubernetes.io/node-selector, use its value as the node selector.
- If the namespace lacks such an annotation, use the clusterDefaultNodeSelector defined in the PodNodeSelector plugin configuration file as the node selector.
- Evaluate the pod's node selector against the namespace node selector for conflicts. Conflicts result in rejection.
- Evaluate the pod's node selector against the namespace-specific allowed selector defined in the plugin configuration file. Conflicts result in rejection.
PodSecurity
Kubernetes v1.23 [beta]
This is the replacement for the deprecated PodSecurityPolicy admission controller defined in the next section. This admission controller acts on creation and modification of the pod and determines if it should be admitted based on the requested security context and the Pod Security Standards.
See the Pod Security Admission documentation for more information.
PodSecurityPolicy
Kubernetes v1.21 [deprecated]
This admission controller acts on creation and modification of the pod and determines if it should be admitted based on the requested security context and the available Pod Security Policies.
See also Pod Security Policy documentation for more information.
PodTolerationRestriction
Kubernetes v1.7 [alpha]
The PodTolerationRestriction admission controller verifies any conflict between tolerations of a pod and the tolerations of its namespace. It rejects the pod request if there is a conflict. It then merges the tolerations annotated on the namespace into the tolerations of the pod. The resulting tolerations are checked against a list of allowed tolerations annotated to the namespace. If the check succeeds, the pod request is admitted; otherwise, it is rejected.
If the namespace of the pod does not have any associated default tolerations or allowed tolerations annotated, the cluster-level default tolerations or cluster-level list of allowed tolerations are used instead if they are specified.
Tolerations to a namespace are assigned via the scheduler.alpha.kubernetes.io/defaultTolerations annotation key. The list of allowed tolerations can be added via the scheduler.alpha.kubernetes.io/tolerationsWhitelist annotation key.
Example for namespace annotations:
apiVersion: v1
kind: Namespace
metadata:
name: apps-that-need-nodes-exclusively
annotations:
scheduler.alpha.kubernetes.io/defaultTolerations: '[{"operator": "Exists", "effect": "NoSchedule", "key": "dedicated-node"}]'
scheduler.alpha.kubernetes.io/tolerationsWhitelist: '[{"operator": "Exists", "effect": "NoSchedule", "key": "dedicated-node"}]'
Priority
The priority admission controller uses the priorityClassName field and populates the integer value of the priority. If the priority class is not found, the Pod is rejected.
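A sketch (names and image hypothetical): a Pod referencing a PriorityClass has the class's integer value copied into spec.priority, and is rejected if the class does not exist:
apiVersion: v1
kind: Pod
metadata:
  name: important-app   # hypothetical name
spec:
  priorityClassName: high-priority   # must refer to an existing PriorityClass
  containers:
  - name: app
    image: example.com/app:v1   # hypothetical image
  # After admission (sketch), the controller populates:
  # priority: <integer value of the high-priority class>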
ResourceQuota
This admission controller will observe the incoming request and ensure that it does not violate any of the constraints enumerated in the ResourceQuota object in a Namespace. If you are using ResourceQuota objects in your Kubernetes deployment, you MUST use this admission controller to enforce quota constraints.
See the resourceQuota design doc and the example of Resource Quota for more details.
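For example, a minimal ResourceQuota sketch (name and values hypothetical) whose constraints this admission controller enforces on every create request in the namespace:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota   # hypothetical name
  namespace: default
spec:
  hard:
    pods: "10"
    requests.cpu: "4"
    requests.memory: 8Gi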
RuntimeClass
Kubernetes v1.20 [stable]
If you enable the PodOverhead feature gate, and define a RuntimeClass with Pod overhead configured, this admission controller checks incoming Pods. When enabled, this admission controller rejects any Pod create requests that have the overhead already set. For Pods that have a RuntimeClass configured and selected in their .spec, this admission controller sets .spec.overhead in the Pod based on the value defined in the corresponding RuntimeClass.
The .spec.overhead field for Pod and the .overhead field for RuntimeClass are both in beta. If you do not enable the PodOverhead feature gate, all Pods are treated as if .spec.overhead is unset.
See also Pod Overhead for more information.
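A sketch (name, handler, and values hypothetical) of a RuntimeClass with Pod overhead; Pods selecting it via spec.runtimeClassName get spec.overhead populated from the values below:
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: sandboxed   # hypothetical name
handler: runsc      # hypothetical runtime handler configured on the nodes
overhead:
  podFixed:
    cpu: 250m
    memory: 120Mi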
SecurityContextDeny
This admission controller will deny any pod that attempts to set certain escalating SecurityContext fields, as shown in the Configure a Security Context for a Pod or Container task. This should be enabled if a cluster doesn't utilize pod security policies to restrict the set of values a security context can take.
ServiceAccount
This admission controller implements automation for serviceAccounts. We strongly recommend using this admission controller if you intend to make use of Kubernetes ServiceAccount objects.
StorageObjectInUseProtection
The StorageObjectInUseProtection plugin adds the kubernetes.io/pvc-protection or kubernetes.io/pv-protection finalizers to newly created Persistent Volume Claims (PVCs) or Persistent Volumes (PVs). If a user deletes a PVC or PV, the PVC or PV is not removed until the finalizer is removed from the PVC or PV by the PVC or PV Protection Controller. Refer to Storage Object in Use Protection for more detailed information.
TaintNodesByCondition
Kubernetes v1.17 [stable]
This admission controller taints newly created Nodes as NotReady and NoSchedule. That tainting avoids a race condition that could cause Pods to be scheduled on new Nodes before their taints were updated to accurately reflect their reported conditions.
ValidatingAdmissionWebhook
This admission controller calls any validating webhooks which match the request. Matching webhooks are called in parallel; if any of them rejects the request, the request fails. This admission controller only runs in the validation phase; the webhooks it calls may not mutate the object, as opposed to the webhooks called by the MutatingAdmissionWebhook admission controller.
If a webhook called by this has side effects (for example, decrementing quota) it must have a reconciliation system, as it is not guaranteed that subsequent webhooks or other validating admission controllers will permit the request to finish.
If you disable the ValidatingAdmissionWebhook, you must also disable the ValidatingWebhookConfiguration object in the admissionregistration.k8s.io/v1 group/version via the --runtime-config flag (both are on by default in versions 1.9 and later).
Is there a recommended set of admission controllers to use?
Yes. The recommended admission controllers are enabled by default (shown here), so you do not need to explicitly specify them. You can enable additional admission controllers beyond the default set using the --enable-admission-plugins flag (order doesn't matter).
--admission-control was deprecated in 1.10 and replaced with --enable-admission-plugins.
6.3.5 - Dynamic Admission Control
In addition to compiled-in admission plugins, admission plugins can be developed as extensions and run as webhooks configured at runtime. This page describes how to build, configure, use, and monitor admission webhooks.
What are admission webhooks?
Admission webhooks are HTTP callbacks that receive admission requests and do something with them. You can define two types of admission webhooks, validating admission webhook and mutating admission webhook. Mutating admission webhooks are invoked first, and can modify objects sent to the API server to enforce custom defaults. After all object modifications are complete, and after the incoming object is validated by the API server, validating admission webhooks are invoked and can reject requests to enforce custom policies.
Experimenting with admission webhooks
Admission webhooks are essentially part of the cluster control-plane. You should write and deploy them with great caution. Please read the user guides for instructions if you intend to write/deploy production-grade admission webhooks. In the following, we describe how to quickly experiment with admission webhooks.
Prerequisites
- Ensure that the Kubernetes cluster is at least as new as v1.16 (to use admissionregistration.k8s.io/v1), or v1.9 (to use admissionregistration.k8s.io/v1beta1).
- Ensure that the MutatingAdmissionWebhook and ValidatingAdmissionWebhook admission controllers are enabled. Here is a recommended set of admission controllers to enable in general.
- Ensure that the admissionregistration.k8s.io/v1 or admissionregistration.k8s.io/v1beta1 API is enabled.
Write an admission webhook server
Please refer to the implementation of the admission webhook server that is validated in a Kubernetes e2e test. The webhook handles the AdmissionReview request sent by the API servers, and sends back its decision as an AdmissionReview object in the same version it received.
See the webhook request section for details on the data sent to webhooks.
See the webhook response section for the data expected from webhooks.
The example admission webhook server leaves the ClientAuth field empty, which defaults to NoClientCert. This means that the webhook server does not authenticate the identity of its clients, supposedly API servers. If you need mutual TLS or other ways to authenticate the clients, see how to authenticate apiservers.
Deploy the admission webhook service
The webhook server in the e2e test is deployed in the Kubernetes cluster, via the deployment API. The test also creates a service as the front-end of the webhook server. See code.
You may also deploy your webhooks outside of the cluster. You will need to update your webhook configurations accordingly.
Configure admission webhooks on the fly
You can dynamically configure what resources are subject to what admission webhooks via ValidatingWebhookConfiguration or MutatingWebhookConfiguration.
The following is an example ValidatingWebhookConfiguration; a mutating webhook configuration is similar. See the webhook configuration section for details about each config field.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
name: "pod-policy.example.com"
webhooks:
- name: "pod-policy.example.com"
rules:
- apiGroups: [""]
apiVersions: ["v1"]
operations: ["CREATE"]
resources: ["pods"]
scope: "Namespaced"
clientConfig:
service:
namespace: "example-namespace"
name: "example-service"
caBundle: "Ci0tLS0tQk...<`caBundle` is a PEM encoded CA bundle which will be used to validate the webhook's server certificate.>...tLS0K"
admissionReviewVersions: ["v1", "v1beta1"]
sideEffects: None
timeoutSeconds: 5
# Deprecated in v1.16 in favor of admissionregistration.k8s.io/v1
apiVersion: admissionregistration.k8s.io/v1beta1
kind: ValidatingWebhookConfiguration
metadata:
name: "pod-policy.example.com"
webhooks:
- name: "pod-policy.example.com"
rules:
- apiGroups: [""]
apiVersions: ["v1"]
operations: ["CREATE"]
resources: ["pods"]
scope: "Namespaced"
clientConfig:
service:
namespace: "example-namespace"
name: "example-service"
caBundle: "Ci0tLS0tQk...<`caBundle` is a PEM encoded CA bundle which will be used to validate the webhook's server certificate>...tLS0K"
admissionReviewVersions: ["v1beta1"]
timeoutSeconds: 5
The scope field specifies if only cluster-scoped resources ("Cluster") or namespace-scoped resources ("Namespaced") will match this rule. "*" means that there are no scope restrictions.
When using clientConfig.service, the server cert must be valid for <svc_name>.<svc_namespace>.svc.
The default timeout for a webhook call is 10 seconds for webhooks registered using admissionregistration.k8s.io/v1, and 30 seconds for webhooks created using admissionregistration.k8s.io/v1beta1. Starting in kubernetes 1.14 you can set the timeout and it is encouraged to use a small timeout for webhooks. If the webhook call times out, the request is handled according to the webhook's failure policy.
When an apiserver receives a request that matches one of the rules, the apiserver sends an admissionReview request to the webhook as specified in the clientConfig.
After you create the webhook configuration, the system will take a few seconds to honor the new configuration.
Authenticate apiservers
If your admission webhooks require authentication, you can configure the apiservers to use basic auth, bearer token, or a cert to authenticate itself to the webhooks. There are three steps to complete the configuration.
- When starting the apiserver, specify the location of the admission control configuration file via the --admission-control-config-file flag.
- In the admission control configuration file, specify where the MutatingAdmissionWebhook controller and ValidatingAdmissionWebhook controller should read the credentials. The credentials are stored in kubeConfig files (yes, the same schema that's used by kubectl), so the field name is kubeConfigFile. Here is an example admission control configuration file:
apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
- name: ValidatingAdmissionWebhook
configuration:
apiVersion: apiserver.config.k8s.io/v1
kind: WebhookAdmissionConfiguration
kubeConfigFile: "<path-to-kubeconfig-file>"
- name: MutatingAdmissionWebhook
configuration:
apiVersion: apiserver.config.k8s.io/v1
kind: WebhookAdmissionConfiguration
kubeConfigFile: "<path-to-kubeconfig-file>"
# Deprecated in v1.17 in favor of apiserver.config.k8s.io/v1
apiVersion: apiserver.k8s.io/v1alpha1
kind: AdmissionConfiguration
plugins:
- name: ValidatingAdmissionWebhook
configuration:
# Deprecated in v1.17 in favor of apiserver.config.k8s.io/v1, kind=WebhookAdmissionConfiguration
apiVersion: apiserver.config.k8s.io/v1alpha1
kind: WebhookAdmission
kubeConfigFile: "<path-to-kubeconfig-file>"
- name: MutatingAdmissionWebhook
configuration:
# Deprecated in v1.17 in favor of apiserver.config.k8s.io/v1, kind=WebhookAdmissionConfiguration
apiVersion: apiserver.config.k8s.io/v1alpha1
kind: WebhookAdmission
kubeConfigFile: "<path-to-kubeconfig-file>"
For more information about AdmissionConfiguration, see the AdmissionConfiguration (v1) reference. See the webhook configuration section for details about each config field.
- In the kubeConfig file, provide the credentials:
apiVersion: v1
kind: Config
users:
# name should be set to the DNS name of the service or the host (including port) of the URL the webhook is configured to speak to.
# If a non-443 port is used for services, it must be included in the name when configuring 1.16+ API servers.
#
# For a webhook configured to speak to a service on the default port (443), specify the DNS name of the service:
# - name: webhook1.ns1.svc
#   user: ...
#
# For a webhook configured to speak to a service on non-default port (e.g. 8443), specify the DNS name and port of the service in 1.16+:
# - name: webhook1.ns1.svc:8443
#   user: ...
# and optionally create a second stanza using only the DNS name of the service for compatibility with 1.15 API servers:
# - name: webhook1.ns1.svc
#   user: ...
#
# For webhooks configured to speak to a URL, match the host (and port) specified in the webhook's URL. Examples:
# A webhook with `url: https://www.example.com`:
# - name: www.example.com
#   user: ...
#
# A webhook with `url: https://www.example.com:443`:
# - name: www.example.com:443
#   user: ...
#
# A webhook with `url: https://www.example.com:8443`:
# - name: www.example.com:8443
#   user: ...
#
- name: 'webhook1.ns1.svc'
  user:
    client-certificate-data: "<pem encoded certificate>"
    client-key-data: "<pem encoded key>"
# The `name` supports using * to wildcard-match prefixing segments.
- name: '*.webhook-company.org'
  user:
    password: "<password>"
    username: "<name>"
# '*' is the default match.
- name: '*'
  user:
    token: "<token>"
Of course, you need to set up the webhook server to handle these authentication requests.
Webhook request and response
Request
Webhooks are sent as POST requests, with Content-Type: application/json, with an AdmissionReview API object in the admission.k8s.io API group serialized to JSON as the body.
Webhooks can specify what versions of AdmissionReview objects they accept with the admissionReviewVersions field in their configuration:
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
...
webhooks:
- name: my-webhook.example.com
admissionReviewVersions: ["v1", "v1beta1"]
...
admissionReviewVersions is a required field when creating admissionregistration.k8s.io/v1 webhook configurations. Webhooks are required to support at least one AdmissionReview version understood by the current and previous API server.
# Deprecated in v1.16 in favor of admissionregistration.k8s.io/v1
apiVersion: admissionregistration.k8s.io/v1beta1
kind: ValidatingWebhookConfiguration
...
webhooks:
- name: my-webhook.example.com
admissionReviewVersions: ["v1beta1"]
...
If no admissionReviewVersions are specified, the default when creating admissionregistration.k8s.io/v1beta1 webhook configurations is v1beta1.
API servers send the first AdmissionReview version in the admissionReviewVersions list they support. If none of the versions in the list are supported by the API server, the configuration will not be allowed to be created. If an API server encounters a webhook configuration that was previously created and does not support any of the AdmissionReview versions the API server knows how to send, attempts to call to the webhook will fail and be subject to the failure policy.
This example shows the data contained in an AdmissionReview object for a request to update the scale subresource of an apps/v1 Deployment:
{
"apiVersion": "admission.k8s.io/v1",
"kind": "AdmissionReview",
"request": {
# Random uid uniquely identifying this admission call
"uid": "705ab4f5-6393-11e8-b7cc-42010a800002",
# Fully-qualified group/version/kind of the incoming object
"kind": {"group":"autoscaling","version":"v1","kind":"Scale"},
# Fully-qualified group/version/kind of the resource being modified
"resource": {"group":"apps","version":"v1","resource":"deployments"},
# subresource, if the request is to a subresource
"subResource": "scale",
# Fully-qualified group/version/kind of the incoming object in the original request to the API server.
# This only differs from `kind` if the webhook specified `matchPolicy: Equivalent` and the
# original request to the API server was converted to a version the webhook registered for.
"requestKind": {"group":"autoscaling","version":"v1","kind":"Scale"},
# Fully-qualified group/version/kind of the resource being modified in the original request to the API server.
# This only differs from `resource` if the webhook specified `matchPolicy: Equivalent` and the
# original request to the API server was converted to a version the webhook registered for.
"requestResource": {"group":"apps","version":"v1","resource":"deployments"},
# subresource, if the request is to a subresource
# This only differs from `subResource` if the webhook specified `matchPolicy: Equivalent` and the
# original request to the API server was converted to a version the webhook registered for.
"requestSubResource": "scale",
# Name of the resource being modified
"name": "my-deployment",
# Namespace of the resource being modified, if the resource is namespaced (or is a Namespace object)
"namespace": "my-namespace",
# operation can be CREATE, UPDATE, DELETE, or CONNECT
"operation": "UPDATE",
"userInfo": {
# Username of the authenticated user making the request to the API server
"username": "admin",
# UID of the authenticated user making the request to the API server
"uid": "014fbff9a07c",
# Group memberships of the authenticated user making the request to the API server
"groups": ["system:authenticated","my-admin-group"],
# Arbitrary extra info associated with the user making the request to the API server.
# This is populated by the API server authentication layer and should be included
# if any SubjectAccessReview checks are performed by the webhook.
"extra": {
"some-key":["some-value1", "some-value2"]
}
},
# object is the new object being admitted.
# It is null for DELETE operations.
"object": {"apiVersion":"autoscaling/v1","kind":"Scale",...},
# oldObject is the existing object.
# It is null for CREATE and CONNECT operations.
"oldObject": {"apiVersion":"autoscaling/v1","kind":"Scale",...},
# options contains the options for the operation being admitted, like meta.k8s.io/v1 CreateOptions, UpdateOptions, or DeleteOptions.
# It is null for CONNECT operations.
"options": {"apiVersion":"meta.k8s.io/v1","kind":"UpdateOptions",...},
# dryRun indicates the API request is running in dry run mode and will not be persisted.
# Webhooks with side effects should avoid actuating those side effects when dryRun is true.
# See http://k8s.io/docs/reference/using-api/api-concepts/#make-a-dry-run-request for more details.
"dryRun": false
}
}
{
# Deprecated in v1.16 in favor of admission.k8s.io/v1
"apiVersion": "admission.k8s.io/v1beta1",
"kind": "AdmissionReview",
"request": {
# Random uid uniquely identifying this admission call
"uid": "705ab4f5-6393-11e8-b7cc-42010a800002",
# Fully-qualified group/version/kind of the incoming object
"kind": {"group":"autoscaling","version":"v1","kind":"Scale"},
# Fully-qualified group/version/kind of the resource being modified
"resource": {"group":"apps","version":"v1","resource":"deployments"},
# subresource, if the request is to a subresource
"subResource": "scale",
# Fully-qualified group/version/kind of the incoming object in the original request to the API server.
# This only differs from `kind` if the webhook specified `matchPolicy: Equivalent` and the
# original request to the API server was converted to a version the webhook registered for.
# Only sent by v1.15+ API servers.
"requestKind": {"group":"autoscaling","version":"v1","kind":"Scale"},
# Fully-qualified group/version/kind of the resource being modified in the original request to the API server.
# This only differs from `resource` if the webhook specified `matchPolicy: Equivalent` and the
# original request to the API server was converted to a version the webhook registered for.
# Only sent by v1.15+ API servers.
"requestResource": {"group":"apps","version":"v1","resource":"deployments"},
# subresource, if the request is to a subresource
# This only differs from `subResource` if the webhook specified `matchPolicy: Equivalent` and the
# original request to the API server was converted to a version the webhook registered for.
# Only sent by v1.15+ API servers.
"requestSubResource": "scale",
# Name of the resource being modified
"name": "my-deployment",
# Namespace of the resource being modified, if the resource is namespaced (or is a Namespace object)
"namespace": "my-namespace",
# operation can be CREATE, UPDATE, DELETE, or CONNECT
"operation": "UPDATE",
"userInfo": {
# Username of the authenticated user making the request to the API server
"username": "admin",
# UID of the authenticated user making the request to the API server
"uid": "014fbff9a07c",
# Group memberships of the authenticated user making the request to the API server
"groups": ["system:authenticated","my-admin-group"],
# Arbitrary extra info associated with the user making the request to the API server.
# This is populated by the API server authentication layer and should be included
# if any SubjectAccessReview checks are performed by the webhook.
"extra": {
"some-key":["some-value1", "some-value2"]
}
},
# object is the new object being admitted.
# It is null for DELETE operations.
"object": {"apiVersion":"autoscaling/v1","kind":"Scale",...},
# oldObject is the existing object.
# It is null for CREATE and CONNECT operations (and for DELETE operations in API servers prior to v1.15.0)
"oldObject": {"apiVersion":"autoscaling/v1","kind":"Scale",...},
# options contains the options for the operation being admitted, like meta.k8s.io/v1 CreateOptions, UpdateOptions, or DeleteOptions.
# It is null for CONNECT operations.
# Only sent by v1.15+ API servers.
"options": {"apiVersion":"meta.k8s.io/v1","kind":"UpdateOptions",...},
# dryRun indicates the API request is running in dry run mode and will not be persisted.
# Webhooks with side effects should avoid actuating those side effects when dryRun is true.
# See http://k8s.io/docs/reference/using-api/api-concepts/#make-a-dry-run-request for more details.
"dryRun": false
}
}
Response
Webhooks respond with a 200 HTTP status code, Content-Type: application/json, and a body containing an AdmissionReview object (in the same version they were sent), with the response stanza populated, serialized to JSON.
At a minimum, the response stanza must contain the following fields:
- uid, copied from the request.uid sent to the webhook
- allowed, either set to true or false
Example of a minimal response from a webhook to allow a request:
{
"apiVersion": "admission.k8s.io/v1",
"kind": "AdmissionReview",
"response": {
"uid": "<value from request.uid>",
"allowed": true
}
}
{
"apiVersion": "admission.k8s.io/v1beta1",
"kind": "AdmissionReview",
"response": {
"uid": "<value from request.uid>",
"allowed": true
}
}
Example of a minimal response from a webhook to forbid a request:
{
"apiVersion": "admission.k8s.io/v1",
"kind": "AdmissionReview",
"response": {
"uid": "<value from request.uid>",
"allowed": false
}
}
{
"apiVersion": "admission.k8s.io/v1beta1",
"kind": "AdmissionReview",
"response": {
"uid": "<value from request.uid>",
"allowed": false
}
}
When rejecting a request, the webhook can customize the http code and message returned to the user using the status field. The specified status object is returned to the user. See the API documentation for details about the status type.
Example of a response to forbid a request, customizing the HTTP status code and message presented to the user:
{
"apiVersion": "admission.k8s.io/v1",
"kind": "AdmissionReview",
"response": {
"uid": "<value from request.uid>",
"allowed": false,
"status": {
"code": 403,
"message": "You cannot do this because it is Tuesday and your name starts with A"
}
}
}
{
"apiVersion": "admission.k8s.io/v1beta1",
"kind": "AdmissionReview",
"response": {
"uid": "<value from request.uid>",
"allowed": false,
"status": {
"code": 403,
"message": "You cannot do this because it is Tuesday and your name starts with A"
}
}
}
When allowing a request, a mutating admission webhook may optionally modify the incoming object as well. This is done using the patch and patchType fields in the response. The only currently supported patchType is JSONPatch. See JSON patch documentation for more details. For patchType: JSONPatch, the patch field contains a base64-encoded array of JSON patch operations.
As an example, a single patch operation that would set spec.replicas would be [{"op": "add", "path": "/spec/replicas", "value": 3}]. Base64-encoded, this would be W3sib3AiOiAiYWRkIiwgInBhdGgiOiAiL3NwZWMvcmVwbGljYXMiLCAidmFsdWUiOiAzfV0=
So a webhook response applying that patch would be:
{
"apiVersion": "admission.k8s.io/v1",
"kind": "AdmissionReview",
"response": {
"uid": "<value from request.uid>",
"allowed": true,
"patchType": "JSONPatch",
"patch": "W3sib3AiOiAiYWRkIiwgInBhdGgiOiAiL3NwZWMvcmVwbGljYXMiLCAidmFsdWUiOiAzfV0="
}
}
{
"apiVersion": "admission.k8s.io/v1beta1",
"kind": "AdmissionReview",
"response": {
"uid": "<value from request.uid>",
"allowed": true,
"patchType": "JSONPatch",
"patch": "W3sib3AiOiAiYWRkIiwgInBhdGgiOiAiL3NwZWMvcmVwbGljYXMiLCAidmFsdWUiOiAzfV0="
}
}
Starting in v1.19, admission webhooks can optionally return warning messages that are returned to the requesting client in HTTP Warning headers with a warning code of 299. Warnings can be sent with allowed or rejected admission responses.
If you're implementing a webhook that returns a warning:
- Don't include a "Warning:" prefix in the message
- Use warning messages to describe problems the client making the API request should correct or be aware of
- Limit warnings to 120 characters if possible
{
"apiVersion": "admission.k8s.io/v1",
"kind": "AdmissionReview",
"response": {
"uid": "<value from request.uid>",
"allowed": true,
"warnings": [
"duplicate envvar entries specified with name MY_ENV",
"memory request less than 4MB specified for container mycontainer, which will not start successfully"
]
}
}
{
"apiVersion": "admission.k8s.io/v1beta1",
"kind": "AdmissionReview",
"response": {
"uid": "<value from request.uid>",
"allowed": true,
"warnings": [
"duplicate envvar entries specified with name MY_ENV",
"memory request less than 4MB specified for container mycontainer, which will not start successfully"
]
}
}
Webhook configuration
To register admission webhooks, create MutatingWebhookConfiguration or ValidatingWebhookConfiguration API objects. The name of a MutatingWebhookConfiguration or a ValidatingWebhookConfiguration object must be a valid DNS subdomain name.
Each configuration can contain one or more webhooks. If multiple webhooks are specified in a single configuration, each should be given a unique name. This is required in admissionregistration.k8s.io/v1, but strongly recommended when using admissionregistration.k8s.io/v1beta1, in order to make resulting audit logs and metrics easier to match up to active configurations.
Each webhook defines the following things.
Matching requests: rules
Each webhook must specify a list of rules used to determine if a request to the API server should be sent to the webhook. Each rule specifies one or more operations, apiGroups, apiVersions, and resources, and a resource scope:
- operations lists one or more operations to match. Can be "CREATE", "UPDATE", "DELETE", "CONNECT", or "*" to match all.
- apiGroups lists one or more API groups to match. "" is the core API group. "*" matches all API groups.
- apiVersions lists one or more API versions to match. "*" matches all API versions.
- resources lists one or more resources to match.
  - "*" matches all resources, but not subresources.
  - "*/*" matches all resources and subresources.
  - "pods/*" matches all subresources of pods.
  - "*/status" matches all status subresources.
- scope specifies a scope to match. Valid values are "Cluster", "Namespaced", and "*". Subresources match the scope of their parent resource. Supported in v1.14+. Default is "*", matching pre-1.14 behavior.
  - "Cluster" means that only cluster-scoped resources will match this rule (Namespace API objects are cluster-scoped).
  - "Namespaced" means that only namespaced resources will match this rule.
  - "*" means that there are no scope restrictions.
If an incoming request matches one of the specified operations, groups, versions, resources, and scope for any of a webhook's rules, the request is sent to the webhook.
Here are other examples of rules that could be used to specify which resources should be intercepted.
Match CREATE or UPDATE requests to apps/v1 and apps/v1beta1 deployments and replicasets:
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
...
webhooks:
- name: my-webhook.example.com
rules:
- operations: ["CREATE", "UPDATE"]
apiGroups: ["apps"]
apiVersions: ["v1", "v1beta1"]
resources: ["deployments", "replicasets"]
scope: "Namespaced"
...
# Deprecated in v1.16 in favor of admissionregistration.k8s.io/v1
apiVersion: admissionregistration.k8s.io/v1beta1
kind: ValidatingWebhookConfiguration
...
webhooks:
- name: my-webhook.example.com
rules:
- operations: ["CREATE", "UPDATE"]
apiGroups: ["apps"]
apiVersions: ["v1", "v1beta1"]
resources: ["deployments", "replicasets"]
scope: "Namespaced"
...
Match create requests for all resources (but not subresources) in all API groups and versions:
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
...
webhooks:
- name: my-webhook.example.com
rules:
- operations: ["CREATE"]
apiGroups: ["*"]
apiVersions: ["*"]
resources: ["*"]
scope: "*"
...
# Deprecated in v1.16 in favor of admissionregistration.k8s.io/v1
apiVersion: admissionregistration.k8s.io/v1beta1
kind: ValidatingWebhookConfiguration
...
webhooks:
- name: my-webhook.example.com
rules:
- operations: ["CREATE"]
apiGroups: ["*"]
apiVersions: ["*"]
resources: ["*"]
scope: "*"
...
Match update requests for all status subresources in all API groups and versions:
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
...
webhooks:
- name: my-webhook.example.com
rules:
- operations: ["UPDATE"]
apiGroups: ["*"]
apiVersions: ["*"]
resources: ["*/status"]
scope: "*"
...
# Deprecated in v1.16 in favor of admissionregistration.k8s.io/v1
apiVersion: admissionregistration.k8s.io/v1beta1
kind: ValidatingWebhookConfiguration
...
webhooks:
- name: my-webhook.example.com
rules:
- operations: ["UPDATE"]
apiGroups: ["*"]
apiVersions: ["*"]
resources: ["*/status"]
scope: "*"
...
Matching requests: objectSelector
In v1.15+, webhooks may optionally limit which requests are intercepted based on the labels of the objects they would be sent, by specifying an objectSelector. If specified, the objectSelector is evaluated against both the object and oldObject that would be sent to the webhook, and is considered to match if either object matches the selector.
A null object (oldObject in the case of create, or newObject in the case of delete), or an object that cannot have labels (like a DeploymentRollback or a PodProxyOptions object) is not considered to match.
Use the object selector only if the webhook is opt-in, because end users may skip the admission webhook by setting the labels.
This example shows a mutating webhook that would match a CREATE of any resource with the label foo: bar:
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
...
webhooks:
- name: my-webhook.example.com
objectSelector:
matchLabels:
foo: bar
rules:
- operations: ["CREATE"]
apiGroups: ["*"]
apiVersions: ["*"]
resources: ["*"]
scope: "*"
...
# Deprecated in v1.16 in favor of admissionregistration.k8s.io/v1
apiVersion: admissionregistration.k8s.io/v1beta1
kind: MutatingWebhookConfiguration
...
webhooks:
- name: my-webhook.example.com
objectSelector:
matchLabels:
foo: bar
rules:
- operations: ["CREATE"]
apiGroups: ["*"]
apiVersions: ["*"]
resources: ["*"]
scope: "*"
...
See https://kubernetes.io/docs/concepts/overview/working-with-objects/labels for more examples of label selectors.
Matching requests: namespaceSelector
Webhooks may optionally limit which requests for namespaced resources are intercepted, based on the labels of the containing namespace, by specifying a namespaceSelector.
The namespaceSelector decides whether to run the webhook on a request for a namespaced resource (or a Namespace object), based on whether the namespace's labels match the selector. If the object itself is a namespace, the matching is performed on object.metadata.labels. If the object is a cluster-scoped resource other than a Namespace, namespaceSelector has no effect.
This example shows a mutating webhook that matches a CREATE of any namespaced resource inside a namespace that does not have a "runlevel" label of "0" or "1":
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
...
webhooks:
- name: my-webhook.example.com
namespaceSelector:
matchExpressions:
- key: runlevel
operator: NotIn
values: ["0","1"]
rules:
- operations: ["CREATE"]
apiGroups: ["*"]
apiVersions: ["*"]
resources: ["*"]
scope: "Namespaced"
...
# Deprecated in v1.16 in favor of admissionregistration.k8s.io/v1
apiVersion: admissionregistration.k8s.io/v1beta1
kind: MutatingWebhookConfiguration
...
webhooks:
- name: my-webhook.example.com
namespaceSelector:
matchExpressions:
- key: runlevel
operator: NotIn
values: ["0","1"]
rules:
- operations: ["CREATE"]
apiGroups: ["*"]
apiVersions: ["*"]
resources: ["*"]
scope: "Namespaced"
...
This example shows a validating webhook that matches a CREATE of any namespaced resource inside a namespace that is associated with the "environment" of "prod" or "staging":
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
...
webhooks:
- name: my-webhook.example.com
namespaceSelector:
matchExpressions:
- key: environment
operator: In
values: ["prod","staging"]
rules:
- operations: ["CREATE"]
apiGroups: ["*"]
apiVersions: ["*"]
resources: ["*"]
scope: "Namespaced"
...
# Deprecated in v1.16 in favor of admissionregistration.k8s.io/v1
apiVersion: admissionregistration.k8s.io/v1beta1
kind: ValidatingWebhookConfiguration
...
webhooks:
- name: my-webhook.example.com
namespaceSelector:
matchExpressions:
- key: environment
operator: In
values: ["prod","staging"]
rules:
- operations: ["CREATE"]
apiGroups: ["*"]
apiVersions: ["*"]
resources: ["*"]
scope: "Namespaced"
...
See https://kubernetes.io/docs/concepts/overview/working-with-objects/labels for more examples of label selectors.
Matching requests: matchPolicy
API servers can make objects available via multiple API groups or versions. For example, the Kubernetes API server may allow creating and modifying Deployment objects via the extensions/v1beta1, apps/v1beta1, apps/v1beta2, and apps/v1 APIs.
If a webhook only specified a rule for some API groups/versions (like apiGroups:["apps"], apiVersions:["v1","v1beta1"]), and a request was made to modify the resource via another API group/version (like extensions/v1beta1), the request would not be sent to the webhook.
In v1.15+, matchPolicy lets a webhook define how its rules are used to match incoming requests. Allowed values are Exact or Equivalent.
- Exact means a request should be intercepted only if it exactly matches a specified rule.
- Equivalent means a request should be intercepted if it modifies a resource listed in rules, even via another API group or version.
In the example given above, the webhook that only registered for apps/v1 could use matchPolicy:
- matchPolicy: Exact would mean the extensions/v1beta1 request would not be sent to the webhook.
- matchPolicy: Equivalent means the extensions/v1beta1 request would be sent to the webhook (with the objects converted to a version the webhook had specified: apps/v1).
Specifying Equivalent is recommended, and ensures that webhooks continue to intercept the resources they expect when upgrades enable new versions of the resource in the API server.
When a resource stops being served by the API server, it is no longer considered equivalent to other versions of that resource that are still served. For example, extensions/v1beta1 deployments were first deprecated and then removed (in Kubernetes v1.16). Since that removal, a webhook with an apiGroups:["extensions"], apiVersions:["v1beta1"], resources:["deployments"] rule does not intercept deployments created via apps/v1 APIs. For that reason, webhooks should prefer registering for stable versions of resources.
This example shows a validating webhook that intercepts modifications to deployments (no matter the API group or version), and is always sent an apps/v1 Deployment object:
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
...
webhooks:
- name: my-webhook.example.com
matchPolicy: Equivalent
rules:
- operations: ["CREATE","UPDATE","DELETE"]
apiGroups: ["apps"]
apiVersions: ["v1"]
resources: ["deployments"]
scope: "Namespaced"
...
Admission webhooks created using admissionregistration.k8s.io/v1 default to Equivalent.
# Deprecated in v1.16 in favor of admissionregistration.k8s.io/v1
apiVersion: admissionregistration.k8s.io/v1beta1
kind: ValidatingWebhookConfiguration
...
webhooks:
- name: my-webhook.example.com
matchPolicy: Equivalent
rules:
- operations: ["CREATE","UPDATE","DELETE"]
apiGroups: ["apps"]
apiVersions: ["v1"]
resources: ["deployments"]
scope: "Namespaced"
...
Admission webhooks created using admissionregistration.k8s.io/v1beta1 default to Exact.
Contacting the webhook
Once the API server has determined a request should be sent to a webhook,
it needs to know how to contact the webhook. This is specified in the clientConfig
stanza of the webhook configuration.
Webhooks can either be called via a URL or a service reference, and can optionally include a custom CA bundle to use to verify the TLS connection.
URL
url gives the location of the webhook, in standard URL form (scheme://host:port/path).
The host should not refer to a service running in the cluster; use a service reference by specifying the service field instead. The host might be resolved via external DNS in some apiservers (e.g., kube-apiserver cannot resolve in-cluster DNS as that would be a layering violation). host may also be an IP address.
Please note that using localhost or 127.0.0.1 as a host is risky unless you take great care to run this webhook on all hosts which run an apiserver which might need to make calls to this webhook. Such installations are likely to be non-portable and not readily run in a new cluster.
The scheme must be "https"; the URL must begin with "https://".
Attempting to use a user or basic auth (for example "user:password@") is not allowed. Fragments ("#...") and query parameters ("?...") are also not allowed.
Here is an example of a mutating webhook configured to call a URL (it expects the TLS certificate to be verified using system trust roots, so it does not specify a caBundle):
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
...
webhooks:
- name: my-webhook.example.com
clientConfig:
url: "https://my-webhook.example.com:9443/my-webhook-path"
...
# Deprecated in v1.16 in favor of admissionregistration.k8s.io/v1
apiVersion: admissionregistration.k8s.io/v1beta1
kind: MutatingWebhookConfiguration
...
webhooks:
- name: my-webhook.example.com
clientConfig:
url: "https://my-webhook.example.com:9443/my-webhook-path"
...
Service reference
The service stanza inside clientConfig is a reference to the service for this webhook. If the webhook is running within the cluster, then you should use service instead of url.
The service namespace and name are required. The port is optional and defaults to 443. The path is optional and defaults to "/".
Here is an example of a mutating webhook configured to call a service on port "1234" at the subpath "/my-path", and to verify the TLS connection against the ServerName my-service-name.my-service-namespace.svc using a custom CA bundle:
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
...
webhooks:
- name: my-webhook.example.com
clientConfig:
caBundle: "Ci0tLS0tQk...<base64-encoded PEM bundle containing the CA that signed the webhook's serving certificate>...tLS0K"
service:
namespace: my-service-namespace
name: my-service-name
path: /my-path
port: 1234
...
# Deprecated in v1.16 in favor of admissionregistration.k8s.io/v1
apiVersion: admissionregistration.k8s.io/v1beta1
kind: MutatingWebhookConfiguration
...
webhooks:
- name: my-webhook.example.com
clientConfig:
caBundle: "Ci0tLS0tQk...<`caBundle` is a PEM encoded CA bundle which will be used to validate the webhook's server certificate>...tLS0K"
service:
namespace: my-service-namespace
name: my-service-name
path: /my-path
port: 1234
...
Side effects
Webhooks typically operate only on the content of the AdmissionReview
sent to them.
Some webhooks, however, make out-of-band changes as part of processing admission requests.
Webhooks that make out-of-band changes ("side effects") must also have a reconciliation mechanism (like a controller) that periodically determines the actual state of the world, and adjusts the out-of-band data modified by the admission webhook to reflect reality. This is because a call to an admission webhook does not guarantee the admitted object will be persisted as is, or at all. Later webhooks can modify the content of the object, a conflict could be encountered while writing to storage, or the server could power off before persisting the object.
Additionally, webhooks with side effects must skip those side effects when dryRun: true admission requests are handled. A webhook must explicitly indicate that it will not have side effects when run with dryRun, or the dry-run request will not be sent to the webhook and the API request will fail instead.
Webhooks indicate whether they have side effects using the sideEffects field in the webhook configuration:
- Unknown: no information is known about the side effects of calling the webhook. If a request with dryRun: true would trigger a call to this webhook, the request will instead fail, and the webhook will not be called.
- None: calling the webhook will have no side effects.
- Some: calling the webhook will possibly have side effects. If a request with the dry-run attribute would trigger a call to this webhook, the request will instead fail, and the webhook will not be called.
- NoneOnDryRun: calling the webhook will possibly have side effects, but if a request with dryRun: true is sent to the webhook, the webhook will suppress the side effects (the webhook is dryRun-aware).
Allowed values:
- In admissionregistration.k8s.io/v1beta1, sideEffects may be set to Unknown, None, Some, or NoneOnDryRun, and defaults to Unknown.
- In admissionregistration.k8s.io/v1, sideEffects must be set to None or NoneOnDryRun.
Here is an example of a validating webhook indicating it has no side effects on dryRun: true requests:
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
...
webhooks:
- name: my-webhook.example.com
sideEffects: NoneOnDryRun
...
# Deprecated in v1.16 in favor of admissionregistration.k8s.io/v1
apiVersion: admissionregistration.k8s.io/v1beta1
kind: ValidatingWebhookConfiguration
...
webhooks:
- name: my-webhook.example.com
sideEffects: NoneOnDryRun
...
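As a usage sketch, a server-side dry run exercises these webhooks without persisting the object; a NoneOnDryRun webhook must suppress its side effects for such requests (my-manifest.yaml is a placeholder file name):
# Send the request through admission, including webhooks, without persisting it
kubectl apply -f my-manifest.yaml --dry-run=server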
Timeouts
Because webhooks add to API request latency, they should evaluate as quickly as possible. timeoutSeconds allows configuring how long the API server should wait for a webhook to respond before treating the call as a failure.
If the timeout expires before the webhook responds, the webhook call will be ignored or the API call will be rejected based on the failure policy.
The timeout value must be between 1 and 30 seconds.
Here is an example of a validating webhook with a custom timeout of 2 seconds:
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
...
webhooks:
- name: my-webhook.example.com
timeoutSeconds: 2
...
Admission webhooks created using admissionregistration.k8s.io/v1 default timeouts to 10 seconds.
# Deprecated in v1.16 in favor of admissionregistration.k8s.io/v1
apiVersion: admissionregistration.k8s.io/v1beta1
kind: ValidatingWebhookConfiguration
...
webhooks:
- name: my-webhook.example.com
timeoutSeconds: 2
...
Admission webhooks created using admissionregistration.k8s.io/v1beta1 default timeouts to 30 seconds.
Reinvocation policy
A single ordering of mutating admission plugins (including webhooks) does not work for all cases (see https://issue.k8s.io/64333 as an example). A mutating webhook can add a new sub-structure to the object (like adding a container to a pod), and other mutating plugins which have already run may have opinions on those new structures (like setting an imagePullPolicy on all containers).
In v1.15+, to allow mutating admission plugins to observe changes made by other plugins, built-in mutating admission plugins are re-run if a mutating webhook modifies an object, and mutating webhooks can specify a reinvocationPolicy to control whether they are reinvoked as well.
reinvocationPolicy may be set to Never or IfNeeded. It defaults to Never.
- Never: the webhook must not be called more than once in a single admission evaluation.
- IfNeeded: the webhook may be called again as part of the admission evaluation if the object being admitted is modified by other admission plugins after the initial webhook call.
The important elements to note are:
- The number of additional invocations is not guaranteed to be exactly one.
- If additional invocations result in further modifications to the object, webhooks are not guaranteed to be invoked again.
- Webhooks that use this option may be reordered to minimize the number of additional invocations.
- To validate an object after all mutations are guaranteed complete, use a validating admission webhook instead (recommended for webhooks with side-effects).
Here is an example of a mutating webhook opting into being re-invoked if later admission plugins modify the object:
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
...
webhooks:
- name: my-webhook.example.com
reinvocationPolicy: IfNeeded
...
# Deprecated in v1.16 in favor of admissionregistration.k8s.io/v1
apiVersion: admissionregistration.k8s.io/v1beta1
kind: MutatingWebhookConfiguration
...
webhooks:
- name: my-webhook.example.com
reinvocationPolicy: IfNeeded
...
Mutating webhooks must be idempotent, able to successfully process an object they have already admitted and potentially modified. This is true for all mutating admission webhooks, since any change they can make in an object could already exist in the user-provided object, but it is essential for webhooks that opt into reinvocation.
Failure policy
failurePolicy defines how unrecognized errors and timeout errors from the admission webhook are handled. Allowed values are Ignore or Fail.
- Ignore means that an error calling the webhook is ignored and the API request is allowed to continue.
- Fail means that an error calling the webhook causes the admission to fail and the API request to be rejected.
Here is a mutating webhook configured to reject an API request if errors are encountered calling the admission webhook:
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
...
webhooks:
- name: my-webhook.example.com
failurePolicy: Fail
...
Admission webhooks created using admissionregistration.k8s.io/v1 default failurePolicy to Fail.
# Deprecated in v1.16 in favor of admissionregistration.k8s.io/v1
apiVersion: admissionregistration.k8s.io/v1beta1
kind: MutatingWebhookConfiguration
...
webhooks:
- name: my-webhook.example.com
failurePolicy: Fail
...
Admission webhooks created using admissionregistration.k8s.io/v1beta1 default failurePolicy to Ignore.
Monitoring admission webhooks
The API server provides ways to monitor admission webhook behaviors. These monitoring mechanisms help cluster admins to answer questions like:
- Which mutating webhook mutated the object in an API request?
- What change did the mutating webhook apply to the object?
- Which webhooks are frequently rejecting API requests? What's the reason for a rejection?
Mutating webhook auditing annotations
Sometimes it's useful to know which mutating webhook mutated the object in an API request, and what change the webhook applied.
In v1.16+, kube-apiserver performs auditing on each mutating webhook invocation. Each invocation generates an auditing annotation capturing whether the request object was mutated by the invocation, and optionally generates an annotation capturing the applied patch from the webhook admission response. The annotations are set in the audit event for the given request at the given stage of its execution, which is then pre-processed according to a certain policy and written to a backend.
The audit level of an event determines which annotations get recorded:
- At Metadata audit level or higher, an annotation with key mutation.webhook.admission.k8s.io/round_{round idx}_index_{order idx} gets logged with a JSON payload indicating that a webhook was invoked for the given request and whether it mutated the object.
For example, the following annotation gets recorded for a webhook being reinvoked. The webhook is ordered third in the mutating webhook chain, and didn't mutate the request object during the invocation.
# the audit event recorded
{
"kind": "Event",
"apiVersion": "audit.k8s.io/v1",
"annotations": {
"mutation.webhook.admission.k8s.io/round_1_index_2": "{\"configuration\":\"my-mutating-webhook-configuration.example.com\",\"webhook\":\"my-webhook.example.com\",\"mutated\": false}"
# other annotations
...
}
# other fields
...
}
# the annotation value deserialized
{
"configuration": "my-mutating-webhook-configuration.example.com",
"webhook": "my-webhook.example.com",
"mutated": false
}
The following annotation gets recorded for a webhook being invoked in the first round. The webhook is ordered first in the mutating webhook chain, and mutated the request object during the invocation.
# the audit event recorded
{
"kind": "Event",
"apiVersion": "audit.k8s.io/v1",
"annotations": {
"mutation.webhook.admission.k8s.io/round_0_index_0": "{\"configuration\":\"my-mutating-webhook-configuration.example.com\",\"webhook\":\"my-webhook-always-mutate.example.com\",\"mutated\": true}"
# other annotations
...
}
# other fields
...
}
# the annotation value deserialized
{
"configuration": "my-mutating-webhook-configuration.example.com",
"webhook": "my-webhook-always-mutate.example.com",
"mutated": true
}
- At Request audit level or higher, an annotation with key patch.webhook.admission.k8s.io/round_{round idx}_index_{order idx} gets logged with a JSON payload indicating that a webhook was invoked for the given request and what patch was applied to the request object.
For example, the following annotation gets recorded for a webhook being reinvoked. The webhook is ordered fourth in the mutating webhook chain, and responded with a JSON patch which got applied to the request object.
# the audit event recorded
{
"kind": "Event",
"apiVersion": "audit.k8s.io/v1",
"annotations": {
"patch.webhook.admission.k8s.io/round_1_index_3": "{\"configuration\":\"my-other-mutating-webhook-configuration.example.com\",\"webhook\":\"my-webhook-always-mutate.example.com\",\"patch\":[{\"op\":\"add\",\"path\":\"/data/mutation-stage\",\"value\":\"yes\"}],\"patchType\":\"JSONPatch\"}"
# other annotations
...
}
# other fields
...
}
# the annotation value deserialized
{
"configuration": "my-other-mutating-webhook-configuration.example.com",
"webhook": "my-webhook-always-mutate.example.com",
"patchType": "JSONPatch",
"patch": [
{
"op": "add",
"path": "/data/mutation-stage",
"value": "yes"
}
]
}
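To capture the patch annotations above, audit events must be recorded at the Request level or higher. A minimal audit policy sketch that does this for all requests (the blanket rule scope is illustrative; real policies are usually narrower):
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
# Record request-level detail so patch.webhook.admission.k8s.io/* annotations are kept
- level: Request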
Admission webhook metrics
Kube-apiserver exposes Prometheus metrics from the /metrics endpoint, which can be used for monitoring and diagnosing API server status. The following metrics record status related to admission webhooks.
API server admission webhook rejection count
Sometimes it's useful to know which admission webhooks are frequently rejecting API requests, and the reason for a rejection.
In v1.16+, kube-apiserver exposes a Prometheus counter metric recording admission webhook rejections. The metrics are labelled to identify the causes of webhook rejection(s):
- name: the name of the webhook that rejected a request.
- operation: the operation type of the request; can be one of CREATE, UPDATE, DELETE, and CONNECT.
- type: the admission webhook type; can be one of admit and validating.
- error_type: identifies if an error occurred during the webhook invocation that caused the rejection. Its value can be one of:
  - calling_webhook_error: unrecognized errors or timeout errors from the admission webhook happened and the webhook's failure policy is set to Fail.
  - no_error: no error occurred. The webhook rejected the request with allowed: false in the admission response. The metrics label rejection_code records the .status.code set in the admission response.
  - apiserver_internal_error: an API server internal error happened.
- rejection_code: the HTTP status code set in the admission response when a webhook rejected a request.
Example of the rejection count metrics:
# HELP apiserver_admission_webhook_rejection_count [ALPHA] Admission webhook rejection count, identified by name and broken out for each admission type (validating or admit) and operation. Additional labels specify an error type (calling_webhook_error or apiserver_internal_error if an error occurred; no_error otherwise) and optionally a non-zero rejection code if the webhook rejects the request with an HTTP status code (honored by the apiserver when the code is greater or equal to 400). Codes greater than 600 are truncated to 600, to keep the metrics cardinality bounded.
# TYPE apiserver_admission_webhook_rejection_count counter
apiserver_admission_webhook_rejection_count{error_type="calling_webhook_error",name="always-timeout-webhook.example.com",operation="CREATE",rejection_code="0",type="validating"} 1
apiserver_admission_webhook_rejection_count{error_type="calling_webhook_error",name="invalid-admission-response-webhook.example.com",operation="CREATE",rejection_code="0",type="validating"} 1
apiserver_admission_webhook_rejection_count{error_type="no_error",name="deny-unwanted-configmap-data.example.com",operation="CREATE",rejection_code="400",type="validating"} 13
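To inspect these counters on a live cluster, you can query the API server's /metrics endpoint directly; for example (assuming your credentials grant access to /metrics):
# Dump the rejection counters from the API server metrics endpoint
kubectl get --raw /metrics | grep apiserver_admission_webhook_rejection_count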
Best practices and warnings
Idempotence
An idempotent mutating admission webhook is able to successfully process an object it has already admitted and potentially modified. The admission can be applied multiple times without changing the result beyond the initial application.
Example of idempotent mutating admission webhooks:
- For a CREATE pod request, set the field .spec.securityContext.runAsNonRoot of the pod to true, to enforce security best practices.
- For a CREATE pod request, if the field .spec.containers[].resources.limits of a container is not set, set default resource limits.
- For a CREATE pod request, inject a sidecar container with name foo-sidecar if no container with the name foo-sidecar already exists.
In the cases above, the webhook can be safely reinvoked, or admit an object that already has the fields set. A sketch of an idempotent response for the sidecar case follows.
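As an illustrative sketch (the image name is hypothetical), an idempotent webhook for the sidecar case inspects the incoming pod and returns a patch only when the container is absent. The patch field of an AdmissionReview response carries the base64 encoding of a JSON patch such as:
[
  {
    "op": "add",
    "path": "/spec/containers/-",
    "value": {"name": "foo-sidecar", "image": "registry.example/foo-sidecar:1.0"}
  }
]
When foo-sidecar is already present, the webhook simply allows the request without a patch:
{
  "apiVersion": "admission.k8s.io/v1",
  "kind": "AdmissionReview",
  "response": {"uid": "<uid from the request>", "allowed": true}
}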
Example of non-idempotent mutating admission webhooks:
- For a CREATE pod request, inject a sidecar container with name foo-sidecar suffixed with the current timestamp (e.g. foo-sidecar-19700101-000000).
- For a CREATE/UPDATE pod request, reject if the pod has label "env" set, otherwise add an "env": "prod" label to the pod.
- For a CREATE pod request, blindly append a sidecar container named foo-sidecar without looking to see if there is already a foo-sidecar container in the pod.
In the first case above, reinvoking the webhook can result in the same sidecar being injected multiple times into a pod, each time with a different container name. Similarly, the webhook can inject duplicated containers if the sidecar already exists in a user-provided pod.
In the second case above, reinvoking the webhook will result in the webhook failing on its own output.
In the third case above, reinvoking the webhook will result in duplicated containers in the pod spec, which makes the request invalid and rejected by the API server.
Intercepting all versions of an object
It is recommended that admission webhooks should always intercept all versions of an object by setting .webhooks[].matchPolicy to Equivalent. It is also recommended that admission webhooks should prefer registering for stable versions of resources.
Failure to intercept all versions of an object can result in admission policies not being enforced for requests in certain versions. See Matching requests: matchPolicy for examples.
Availability
It is recommended that admission webhooks should evaluate as quickly as possible (typically in milliseconds), since they add to API request latency. It is encouraged to use a small timeout for webhooks. See Timeouts for more detail.
It is recommended that admission webhooks should leverage some format of load-balancing, to provide high availability and performance benefits. If a webhook is running within the cluster, you can run multiple webhook backends behind a service to leverage the load-balancing that service supports.
Guaranteeing the final state of the object is seen
Admission webhooks that need to guarantee they see the final state of the object in order to enforce policy should use a validating admission webhook, since objects can be modified after being seen by mutating webhooks.
For example, a mutating admission webhook is configured to inject a sidecar container with name "foo-sidecar" on every CREATE pod request. If the sidecar must be present, a validating admission webhook should also be configured to intercept CREATE pod requests, and validate that a container with name "foo-sidecar" with the expected configuration exists in the to-be-created object.
Avoiding deadlocks in self-hosted webhooks
A webhook running inside the cluster might cause deadlocks for its own deployment if it is configured to intercept resources required to start its own pods.
For example, a mutating admission webhook is configured to admit CREATE pod requests only if a certain label is set in the pod (e.g. "env": "prod"). The webhook server runs in a deployment which doesn't set the "env" label. When a node that runs the webhook server pods becomes unhealthy, the webhook deployment will try to reschedule the pods to another node. However, the requests will get rejected by the existing webhook server since the "env" label is unset, and the migration cannot happen.
It is recommended to exclude the namespace where your webhook is running with a namespaceSelector, as sketched below.
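A minimal sketch of such an exclusion, assuming the webhook's own namespace is labeled admission-webhook: enabled (the label key and value are illustrative; namespaces without the label still match NotIn and are intercepted):
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
...
webhooks:
- name: my-webhook.example.com
  namespaceSelector:
    matchExpressions:
    - key: admission-webhook
      operator: NotIn
      values: ["enabled"]
...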
Side effects
It is recommended that admission webhooks should avoid side effects if possible, which means the webhooks operate only on the
content of the AdmissionReview
sent to them, and do not make out-of-band changes. The .webhooks[].sideEffects
field should
be set to None
if a webhook doesn't have any side effect.
If side effects are required during the admission evaluation, they must be suppressed when processing an
AdmissionReview
object with dryRun
set to true
, and the .webhooks[].sideEffects
field should be
set to NoneOnDryRun
. See Side effects for more detail.
Avoiding operating on the kube-system namespace
The kube-system namespace contains objects created by the Kubernetes system, e.g. service accounts for the control plane components, and pods like kube-dns. Accidentally mutating or rejecting requests in the kube-system namespace may cause the control plane components to stop functioning or introduce unknown behavior. If your admission webhooks don't intend to modify the behavior of the Kubernetes control plane, exclude the kube-system namespace from being intercepted using a namespaceSelector, as sketched below.
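A sketch that skips kube-system by matching its kubernetes.io/metadata.name label (populated automatically by the control plane in recent releases; verify it exists on your clusters before relying on it):
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
...
webhooks:
- name: my-webhook.example.com
  namespaceSelector:
    matchExpressions:
    - key: kubernetes.io/metadata.name
      operator: NotIn
      values: ["kube-system"]
...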
6.3.6 - Managing Service Accounts
This is a Cluster Administrator guide to service accounts. You should be familiar with configuring Kubernetes service accounts.
Support for authorization and user accounts is planned but incomplete. Sometimes incomplete features are referred to in order to better describe service accounts.
User accounts versus service accounts
Kubernetes distinguishes between the concept of a user account and a service account for a number of reasons:
- User accounts are for humans. Service accounts are for processes, which run in pods.
- User accounts are intended to be global. Names must be unique across all namespaces of a cluster. Service accounts are namespaced.
- Typically, a cluster's user accounts might be synced from a corporate database, where new user account creation requires special privileges and is tied to complex business processes. Service account creation is intended to be more lightweight, allowing cluster users to create service accounts for specific tasks by following the principle of least privilege.
- Auditing considerations for humans and service accounts may differ.
- A config bundle for a complex system may include definition of various service accounts for components of that system. Because service accounts can be created without many constraints and have namespaced names, such config is portable.
Service account automation
Three separate components cooperate to implement the automation around service accounts:
- A ServiceAccount admission controller
- A Token controller
- A ServiceAccount controller
ServiceAccount Admission Controller
The modification of pods is implemented via a plugin called an Admission Controller. It is part of the API server. It acts synchronously to modify pods as they are created or updated. When this plugin is active (and it is by default on most distributions), then it does the following when a pod is created or modified:
- If the pod does not have a ServiceAccount set, it sets the ServiceAccount to default.
- It ensures that the ServiceAccount referenced by the pod exists, and otherwise rejects it.
- It adds a volume to the pod which contains a token for API access, if neither the ServiceAccount's automountServiceAccountToken nor the Pod's automountServiceAccountToken is set to false (a Pod-level opt-out is sketched below).
- It adds a volumeSource to each container of the pod mounted at /var/run/secrets/kubernetes.io/serviceaccount, if the previous step has created a volume for the ServiceAccount token.
- If the pod does not contain any imagePullSecrets, then imagePullSecrets of the ServiceAccount are added to the pod.
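A minimal sketch of opting a single Pod out of automatic token mounting (the pod and image names are illustrative):
apiVersion: v1
kind: Pod
metadata:
  name: no-token-pod
spec:
  # Prevent the admission controller from adding the token volume to this pod
  automountServiceAccountToken: false
  containers:
  - name: app
    image: registry.example/app:1.0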
Bound Service Account Token Volume
Kubernetes v1.22 [stable]
The ServiceAccount admission controller will add the following projected volume instead of a Secret-based volume for the non-expiring service account token created by Token Controller.
- name: kube-api-access-<random-suffix>
projected:
defaultMode: 420 # 0644
sources:
- serviceAccountToken:
expirationSeconds: 3607
path: token
- configMap:
items:
- key: ca.crt
path: ca.crt
name: kube-root-ca.crt
- downwardAPI:
items:
- fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
path: namespace
This projected volume consists of three sources:
- A ServiceAccountToken acquired from kube-apiserver via the TokenRequest API. It will expire after 1 hour by default or when the pod is deleted. It is bound to the pod and has kube-apiserver as the audience.
- A ConfigMap containing a CA bundle used for verifying connections to the kube-apiserver. This feature depends on the RootCAConfigMap feature gate, which publishes a "kube-root-ca.crt" ConfigMap to every namespace. RootCAConfigMap graduated to GA in 1.21 and defaults to true. (This flag will be removed from the --feature-gates arg in 1.22.)
- A DownwardAPI that references the namespace of the pod.
See more details about projected volumes.
Token Controller
TokenController runs as part of kube-controller-manager. It acts asynchronously. It:
- watches ServiceAccount creation and creates a corresponding ServiceAccount token Secret to allow API access.
- watches ServiceAccount deletion and deletes all corresponding ServiceAccount token Secrets.
- watches ServiceAccount token Secret addition, and ensures the referenced ServiceAccount exists, and adds a token to the Secret if needed.
- watches Secret deletion and removes a reference from the corresponding ServiceAccount if needed.
You must pass a service account private key file to the token controller in the kube-controller-manager using the --service-account-private-key-file flag. The private key is used to sign generated service account tokens. Similarly, you must pass the corresponding public key to the kube-apiserver using the --service-account-key-file flag. The public key will be used to verify the tokens during authentication.
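As a sketch of the paired flags (the key file paths are illustrative; kubeadm-based clusters typically generate this pair under /etc/kubernetes/pki):
kube-controller-manager --service-account-private-key-file=/etc/kubernetes/pki/sa.key ...
kube-apiserver --service-account-key-file=/etc/kubernetes/pki/sa.pub ...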
To create additional API tokens
A controller loop ensures a Secret with an API token exists for each ServiceAccount. To create additional API tokens for a ServiceAccount, create a Secret of type kubernetes.io/service-account-token with an annotation referencing the ServiceAccount, and the controller will update it with a generated token.
Below is a sample configuration for such a Secret:
apiVersion: v1
kind: Secret
metadata:
name: mysecretname
annotations:
kubernetes.io/service-account.name: myserviceaccount
type: kubernetes.io/service-account-token
kubectl create -f ./secret.yaml
kubectl describe secret mysecretname
To delete/invalidate a ServiceAccount token Secret
kubectl delete secret mysecretname
ServiceAccount controller
A ServiceAccount controller manages the ServiceAccounts inside namespaces, and ensures a ServiceAccount named "default" exists in every active namespace.
6.3.7 - Authorization Overview
Learn more about Kubernetes authorization, including details about creating policies using the supported authorization modules.
In Kubernetes, you must be authenticated (logged in) before your request can be authorized (granted permission to access). For information about authentication, see Controlling Access to the Kubernetes API.
Kubernetes expects attributes that are common to REST API requests. This means that Kubernetes authorization works with existing organization-wide or cloud-provider-wide access control systems which may handle other APIs besides the Kubernetes API.
Determine Whether a Request is Allowed or Denied
Kubernetes authorizes API requests using the API server. It evaluates all of the request attributes against all policies and allows or denies the request. All parts of an API request must be allowed by some policy in order to proceed. This means that permissions are denied by default.
(Although Kubernetes uses the API server, access controls and policies that depend on specific fields of specific kinds of objects are handled by Admission Controllers.)
When multiple authorization modules are configured, each is checked in sequence. If any authorizer approves or denies a request, that decision is immediately returned and no other authorizer is consulted. If all modules have no opinion on the request, then the request is denied. A deny returns an HTTP status code 403.
Review Your Request Attributes
Kubernetes reviews only the following API request attributes:
- user - The user string provided during authentication.
- group - The list of group names to which the authenticated user belongs.
- extra - A map of arbitrary string keys to string values, provided by the authentication layer.
- API - Indicates whether the request is for an API resource.
- Request path - Path to miscellaneous non-resource endpoints like /api or /healthz.
- API request verb - API verbs like get, list, create, update, patch, watch, delete, and deletecollection are used for resource requests. To determine the request verb for a resource API endpoint, see Determine the request verb.
- HTTP request verb - Lowercased HTTP methods like get, post, put, and delete are used for non-resource requests.
- Resource - The ID or name of the resource that is being accessed (for resource requests only). For resource requests using get, update, patch, and delete verbs, you must provide the resource name.
- Subresource - The subresource that is being accessed (for resource requests only).
- Namespace - The namespace of the object that is being accessed (for namespaced resource requests only).
- API group - The API group being accessed (for resource requests only). An empty string designates the core API group.
Determine the Request Verb
Non-resource requests
Requests to endpoints other than /api/v1/... or /apis/<group>/<version>/... are considered "non-resource requests", and use the lower-cased HTTP method of the request as the verb. For example, a GET request to endpoints like /api or /healthz would use get as the verb.
Resource requests
To determine the request verb for a resource API endpoint, review the HTTP verb used and whether or not the request acts on an individual resource or a collection of resources:
HTTP verb | request verb
---|---
POST | create
GET, HEAD | get (for individual resources), list (for collections, including full object content), watch (for watching an individual resource or collection of resources)
PUT | update
PATCH | patch
DELETE | delete (for individual resources), deletecollection (for collections)
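As a concrete illustration of this mapping (the namespace and pod name are placeholders):
GET    /api/v1/namespaces/default/pods          -> list
GET    /api/v1/namespaces/default/pods/my-pod   -> get
DELETE /api/v1/namespaces/default/pods/my-pod   -> delete
DELETE /api/v1/namespaces/default/pods          -> deletecollection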
Kubernetes sometimes checks authorization for additional permissions using specialized verbs. For example:
- PodSecurityPolicy: the use verb on podsecuritypolicies resources in the policy API group.
- RBAC: the bind and escalate verbs on roles and clusterroles resources in the rbac.authorization.k8s.io API group.
- Authentication: the impersonate verb on users, groups, and serviceaccounts in the core API group, and the userextras in the authentication.k8s.io API group.
Authorization Modes
The Kubernetes API server may authorize a request using one of several authorization modes:
- Node - A special-purpose authorization mode that grants permissions to kubelets based on the pods they are scheduled to run. To learn more about using the Node authorization mode, see Node Authorization.
- ABAC - Attribute-based access control (ABAC) defines an access control paradigm whereby access rights are granted to users through the use of policies which combine attributes together. The policies can use any type of attributes (user attributes, resource attributes, object, environment attributes, etc). To learn more about using the ABAC mode, see ABAC Mode.
- RBAC - Role-based access control (RBAC) is a method of regulating access to computer or network resources based on the roles of individual users within an enterprise. In this context, access is the ability of an individual user to perform a specific task, such as view, create, or modify a file. To learn more about using the RBAC mode, see RBAC Mode.
  - When specified, RBAC (Role-Based Access Control) uses the rbac.authorization.k8s.io API group to drive authorization decisions, allowing admins to dynamically configure permission policies through the Kubernetes API.
  - To enable RBAC, start the apiserver with --authorization-mode=RBAC.
- Webhook - A WebHook is an HTTP callback: an HTTP POST that occurs when something happens; a simple event-notification via HTTP POST. A web application implementing WebHooks will POST a message to a URL when certain things happen. To learn more about using the Webhook mode, see Webhook Mode.
Checking API Access
kubectl provides the auth can-i subcommand for quickly querying the API authorization layer. The command uses the SelfSubjectAccessReview API to determine if the current user can perform a given action, and works regardless of the authorization mode used.
kubectl auth can-i create deployments --namespace dev
The output is similar to this:
yes
kubectl auth can-i create deployments --namespace prod
The output is similar to this:
no
Administrators can combine this with user impersonation to determine what action other users can perform.
kubectl auth can-i list secrets --namespace dev --as dave
The output is similar to this:
no
Similarly, to check whether a ServiceAccount named dev-sa in Namespace dev can list Pods in the Namespace target:
kubectl auth can-i list pods \
--namespace target \
--as system:serviceaccount:dev:dev-sa
The output is similar to this:
yes
SelfSubjectAccessReview is part of the authorization.k8s.io API group, which exposes the API server authorization to external services. Other resources in this group include:
- SubjectAccessReview - Access review for any user, not only the current one. Useful for delegating authorization decisions to the API server. For example, the kubelet and extension API servers use this to determine user access to their own APIs.
- LocalSubjectAccessReview - Like SubjectAccessReview but restricted to a specific namespace.
- SelfSubjectRulesReview - A review which returns the set of actions a user can perform within a namespace. Useful for users to quickly summarize their own access, or for UIs to hide/show actions.
These APIs can be queried by creating normal Kubernetes resources, where the response "status" field of the returned object is the result of the query.
kubectl create -f - -o yaml << EOF
apiVersion: authorization.k8s.io/v1
kind: SelfSubjectAccessReview
spec:
resourceAttributes:
group: apps
resource: deployments
verb: create
namespace: dev
EOF
The generated SelfSubjectAccessReview is:
apiVersion: authorization.k8s.io/v1
kind: SelfSubjectAccessReview
metadata:
creationTimestamp: null
spec:
resourceAttributes:
group: apps
resource: deployments
namespace: dev
verb: create
status:
allowed: true
denied: false
Using Flags for Your Authorization Module
You must include a flag to indicate which authorization modules your policies use. The following flags can be used:
- --authorization-mode=ABAC - Attribute-Based Access Control (ABAC) mode allows you to configure policies using local files.
- --authorization-mode=RBAC - Role-based access control (RBAC) mode allows you to create and store policies using the Kubernetes API.
- --authorization-mode=Webhook - WebHook is an HTTP callback mode that allows you to manage authorization using a remote REST endpoint.
- --authorization-mode=Node - Node authorization is a special-purpose authorization mode that specifically authorizes API requests made by kubelets.
- --authorization-mode=AlwaysDeny - This flag blocks all requests. Use this flag only for testing.
- --authorization-mode=AlwaysAllow - This flag allows all requests. Use this flag only if you do not require authorization for your API requests.
You can choose more than one authorization module. Modules are checked in order, so an earlier module has higher priority to allow or deny a request.
Privilege escalation via workload creation or edits
Users who can create/edit pods in a namespace, either directly or through a controller such as an operator, could escalate their privileges in that namespace.
Escalation paths
- Mounting arbitrary secrets in that namespace
- Can be used to access secrets meant for other workloads
- Can be used to obtain a more privileged service account's service account token
- Using arbitrary Service Accounts in that namespace
- Can perform Kubernetes API actions as another workload (impersonation)
- Can perform any privileged actions that Service Account has
- Mounting configmaps meant for other workloads in that namespace
- Can be used to obtain information meant for other workloads, such as DB host names.
- Mounting volumes meant for other workloads in that namespace
- Can be used to obtain information meant for other workloads, and change it.
What's next
- To learn more about Authentication, see Authentication in Controlling Access to the Kubernetes API.
- To learn more about Admission Control, see Using Admission Controllers.
6.3.8 - Using RBAC Authorization
Role-based access control (RBAC) is a method of regulating access to computer or network resources based on the roles of individual users within your organization.
RBAC authorization uses the rbac.authorization.k8s.io API group to drive authorization decisions, allowing you to dynamically configure policies through the Kubernetes API.
To enable RBAC, start the API server with the --authorization-mode flag set to a comma-separated list that includes RBAC; for example:
kube-apiserver --authorization-mode=Example,RBAC --other-options --more-options
API objects
The RBAC API declares four kinds of Kubernetes object: Role, ClusterRole,
RoleBinding and ClusterRoleBinding. You can
describe objects,
or amend them, using tools such as kubectl,
just like any other Kubernetes object.
Role and ClusterRole
An RBAC Role or ClusterRole contains rules that represent a set of permissions. Permissions are purely additive (there are no "deny" rules).
A Role always sets permissions within a particular namespace; when you create a Role, you have to specify the namespace it belongs in.
ClusterRole, by contrast, is a non-namespaced resource. The resources have different names (Role and ClusterRole) because a Kubernetes object always has to be either namespaced or not namespaced; it can't be both.
ClusterRoles have several uses. You can use a ClusterRole to:
- define permissions on namespaced resources and be granted within individual namespace(s)
- define permissions on namespaced resources and be granted across all namespaces
- define permissions on cluster-scoped resources
If you want to define a role within a namespace, use a Role; if you want to define a role cluster-wide, use a ClusterRole.
Role example
Here's an example Role in the "default" namespace that can be used to grant read access to pods:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
namespace: default
name: pod-reader
rules:
- apiGroups: [""] # "" indicates the core API group
resources: ["pods"]
verbs: ["get", "watch", "list"]
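As a usage sketch, the same Role can also be created imperatively:
kubectl create role pod-reader --verb=get --verb=list --verb=watch --resource=pods --namespace=default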
ClusterRole example
A ClusterRole can be used to grant the same permissions as a Role. Because ClusterRoles are cluster-scoped, you can also use them to grant access to:
- cluster-scoped resources (like nodes)
- non-resource endpoints (like /healthz)
- namespaced resources (like Pods), across all namespaces
For example: you can use a ClusterRole to allow a particular user to run
kubectl get pods --all-namespaces
Here is an example of a ClusterRole that can be used to grant read access to secrets in any particular namespace, or across all namespaces (depending on how it is bound):
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
# "namespace" omitted since ClusterRoles are not namespaced
name: secret-reader
rules:
- apiGroups: [""]
#
# at the HTTP level, the name of the resource for accessing Secret
# objects is "secrets"
resources: ["secrets"]
verbs: ["get", "watch", "list"]
The name of a Role or a ClusterRole object must be a valid path segment name.
RoleBinding and ClusterRoleBinding
A role binding grants the permissions defined in a role to a user or set of users. It holds a list of subjects (users, groups, or service accounts), and a reference to the role being granted. A RoleBinding grants permissions within a specific namespace whereas a ClusterRoleBinding grants that access cluster-wide.
A RoleBinding may reference any Role in the same namespace. Alternatively, a RoleBinding can reference a ClusterRole and bind that ClusterRole to the namespace of the RoleBinding. If you want to bind a ClusterRole to all the namespaces in your cluster, you use a ClusterRoleBinding.
The name of a RoleBinding or ClusterRoleBinding object must be a valid path segment name.
RoleBinding examples
Here is an example of a RoleBinding that grants the "pod-reader" Role to the user "jane" within the "default" namespace. This allows "jane" to read pods in the "default" namespace.
apiVersion: rbac.authorization.k8s.io/v1
# This role binding allows "jane" to read pods in the "default" namespace.
# You need to already have a Role named "pod-reader" in that namespace.
kind: RoleBinding
metadata:
name: read-pods
namespace: default
subjects:
# You can specify more than one "subject"
- kind: User
name: jane # "name" is case sensitive
apiGroup: rbac.authorization.k8s.io
roleRef:
# "roleRef" specifies the binding to a Role / ClusterRole
kind: Role #this must be Role or ClusterRole
name: pod-reader # this must match the name of the Role or ClusterRole you wish to bind to
apiGroup: rbac.authorization.k8s.io
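Equivalently, as a usage sketch, the same binding can be created imperatively:
kubectl create rolebinding read-pods --role=pod-reader --user=jane --namespace=default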
A RoleBinding can also reference a ClusterRole to grant the permissions defined in that ClusterRole to resources inside the RoleBinding's namespace. This kind of reference lets you define a set of common roles across your cluster, then reuse them within multiple namespaces.
For instance, even though the following RoleBinding refers to a ClusterRole, "dave" (the subject, case sensitive) will only be able to read Secrets in the "development" namespace, because the RoleBinding's namespace (in its metadata) is "development".
apiVersion: rbac.authorization.k8s.io/v1
# This role binding allows "dave" to read secrets in the "development" namespace.
# You need to already have a ClusterRole named "secret-reader".
kind: RoleBinding
metadata:
name: read-secrets
#
# The namespace of the RoleBinding determines where the permissions are granted.
# This only grants permissions within the "development" namespace.
namespace: development
subjects:
- kind: User
name: dave # Name is case sensitive
apiGroup: rbac.authorization.k8s.io
roleRef:
kind: ClusterRole
name: secret-reader
apiGroup: rbac.authorization.k8s.io
ClusterRoleBinding example
To grant permissions across a whole cluster, you can use a ClusterRoleBinding. The following ClusterRoleBinding allows any user in the group "manager" to read secrets in any namespace.
apiVersion: rbac.authorization.k8s.io/v1
# This cluster role binding allows anyone in the "manager" group to read secrets in any namespace.
kind: ClusterRoleBinding
metadata:
name: read-secrets-global
subjects:
- kind: Group
name: manager # Name is case sensitive
apiGroup: rbac.authorization.k8s.io
roleRef:
kind: ClusterRole
name: secret-reader
apiGroup: rbac.authorization.k8s.io
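As a usage sketch, the same cluster-wide binding can be created imperatively:
kubectl create clusterrolebinding read-secrets-global --clusterrole=secret-reader --group=manager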
After you create a binding, you cannot change the Role or ClusterRole that it refers to. If you try to change a binding's roleRef, you get a validation error. If you do want to change the roleRef for a binding, you need to remove the binding object and create a replacement.
There are two reasons for this restriction:
- Making roleRef immutable allows granting someone update permission on an existing binding object, so that they can manage the list of subjects, without being able to change the role that is granted to those subjects.
- A binding to a different role is a fundamentally different binding. Requiring a binding to be deleted/recreated in order to change the roleRef ensures the full list of subjects in the binding is intended to be granted the new role (as opposed to enabling or accidentally modifying only the roleRef without verifying all of the existing subjects should be given the new role's permissions).
The kubectl auth reconcile command-line utility creates or updates RBAC objects from a manifest file, and handles deleting and recreating binding objects if required to change the role they refer to. See command usage and examples for more information.
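For instance (the manifest file name is a placeholder):
kubectl auth reconcile -f my-rbac-rules.yaml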
Referring to resources
In the Kubernetes API, most resources are represented and accessed using a string representation of their object name, such as pods for a Pod. RBAC refers to resources using exactly the same name that appears in the URL for the relevant API endpoint. Some Kubernetes APIs involve a subresource, such as the logs for a Pod. A request for a Pod's logs looks like:
GET /api/v1/namespaces/{namespace}/pods/{name}/log
In this case, pods is the namespaced resource for Pod resources, and log is a subresource of pods. To represent this in an RBAC role, use a slash (/) to delimit the resource and subresource. To allow a subject to read pods and also access the log subresource for each of those Pods, you write:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
namespace: default
name: pod-and-pod-logs-reader
rules:
- apiGroups: [""]
resources: ["pods", "pods/log"]
verbs: ["get", "list"]
You can also refer to resources by name for certain requests through the resourceNames list. When specified, requests can be restricted to individual instances of a resource. Here is an example that restricts its subject to only get or update a ConfigMap named my-configmap:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
namespace: default
name: configmap-updater
rules:
- apiGroups: [""]
#
# at the HTTP level, the name of the resource for accessing ConfigMap
# objects is "configmaps"
resources: ["configmaps"]
resourceNames: ["my-configmap"]
verbs: ["update", "get"]
You cannot restrict create or deletecollection requests by their resource name. For create, this limitation is because the name of the new object may not be known at authorization time.
If you restrict list or watch by resourceName, clients must include a metadata.name field selector in their list or watch request that matches the specified resourceName in order to be authorized. For example:
kubectl get configmaps --field-selector=metadata.name=my-configmap
Aggregated ClusterRoles
You can aggregate several ClusterRoles into one combined ClusterRole.
A controller, running as part of the cluster control plane, watches for ClusterRole objects with an aggregationRule set. The aggregationRule defines a label selector that the controller uses to match other ClusterRole objects that should be combined into the rules field of this one.
Here is an example aggregated ClusterRole:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: monitoring
aggregationRule:
clusterRoleSelectors:
- matchLabels:
rbac.example.com/aggregate-to-monitoring: "true"
rules: [] # The control plane automatically fills in the rules
If you create a new ClusterRole that matches the label selector of an existing aggregated ClusterRole,
that change triggers adding the new rules into the aggregated ClusterRole.
Here is an example that adds rules to the "monitoring" ClusterRole, by creating another ClusterRole labeled rbac.example.com/aggregate-to-monitoring: true.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: monitoring-endpoints
labels:
rbac.example.com/aggregate-to-monitoring: "true"
# When you create the "monitoring-endpoints" ClusterRole,
# the rules below will be added to the "monitoring" ClusterRole.
rules:
- apiGroups: [""]
resources: ["services", "endpoints", "pods"]
verbs: ["get", "list", "watch"]
The default user-facing roles use ClusterRole aggregation. This lets you, as a cluster administrator, include rules for custom resources, such as those served by CustomResourceDefinitions or aggregated API servers, to extend the default roles.
For example: the following ClusterRoles let the "admin" and "edit" default roles manage the custom resource named CronTab, whereas the "view" role can perform only read actions on CronTab resources. You can assume that CronTab objects are named "crontabs" in URLs as seen by the API server.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: aggregate-cron-tabs-edit
labels:
# Add these permissions to the "admin" and "edit" default roles.
rbac.authorization.k8s.io/aggregate-to-admin: "true"
rbac.authorization.k8s.io/aggregate-to-edit: "true"
rules:
- apiGroups: ["stable.example.com"]
resources: ["crontabs"]
verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: aggregate-cron-tabs-view
labels:
# Add these permissions to the "view" default role.
rbac.authorization.k8s.io/aggregate-to-view: "true"
rules:
- apiGroups: ["stable.example.com"]
resources: ["crontabs"]
verbs: ["get", "list", "watch"]
Role examples
The following examples are excerpts from Role or ClusterRole objects, showing only the rules section.
Allow reading "pods" resources in the core API Group:
rules:
- apiGroups: [""]
#
# at the HTTP level, the name of the resource for accessing Pod
# objects is "pods"
resources: ["pods"]
verbs: ["get", "list", "watch"]
Allow reading/writing Deployments (at the HTTP level: objects with "deployments" in the resource part of their URL) in both the "extensions" and "apps" API groups:
rules:
- apiGroups: ["extensions", "apps"]
#
# at the HTTP level, the name of the resource for accessing Deployment
# objects is "deployments"
resources: ["deployments"]
verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
Allow reading Pods in the core API group, as well as reading or writing Job resources in the "batch" or "extensions" API groups:
rules:
- apiGroups: [""]
#
# at the HTTP level, the name of the resource for accessing Pod
# objects is "pods"
resources: ["pods"]
verbs: ["get", "list", "watch"]
- apiGroups: ["batch", "extensions"]
#
# at the HTTP level, the name of the resource for accessing Job
# objects is "jobs"
resources: ["jobs"]
verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
Allow reading a ConfigMap named "my-config" (must be bound with a RoleBinding to limit to a single ConfigMap in a single namespace):
rules:
- apiGroups: [""]
  # at the HTTP level, the name of the resource for accessing ConfigMap
  # objects is "configmaps"
  resources: ["configmaps"]
  resourceNames: ["my-config"]
  verbs: ["get"]
Allow reading the resource "nodes"
in the core group (because a
Node is cluster-scoped, this must be in a ClusterRole bound with a
ClusterRoleBinding to be effective):
rules:
- apiGroups: [""]
#
# at the HTTP level, the name of the resource for accessing Node
# objects is "nodes"
resources: ["nodes"]
verbs: ["get", "list", "watch"]
Allow GET and POST requests to the non-resource endpoint /healthz and all subpaths (must be in a ClusterRole bound with a ClusterRoleBinding to be effective):
rules:
- nonResourceURLs: ["/healthz", "/healthz/*"] # '*' in a nonResourceURL is a suffix glob match
  verbs: ["get", "post"]
Referring to subjects
A RoleBinding or ClusterRoleBinding binds a role to subjects. Subjects can be groups, users or ServiceAccounts.
Kubernetes represents usernames as strings. These can be: plain names, such as "alice"; email-style names, like "bob@example.com"; or numeric user IDs represented as a string. It is up to you as a cluster administrator to configure the authentication modules so that authentication produces usernames in the format you want.
The prefix system: is reserved for Kubernetes system use, so you should ensure that you don't have users or groups with names that start with system: by accident. Other than this special prefix, the RBAC authorization system does not require any format for usernames.
In Kubernetes, Authenticator modules provide group information. Groups, like users, are represented as strings, and that string has no format requirements, other than that the prefix system: is reserved.
ServiceAccounts have names prefixed with system:serviceaccount:, and belong to groups that have names prefixed with system:serviceaccounts:.
- system:serviceaccount: (singular) is the prefix for service account usernames.
- system:serviceaccounts: (plural) is the prefix for service account groups.
RoleBinding examples
The following examples are RoleBinding excerpts that only show the subjects section.
For a user named alice@example.com:
subjects:
- kind: User
  name: "alice@example.com"
  apiGroup: rbac.authorization.k8s.io
For a group named frontend-admins:
subjects:
- kind: Group
  name: "frontend-admins"
  apiGroup: rbac.authorization.k8s.io
For the default service account in the "kube-system" namespace:
subjects:
- kind: ServiceAccount
  name: default
  namespace: kube-system
For all service accounts in the "qa" group in any namespace:
subjects:
- kind: Group
  name: system:serviceaccounts:qa
  apiGroup: rbac.authorization.k8s.io
For all service accounts in the "dev" group in the "development" namespace:
subjects:
- kind: Group
  name: system:serviceaccounts:dev
  apiGroup: rbac.authorization.k8s.io
  namespace: development
For all service accounts in any namespace:
subjects:
- kind: Group
  name: system:serviceaccounts
  apiGroup: rbac.authorization.k8s.io
For all authenticated users:
subjects:
- kind: Group
  name: system:authenticated
  apiGroup: rbac.authorization.k8s.io
For all unauthenticated users:
subjects:
- kind: Group
  name: system:unauthenticated
  apiGroup: rbac.authorization.k8s.io
For all users:
subjects:
- kind: Group
  name: system:authenticated
  apiGroup: rbac.authorization.k8s.io
- kind: Group
  name: system:unauthenticated
  apiGroup: rbac.authorization.k8s.io
Default roles and role bindings
API servers create a set of default ClusterRole and ClusterRoleBinding objects.
Many of these are system: prefixed, which indicates that the resource is directly managed by the cluster control plane. All of the default ClusterRoles and ClusterRoleBindings are labeled with kubernetes.io/bootstrapping=rbac-defaults.
Take care when modifying ClusterRoles and ClusterRoleBindings with names that have a system: prefix. Modifications to these resources can result in non-functional clusters.
Auto-reconciliation
At each start-up, the API server updates default cluster roles with any missing permissions, and updates default cluster role bindings with any missing subjects. This allows the cluster to repair accidental modifications, and helps to keep roles and role bindings up-to-date as permissions and subjects change in new Kubernetes releases.
To opt out of this reconciliation, set the rbac.authorization.kubernetes.io/autoupdate annotation on a default cluster role or role binding to false.
Be aware that missing default permissions and subjects can result in non-functional clusters.
Auto-reconciliation is enabled by default if the RBAC authorizer is active.
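For example, you can opt one default binding out of reconciliation with kubectl annotate (a sketch only; the cluster-admin binding is used here purely as an illustration):
kubectl annotate clusterrolebinding/cluster-admin rbac.authorization.kubernetes.io/autoupdate=false --overwrite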
API discovery roles
Default role bindings authorize unauthenticated and authenticated users to read API information that is deemed safe to be publicly accessible (including CustomResourceDefinitions). To disable anonymous unauthenticated access, add --anonymous-auth=false to the API server configuration.
To view the configuration of these roles via kubectl, run:
kubectl get clusterroles system:discovery -o yaml
Default ClusterRole | Default ClusterRoleBinding | Description |
---|---|---|
system:basic-user | system:authenticated group | Allows a user read-only access to basic information about themselves. Prior to v1.14, this role was also bound to system:unauthenticated by default. |
system:discovery | system:authenticated group | Allows read-only access to API discovery endpoints needed to discover and negotiate an API level. Prior to v1.14, this role was also bound to system:unauthenticated by default. |
system:public-info-viewer | system:authenticated and system:unauthenticated groups | Allows read-only access to non-sensitive information about the cluster. Introduced in Kubernetes v1.14. |
User-facing roles
Some of the default ClusterRoles are not system: prefixed. These are intended to be user-facing roles. They include super-user roles (cluster-admin), roles intended to be granted cluster-wide using ClusterRoleBindings, and roles intended to be granted within particular namespaces using RoleBindings (admin, edit, view).
User-facing ClusterRoles use ClusterRole aggregation to allow admins to include rules for custom resources on these ClusterRoles. To add rules to the admin, edit, or view roles, create a ClusterRole with one or more of the following labels:
metadata:
  labels:
    rbac.authorization.k8s.io/aggregate-to-admin: "true"
    rbac.authorization.k8s.io/aggregate-to-edit: "true"
    rbac.authorization.k8s.io/aggregate-to-view: "true"
Default ClusterRole | Default ClusterRoleBinding | Description |
---|---|---|
cluster-admin | system:masters group | Allows super-user access to perform any action on any resource. When used in a ClusterRoleBinding, it gives full control over every resource in the cluster and in all namespaces. When used in a RoleBinding, it gives full control over every resource in the role binding's namespace, including the namespace itself. |
admin | None | Allows admin access, intended to be granted within a namespace using a RoleBinding. If used in a RoleBinding, allows read/write access to most resources in a namespace, including the ability to create roles and role bindings within the namespace. This role does not allow write access to resource quota or to the namespace itself. This role also does not allow write access to Endpoints in clusters created using Kubernetes v1.22+. More information is available in the "Write Access for Endpoints" section. |
edit | None | Allows read/write access to most objects in a namespace. This role does not allow viewing or modifying roles or role bindings. However, this role allows accessing Secrets and running Pods as any ServiceAccount in the namespace, so it can be used to gain the API access levels of any ServiceAccount in the namespace. This role also does not allow write access to Endpoints in clusters created using Kubernetes v1.22+. More information is available in the "Write Access for Endpoints" section. |
view | None | Allows read-only access to see most objects in a namespace. It does not allow viewing roles or role bindings. This role does not allow viewing Secrets, since reading the contents of Secrets enables access to ServiceAccount credentials in the namespace, which would allow API access as any ServiceAccount in the namespace (a form of privilege escalation). |
Core component roles
Default ClusterRole | Default ClusterRoleBinding | Description |
---|---|---|
system:kube-scheduler | system:kube-scheduler user | Allows access to the resources required by the scheduler component. |
system:volume-scheduler | system:kube-scheduler user | Allows access to the volume resources required by the kube-scheduler component. |
system:kube-controller-manager | system:kube-controller-manager user | Allows access to the resources required by the controller manager component. The permissions required by individual controllers are detailed in the controller roles. |
system:node | None | Allows access to resources required by the kubelet, including read access to all secrets, and write access to all pod status objects. You should use the Node authorizer and NodeRestriction admission plugin instead of the system:node role, and allow granting API access to kubelets based on the Pods scheduled to run on them. The system:node role only exists for compatibility with Kubernetes clusters upgraded from versions prior to v1.8. |
system:node-proxier | system:kube-proxy user | Allows access to the resources required by the kube-proxy component. |
Other component roles
Default ClusterRole | Default ClusterRoleBinding | Description |
---|---|---|
system:auth-delegator | None | Allows delegated authentication and authorization checks. This is commonly used by add-on API servers for unified authentication and authorization. |
system:heapster | None | Role for the Heapster component (deprecated). |
system:kube-aggregator | None | Role for the kube-aggregator component. |
system:kube-dns | kube-dns service account in the kube-system namespace | Role for the kube-dns component. |
system:kubelet-api-admin | None | Allows full access to the kubelet API. |
system:node-bootstrapper | None | Allows access to the resources required to perform kubelet TLS bootstrapping. |
system:node-problem-detector | None | Role for the node-problem-detector component. |
system:persistent-volume-provisioner | None | Allows access to the resources required by most dynamic volume provisioners. |
system:monitoring | system:monitoring group | Allows read access to control-plane monitoring endpoints (i.e. kube-apiserver liveness and readiness endpoints (/healthz, /livez, /readyz), the individual health-check endpoints (/healthz/*, /livez/*, /readyz/*), and /metrics). Note that individual health check endpoints and the metric endpoint may expose sensitive information. |
Roles for built-in controllers
The Kubernetes controller manager runs controllers that are built in to the Kubernetes control plane. When invoked with --use-service-account-credentials, kube-controller-manager starts each controller using a separate service account. Corresponding roles exist for each built-in controller, prefixed with system:controller:. If the controller manager is not started with --use-service-account-credentials, it runs all control loops using its own credential, which must be granted all the relevant roles.
These roles include:
- system:controller:attachdetach-controller
- system:controller:certificate-controller
- system:controller:clusterrole-aggregation-controller
- system:controller:cronjob-controller
- system:controller:daemon-set-controller
- system:controller:deployment-controller
- system:controller:disruption-controller
- system:controller:endpoint-controller
- system:controller:expand-controller
- system:controller:generic-garbage-collector
- system:controller:horizontal-pod-autoscaler
- system:controller:job-controller
- system:controller:namespace-controller
- system:controller:node-controller
- system:controller:persistent-volume-binder
- system:controller:pod-garbage-collector
- system:controller:pv-protection-controller
- system:controller:pvc-protection-controller
- system:controller:replicaset-controller
- system:controller:replication-controller
- system:controller:resourcequota-controller
- system:controller:root-ca-cert-publisher
- system:controller:route-controller
- system:controller:service-account-controller
- system:controller:service-controller
- system:controller:statefulset-controller
- system:controller:ttl-controller
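You can inspect the rules any of these roles grants with kubectl describe; for example, using the deployment controller's role:
kubectl describe clusterrole system:controller:deployment-controller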
Privilege escalation prevention and bootstrapping
The RBAC API prevents users from escalating privileges by editing roles or role bindings. Because this is enforced at the API level, it applies even when the RBAC authorizer is not in use.
Restrictions on role creation or update
You can only create/update a role if at least one of the following things is true:
- You already have all the permissions contained in the role, at the same scope as the object being modified (cluster-wide for a ClusterRole, within the same namespace or cluster-wide for a Role).
- You are granted explicit permission to perform the escalate verb on the roles or clusterroles resource in the rbac.authorization.k8s.io API group.
For example, if user-1 does not have the ability to list Secrets cluster-wide, they cannot create a ClusterRole containing that permission. To allow a user to create/update roles:
- Grant them a role that allows them to create/update Role or ClusterRole objects, as desired.
- Grant them permission to include specific permissions in the roles they create/update:
  - implicitly, by giving them those permissions (if they attempt to create or modify a Role or ClusterRole with permissions they themselves have not been granted, the API request will be forbidden)
  - or explicitly allow specifying any permission in a Role or ClusterRole by giving them permission to perform the escalate verb on roles or clusterroles resources in the rbac.authorization.k8s.io API group (see the sketch after this list)
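As a sketch of the explicit approach, a ClusterRole granting the escalate verb might look like this (the name role-escalator is hypothetical):
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: role-escalator # hypothetical name
rules:
- apiGroups: ["rbac.authorization.k8s.io"]
  resources: ["roles", "clusterroles"]
  verbs: ["escalate"]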
Restrictions on role binding creation or update
You can only create/update a role binding if you already have all the permissions contained in the referenced role (at the same scope as the role binding) or if you have been authorized to perform the bind verb on the referenced role.
For example, if user-1 does not have the ability to list Secrets cluster-wide, they cannot create a ClusterRoleBinding to a role that grants that permission. To allow a user to create/update role bindings:
- Grant them a role that allows them to create/update RoleBinding or ClusterRoleBinding objects, as desired.
- Grant them permissions needed to bind a particular role:
  - implicitly, by giving them the permissions contained in the role.
  - explicitly, by giving them permission to perform the bind verb on the particular Role (or ClusterRole).
For example, this ClusterRole and RoleBinding would allow user-1 to grant other users the admin, edit, and view roles in the namespace user-1-namespace:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: role-grantor
rules:
- apiGroups: ["rbac.authorization.k8s.io"]
  resources: ["rolebindings"]
  verbs: ["create"]
- apiGroups: ["rbac.authorization.k8s.io"]
  resources: ["clusterroles"]
  verbs: ["bind"]
  # omit resourceNames to allow binding any ClusterRole
  resourceNames: ["admin", "edit", "view"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: role-grantor-binding
  namespace: user-1-namespace
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: role-grantor
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: User
  name: user-1
When bootstrapping the first roles and role bindings, it is necessary for the initial user to grant permissions they do not yet have. To bootstrap initial roles and role bindings:
- Use a credential with the "system:masters" group, which is bound to the "cluster-admin" super-user role by the default bindings.
- If your API server runs with the insecure port enabled (--insecure-port), you can also make API calls via that port, which does not enforce authentication or authorization.
Command-line utilities
kubectl create role
Creates a Role object defining permissions within a single namespace. Examples:
- Create a Role named "pod-reader" that allows users to perform get, watch and list on pods:
  kubectl create role pod-reader --verb=get --verb=list --verb=watch --resource=pods
- Create a Role named "pod-reader" with resourceNames specified:
  kubectl create role pod-reader --verb=get --resource=pods --resource-name=readablepod --resource-name=anotherpod
- Create a Role named "foo" with apiGroups specified:
  kubectl create role foo --verb=get,list,watch --resource=replicasets.apps
- Create a Role named "foo" with subresource permissions:
  kubectl create role foo --verb=get,list,watch --resource=pods,pods/status
- Create a Role named "my-component-lease-holder" with permissions to get/update a resource with a specific name:
  kubectl create role my-component-lease-holder --verb=get,list,watch,update --resource=lease --resource-name=my-component
kubectl create clusterrole
Creates a ClusterRole. Examples:
- Create a ClusterRole named "pod-reader" that allows a user to perform get, watch and list on pods:
  kubectl create clusterrole pod-reader --verb=get,list,watch --resource=pods
- Create a ClusterRole named "pod-reader" with resourceNames specified:
  kubectl create clusterrole pod-reader --verb=get --resource=pods --resource-name=readablepod --resource-name=anotherpod
- Create a ClusterRole named "foo" with apiGroups specified:
  kubectl create clusterrole foo --verb=get,list,watch --resource=replicasets.apps
- Create a ClusterRole named "foo" with subresource permissions:
  kubectl create clusterrole foo --verb=get,list,watch --resource=pods,pods/status
- Create a ClusterRole named "foo" with nonResourceURL specified:
  kubectl create clusterrole "foo" --verb=get --non-resource-url=/logs/*
- Create a ClusterRole named "monitoring" with an aggregationRule specified:
  kubectl create clusterrole monitoring --aggregation-rule="rbac.example.com/aggregate-to-monitoring=true"
kubectl create rolebinding
Grants a Role or ClusterRole within a specific namespace. Examples:
- Within the namespace "acme", grant the permissions in the "admin" ClusterRole to a user named "bob":
  kubectl create rolebinding bob-admin-binding --clusterrole=admin --user=bob --namespace=acme
- Within the namespace "acme", grant the permissions in the "view" ClusterRole to the service account in the namespace "acme" named "myapp":
  kubectl create rolebinding myapp-view-binding --clusterrole=view --serviceaccount=acme:myapp --namespace=acme
- Within the namespace "acme", grant the permissions in the "view" ClusterRole to a service account in the namespace "myappnamespace" named "myapp":
  kubectl create rolebinding myappnamespace-myapp-view-binding --clusterrole=view --serviceaccount=myappnamespace:myapp --namespace=acme
kubectl create clusterrolebinding
Grants a ClusterRole across the entire cluster (all namespaces). Examples:
- Across the entire cluster, grant the permissions in the "cluster-admin" ClusterRole to a user named "root":
  kubectl create clusterrolebinding root-cluster-admin-binding --clusterrole=cluster-admin --user=root
- Across the entire cluster, grant the permissions in the "system:node-proxier" ClusterRole to a user named "system:kube-proxy":
  kubectl create clusterrolebinding kube-proxy-binding --clusterrole=system:node-proxier --user=system:kube-proxy
- Across the entire cluster, grant the permissions in the "view" ClusterRole to a service account named "myapp" in the namespace "acme":
  kubectl create clusterrolebinding myapp-view-binding --clusterrole=view --serviceaccount=acme:myapp
kubectl auth reconcile
Creates or updates rbac.authorization.k8s.io/v1 API objects from a manifest file.
Missing objects are created, and the containing namespace is created for namespaced objects, if required.
Existing roles are updated to include the permissions in the input objects, and to remove extra permissions if --remove-extra-permissions is specified.
Existing bindings are updated to include the subjects in the input objects, and to remove extra subjects if --remove-extra-subjects is specified.
Examples:
- Test applying a manifest file of RBAC objects, displaying changes that would be made:
  kubectl auth reconcile -f my-rbac-rules.yaml --dry-run=client
- Apply a manifest file of RBAC objects, preserving any extra permissions (in roles) and any extra subjects (in bindings):
  kubectl auth reconcile -f my-rbac-rules.yaml
- Apply a manifest file of RBAC objects, removing any extra permissions (in roles) and any extra subjects (in bindings):
  kubectl auth reconcile -f my-rbac-rules.yaml --remove-extra-subjects --remove-extra-permissions
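For illustration, my-rbac-rules.yaml might contain a Role and a matching RoleBinding such as the following sketch (the names, namespace, and user are hypothetical):
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader # hypothetical
  namespace: demo # hypothetical
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods # hypothetical
  namespace: demo
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-reader
subjects:
- kind: User
  name: jane # hypothetical
  apiGroup: rbac.authorization.k8s.io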
ServiceAccount permissions
Default RBAC policies grant scoped permissions to control-plane components, nodes, and controllers, but grant no permissions to service accounts outside the kube-system namespace (beyond discovery permissions given to all authenticated users).
This allows you to grant particular roles to particular ServiceAccounts as needed. Fine-grained role bindings provide greater security, but require more effort to administrate. Broader grants can give unnecessary (and potentially escalating) API access to ServiceAccounts, but are easier to administrate.
In order from most secure to least secure, the approaches are:
- Grant a role to an application-specific service account (best practice)
  This requires the application to specify a serviceAccountName in its pod spec, and for the service account to be created (via the API, application manifest, kubectl create serviceaccount, etc.).
  For example, grant read-only permission within "my-namespace" to the "my-sa" service account:
  kubectl create rolebinding my-sa-view \
    --clusterrole=view \
    --serviceaccount=my-namespace:my-sa \
    --namespace=my-namespace
- Grant a role to the "default" service account in a namespace
  If an application does not specify a serviceAccountName, it uses the "default" service account.
  Note: Permissions given to the "default" service account are available to any pod in the namespace that does not specify a serviceAccountName.
  For example, grant read-only permission within "my-namespace" to the "default" service account:
  kubectl create rolebinding default-view \
    --clusterrole=view \
    --serviceaccount=my-namespace:default \
    --namespace=my-namespace
  Many add-ons run as the "default" service account in the kube-system namespace. To allow those add-ons to run with super-user access, grant cluster-admin permissions to the "default" service account in the kube-system namespace.
  Caution: Enabling this means the kube-system namespace contains Secrets that grant super-user access to your cluster's API.
  kubectl create clusterrolebinding add-on-cluster-admin \
    --clusterrole=cluster-admin \
    --serviceaccount=kube-system:default
- Grant a role to all service accounts in a namespace
  If you want all applications in a namespace to have a role, no matter what service account they use, you can grant a role to the service account group for that namespace.
  For example, grant read-only permission within "my-namespace" to all service accounts in that namespace:
  kubectl create rolebinding serviceaccounts-view \
    --clusterrole=view \
    --group=system:serviceaccounts:my-namespace \
    --namespace=my-namespace
- Grant a limited role to all service accounts cluster-wide (discouraged)
  If you don't want to manage permissions per-namespace, you can grant a cluster-wide role to all service accounts.
  For example, grant read-only permission across all namespaces to all service accounts in the cluster:
  kubectl create clusterrolebinding serviceaccounts-view \
    --clusterrole=view \
    --group=system:serviceaccounts
- Grant super-user access to all service accounts cluster-wide (strongly discouraged)
  If you don't care about partitioning permissions at all, you can grant super-user access to all service accounts.
  Warning: This allows any application full access to your cluster, and also grants any user with read access to Secrets (or the ability to create any pod) full access to your cluster.
  kubectl create clusterrolebinding serviceaccounts-cluster-admin \
    --clusterrole=cluster-admin \
    --group=system:serviceaccounts
Write access for Endpoints
Kubernetes clusters created before Kubernetes v1.22 include write access to Endpoints in the aggregated "edit" and "admin" roles. As a mitigation for CVE-2021-25740, this access is not part of the aggregated roles in clusters that you create using Kubernetes v1.22 or later.
Existing clusters that have been upgraded to Kubernetes v1.22 will not be subject to this change. The CVE announcement includes guidance for restricting this access in existing clusters.
If you want new clusters to retain this level of access in the aggregated roles, you can create the following ClusterRole:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  annotations:
    kubernetes.io/description: |-
      Add endpoints write permissions to the edit and admin roles. This was
      removed by default in 1.22 because of CVE-2021-25740. See
      https://issue.k8s.io/103675. This can allow writers to direct LoadBalancer
      or Ingress implementations to expose backend IPs that would not otherwise
      be accessible, and can circumvent network policies or security controls
      intended to prevent/isolate access to those backends.
  labels:
    rbac.authorization.k8s.io/aggregate-to-edit: "true"
  name: custom:aggregate-to-edit:endpoints # you can change this if you wish
rules:
- apiGroups: [""]
  resources: ["endpoints"]
  verbs: ["create", "delete", "deletecollection", "patch", "update"]
Upgrading from ABAC
Clusters that originally ran older Kubernetes versions often used permissive ABAC policies, including granting full API access to all service accounts.
Default RBAC policies grant scoped permissions to control-plane components, nodes, and controllers, but grant no permissions to service accounts outside the kube-system namespace (beyond discovery permissions given to all authenticated users).
While far more secure, this can be disruptive to existing workloads expecting to automatically receive API permissions. Here are two approaches for managing this transition:
Parallel authorizers
Run both the RBAC and ABAC authorizers, and specify a policy file that contains the legacy ABAC policy:
--authorization-mode=...,RBAC,ABAC --authorization-policy-file=mypolicy.json
To explain that first command line option in detail: if earlier authorizers, such as Node, deny a request, then the RBAC authorizer attempts to authorize the API request. If RBAC also denies that API request, the ABAC authorizer is then run. This means that any request allowed by either the RBAC or ABAC policies is allowed.
When the kube-apiserver is run with a log level of 5 or higher for the RBAC component (--vmodule=rbac*=5 or --v=5), you can see RBAC denials in the API server log (prefixed with RBAC).
You can use that information to determine which roles need to be granted to which users, groups, or service accounts.
Once you have granted roles to service accounts and workloads are running with no RBAC denial messages in the server logs, you can remove the ABAC authorizer.
Permissive RBAC permissions
You can replicate a permissive ABAC policy using RBAC role bindings.
The following policy allows ALL service accounts to act as cluster administrators. Any application running in a container receives service account credentials automatically, and could perform any action against the API, including viewing secrets and modifying permissions. This is not a recommended policy.
kubectl create clusterrolebinding permissive-binding \
--clusterrole=cluster-admin \
--user=admin \
--user=kubelet \
--group=system:serviceaccounts
After you have transitioned to use RBAC, you should adjust the access controls for your cluster to ensure that these meet your information security needs.
6.3.9 - Using ABAC Authorization
Attribute-based access control (ABAC) defines an access control paradigm whereby access rights are granted to users through the use of policies which combine attributes together.
Policy File Format
To enable ABAC mode, specify --authorization-policy-file=SOME_FILENAME and --authorization-mode=ABAC on startup.
The file format is one JSON object per line. There should be no enclosing list or map, only one map per line.
Each line is a "policy object", where each such object is a map with the following properties:
- Versioning properties:
  - apiVersion, type string; valid values are "abac.authorization.kubernetes.io/v1beta1". Allows versioning and conversion of the policy format.
  - kind, type string; valid values are "Policy". Allows versioning and conversion of the policy format.
- spec property set to a map with the following properties:
  - Subject-matching properties:
    - user, type string; the user-string from --token-auth-file. If you specify user, it must match the username of the authenticated user.
    - group, type string; if you specify group, it must match one of the groups of the authenticated user. system:authenticated matches all authenticated requests. system:unauthenticated matches all unauthenticated requests.
  - Resource-matching properties:
    - apiGroup, type string; an API group.
      - Ex: extensions
      - Wildcard: * matches all API groups.
    - namespace, type string; a namespace.
      - Ex: kube-system
      - Wildcard: * matches all resource requests.
    - resource, type string; a resource type.
      - Ex: pods
      - Wildcard: * matches all resource requests.
  - Non-resource-matching properties:
    - nonResourcePath, type string; non-resource request paths.
      - Ex: /version or /apis
      - Wildcard: * matches all non-resource requests. /foo/* matches all subpaths of /foo/.
  - readonly, type boolean; when true, means that the Resource-matching policy only applies to get, list, and watch operations, and the Non-resource-matching policy only applies to the get operation.
An unset property is the same as a property set to the zero value for its type (e.g. empty string, 0, false). However, unset should be preferred for readability.
In the future, policies may be expressed in a JSON format, and managed via a REST interface.
Authorization Algorithm
A request has attributes which correspond to the properties of a policy object.
When a request is received, the attributes are determined. Unknown attributes are set to the zero value of their type (e.g. empty string, 0, false).
A property set to "*" will match any value of the corresponding attribute.
The tuple of attributes is checked for a match against every policy in the policy file. If at least one line matches the request attributes, then the request is authorized (but may fail later validation).
To permit any authenticated user to do something, write a policy with the group property set to "system:authenticated".
To permit any unauthenticated user to do something, write a policy with the group property set to "system:unauthenticated".
To permit a user to do anything, write a policy with the apiGroup, namespace, resource, and nonResourcePath properties set to "*".
Kubectl
Kubectl uses the /api and /apis endpoints of api-server to discover served resource types, and validates objects sent to the API by create/update operations using schema information located at /openapi/v2.
When using ABAC authorization, those special resources have to be explicitly exposed via the nonResourcePath property in a policy (see examples below):
- /api, /api/*, /apis, and /apis/* for API version negotiation.
- /version for retrieving the server version via kubectl version.
- /swaggerapi/* for create/update operations.
To inspect the HTTP calls involved in a specific kubectl operation you can turn up the verbosity:
kubectl --v=8 version
Examples
- Alice can do anything to all resources:
  {"apiVersion": "abac.authorization.kubernetes.io/v1beta1", "kind": "Policy", "spec": {"user": "alice", "namespace": "*", "resource": "*", "apiGroup": "*"}}
- The Kubelet can read any pods:
  {"apiVersion": "abac.authorization.kubernetes.io/v1beta1", "kind": "Policy", "spec": {"user": "kubelet", "namespace": "*", "resource": "pods", "readonly": true}}
- The Kubelet can read and write events:
  {"apiVersion": "abac.authorization.kubernetes.io/v1beta1", "kind": "Policy", "spec": {"user": "kubelet", "namespace": "*", "resource": "events"}}
- Bob can just read pods in namespace "projectCaribou":
  {"apiVersion": "abac.authorization.kubernetes.io/v1beta1", "kind": "Policy", "spec": {"user": "bob", "namespace": "projectCaribou", "resource": "pods", "readonly": true}}
- Anyone can make read-only requests to all non-resource paths:
  {"apiVersion": "abac.authorization.kubernetes.io/v1beta1", "kind": "Policy", "spec": {"group": "system:authenticated", "readonly": true, "nonResourcePath": "*"}}
  {"apiVersion": "abac.authorization.kubernetes.io/v1beta1", "kind": "Policy", "spec": {"group": "system:unauthenticated", "readonly": true, "nonResourcePath": "*"}}
A quick note on service accounts
Every service account has a corresponding ABAC username, and that service account's user name is generated according to the naming convention:
system:serviceaccount:<namespace>:<serviceaccountname>
Creating a new namespace leads to the creation of a new service account in the following format:
system:serviceaccount:<namespace>:default
For example, if you wanted to grant the default service account (in the kube-system namespace) full privilege to the API using ABAC, you would add this line to your policy file:
{"apiVersion":"abac.authorization.kubernetes.io/v1beta1","kind":"Policy","spec":{"user":"system:serviceaccount:kube-system:default","namespace":"*","resource":"*","apiGroup":"*"}}
The apiserver will need to be restarted to pick up the new policy lines.
6.3.10 - Using Node Authorization
Node authorization is a special-purpose authorization mode that specifically authorizes API requests made by kubelets.
Overview
The Node authorizer allows a kubelet to perform API operations. This includes:
Read operations:
- services
- endpoints
- nodes
- pods
- secrets, configmaps, persistent volume claims and persistent volumes related to pods bound to the kubelet's node
Write operations:
- nodes and node status (enable the NodeRestriction admission plugin to limit a kubelet to modify its own node)
- pods and pod status (enable the NodeRestriction admission plugin to limit a kubelet to modify pods bound to itself)
- events
Auth-related operations:
- read/write access to the CertificateSigningRequests API for TLS bootstrapping
- the ability to create TokenReviews and SubjectAccessReviews for delegated authentication/authorization checks
In future releases, the node authorizer may add or remove permissions to ensure kubelets have the minimal set of permissions required to operate correctly.
In order to be authorized by the Node authorizer, kubelets must use a credential that identifies them as being in the system:nodes group, with a username of system:node:<nodeName>. This group and user name format match the identity created for each kubelet as part of kubelet TLS bootstrapping.
The value of <nodeName> must match precisely the name of the node as registered by the kubelet. By default, this is the host name as provided by hostname, or overridden via the kubelet option --hostname-override. However, when using the --cloud-provider kubelet option, the specific hostname may be determined by the cloud provider, ignoring the local hostname and the --hostname-override option.
For specifics about how the kubelet determines the hostname, see the kubelet options reference.
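For example, a kubelet client certificate issued via TLS bootstrapping typically carries a subject along these lines (the node name is illustrative):
O=system:nodes, CN=system:node:node-1.example.com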
To enable the Node authorizer, start the apiserver with --authorization-mode=Node.
To limit the API objects kubelets are able to write, enable the NodeRestriction admission plugin by starting the apiserver with --enable-admission-plugins=...,NodeRestriction,...
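Putting the two flags together, a kube-apiserver invocation might include the following (a sketch; all other required flags are omitted):
kube-apiserver --authorization-mode=Node,RBAC --enable-admission-plugins=NodeRestriction ...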
Migration considerations
Kubelets outside the system:nodes group
Kubelets outside the system:nodes group would not be authorized by the Node authorization mode, and would need to continue to be authorized via whatever mechanism currently authorizes them. The node admission plugin would not restrict requests from these kubelets.
Kubelets with undifferentiated usernames
In some deployments, kubelets have credentials that place them in the system:nodes group, but do not identify the particular node they are associated with, because they do not have a username in the system:node:... format.
These kubelets would not be authorized by the Node authorization mode, and would need to continue to be authorized via whatever mechanism currently authorizes them.
The NodeRestriction admission plugin would ignore requests from these kubelets, since the default node identifier implementation would not consider that a node identity.
Upgrades from previous versions using RBAC
Upgraded pre-1.7 clusters using RBAC will continue functioning as-is because the system:nodes group binding will already exist.
If a cluster admin wishes to start using the Node authorizer and NodeRestriction admission plugin to limit node access to the API, that can be done non-disruptively:
- Enable the Node authorization mode (--authorization-mode=Node,RBAC) and the NodeRestriction admission plugin
- Ensure all kubelets' credentials conform to the group/username requirements
- Audit apiserver logs to ensure the Node authorizer is not rejecting requests from kubelets (no persistent NODE DENY messages logged)
- Delete the system:node cluster role binding
RBAC Node Permissions
In 1.6, the system:node cluster role was automatically bound to the system:nodes group when using the RBAC Authorization mode.
In 1.7, the automatic binding of the system:nodes group to the system:node role is deprecated because the node authorizer accomplishes the same purpose with the benefit of additional restrictions on secret and configmap access. If the Node and RBAC authorization modes are both enabled, the automatic binding of the system:nodes group to the system:node role is not created in 1.7.
In 1.8, the binding will not be created at all.
When using RBAC, the system:node cluster role will continue to be created, for compatibility with deployment methods that bind other users or groups to that role.
6.3.11 - Mapping PodSecurityPolicies to Pod Security Standards
The tables below enumerate the configuration parameters on PodSecurityPolicy objects, whether the field mutates and/or validates pods, and how the configuration values map to the Pod Security Standards.
For each applicable parameter, the allowed values for the Baseline and Restricted profiles are listed. Anything outside the allowed values for those profiles would fall under the Privileged profile. "No opinion" means all values are allowed under all Pod Security Standards.
For a step-by-step migration guide, see Migrate from PodSecurityPolicy to the Built-In PodSecurity Admission Controller.
PodSecurityPolicy Spec
The fields enumerated in this table are part of the PodSecurityPolicySpec, which is specified under the .spec field path.
PodSecurityPolicySpec | Type | Pod Security Standards Equivalent |
---|---|---|
privileged | Validating | Baseline & Restricted: false / undefined / nil |
defaultAddCapabilities | Mutating & Validating | Requirements match allowedCapabilities below. |
allowedCapabilities | Validating | Baseline: subset of AUDIT_WRITE, CHOWN, DAC_OVERRIDE, FOWNER, FSETID, KILL, MKNOD, NET_BIND_SERVICE, SETFCAP, SETGID, SETPCAP, SETUID, SYS_CHROOT. Restricted: empty / undefined / nil OR a list containing only NET_BIND_SERVICE |
requiredDropCapabilities | Mutating & Validating | Baseline: no opinion. Restricted: must include ALL |
volumes | Validating | Baseline: anything except *, hostPath. Restricted: subset of configMap, csi, downwardAPI, emptyDir, ephemeral, persistentVolumeClaim, projected, secret |
hostNetwork | Validating | Baseline & Restricted: false / undefined / nil |
hostPorts | Validating | Baseline & Restricted: undefined / nil / empty |
hostPID | Validating | Baseline & Restricted: false / undefined / nil |
hostIPC | Validating | Baseline & Restricted: false / undefined / nil |
seLinux | Mutating & Validating | Baseline & Restricted: seLinuxOptions.user and seLinuxOptions.role are unset, and seLinuxOptions.type is unset or one of container_t, container_init_t, container_kvm_t |
runAsUser | Mutating & Validating | Baseline: Anything. Restricted: rule is MustRunAsNonRoot |
runAsGroup | Mutating (MustRunAs) & Validating | No opinion |
supplementalGroups | Mutating & Validating | No opinion |
fsGroup | Mutating & Validating | No opinion |
readOnlyRootFilesystem | Mutating & Validating | No opinion |
defaultAllowPrivilegeEscalation | Mutating | No opinion (non-validating) |
allowPrivilegeEscalation | Mutating & Validating | Only mutating if set to false. Baseline: No opinion. Restricted: false |
allowedHostPaths | Validating | No opinion (volumes takes precedence) |
allowedFlexVolumes | Validating | No opinion (volumes takes precedence) |
allowedCSIDrivers | Validating | No opinion (volumes takes precedence) |
allowedUnsafeSysctls | Validating | Baseline & Restricted: undefined / nil / empty |
forbiddenSysctls | Validating | No opinion |
allowedProcMountTypes (alpha feature) | Validating | Baseline & Restricted: ["Default"] OR undefined / nil / empty |
runtimeClass.defaultRuntimeClassName | Mutating | No opinion |
runtimeClass.allowedRuntimeClassNames | Validating | No opinion |
PodSecurityPolicy annotations
The annotations enumerated in this table can be specified under .metadata.annotations on the PodSecurityPolicy object.
PSP Annotation | Type | Pod Security Standards Equivalent |
---|---|---|
seccomp.security.alpha.kubernetes.io/defaultProfileName | Mutating | No opinion |
seccomp.security.alpha.kubernetes.io/allowedProfileNames | Validating | Baseline: any profile other than unconfined (undefined / nil allowed). Restricted: "runtime/default" or "localhost/*" (undefined / nil not allowed) |
apparmor.security.beta.kubernetes.io/defaultProfileName | Mutating | No opinion |
apparmor.security.beta.kubernetes.io/allowedProfileNames | Validating | Baseline: "runtime/default" or "localhost/*" (undefined / nil allowed). Restricted: same as Baseline |
6.3.12 - Webhook Mode
A WebHook is an HTTP callback: an HTTP POST that occurs when something happens; a simple event-notification via HTTP POST. A web application implementing WebHooks will POST a message to a URL when certain things happen.
When specified, mode Webhook causes Kubernetes to query an outside REST service when determining user privileges.
Configuration File Format
Mode Webhook requires a file for HTTP configuration, specified by the --authorization-webhook-config-file=SOME_FILENAME flag.
The configuration file uses the kubeconfig file format. Within the file "users" refers to the API Server webhook and "clusters" refers to the remote service.
A configuration example which uses HTTPS client auth:
# Kubernetes API version
apiVersion: v1
# kind of the API object
kind: Config
# clusters refers to the remote service.
clusters:
- name: name-of-remote-authz-service
  cluster:
    # CA for verifying the remote service.
    certificate-authority: /path/to/ca.pem
    # URL of remote service to query. Must use 'https'. May not include parameters.
    server: https://authz.example.com/authorize
# users refers to the API Server's webhook configuration.
users:
- name: name-of-api-server
  user:
    client-certificate: /path/to/cert.pem # cert for the webhook plugin to use
    client-key: /path/to/key.pem          # key matching the cert
# kubeconfig files require a context. Provide one for the API Server.
current-context: webhook
contexts:
- context:
    cluster: name-of-remote-authz-service
    user: name-of-api-server
  name: webhook
Request Payloads
When faced with an authorization decision, the API Server POSTs a JSON-serialized authorization.k8s.io/v1beta1 SubjectAccessReview object describing the action. This object contains fields describing the user attempting to make the request, and either details about the resource being accessed or request attributes.
Note that webhook API objects are subject to the same versioning compatibility rules as other Kubernetes API objects. Implementers should be aware of looser compatibility promises for beta objects and check the "apiVersion" field of the request to ensure correct deserialization. Additionally, the API Server must enable the authorization.k8s.io/v1beta1 API extensions group (--runtime-config=authorization.k8s.io/v1beta1=true).
An example request body:
{
  "apiVersion": "authorization.k8s.io/v1beta1",
  "kind": "SubjectAccessReview",
  "spec": {
    "resourceAttributes": {
      "namespace": "kittensandponies",
      "verb": "get",
      "group": "unicorn.example.org",
      "resource": "pods"
    },
    "user": "jane",
    "group": [
      "group1",
      "group2"
    ]
  }
}
The remote service is expected to fill the status field of the request and respond to either allow or disallow access. The response body's spec field is ignored and may be omitted. A permissive response would return:
{
  "apiVersion": "authorization.k8s.io/v1beta1",
  "kind": "SubjectAccessReview",
  "status": {
    "allowed": true
  }
}
For disallowing access there are two methods.
The first method is preferred in most cases, and indicates the authorization webhook does not allow, or has "no opinion" about the request, but if other authorizers are configured, they are given a chance to allow the request. If there are no other authorizers, or none of them allow the request, the request is forbidden. The webhook would return:
{
  "apiVersion": "authorization.k8s.io/v1beta1",
  "kind": "SubjectAccessReview",
  "status": {
    "allowed": false,
    "reason": "user does not have read access to the namespace"
  }
}
The second method denies immediately, short-circuiting evaluation by other configured authorizers. This should only be used by webhooks that have detailed knowledge of the full authorizer configuration of the cluster. The webhook would return:
{
  "apiVersion": "authorization.k8s.io/v1beta1",
  "kind": "SubjectAccessReview",
  "status": {
    "allowed": false,
    "denied": true,
    "reason": "user does not have read access to the namespace"
  }
}
Access to non-resource paths is sent as:
{
  "apiVersion": "authorization.k8s.io/v1beta1",
  "kind": "SubjectAccessReview",
  "spec": {
    "nonResourceAttributes": {
      "path": "/debug",
      "verb": "get"
    },
    "user": "jane",
    "group": [
      "group1",
      "group2"
    ]
  }
}
Non-resource paths include: /api, /apis, /metrics, /logs, /debug, /healthz, /livez, /openapi/v2, /readyz, and /version.
Clients require access to /api, /api/*, /apis, /apis/*, and /version to discover what resources and versions are present on the server. Access to other non-resource paths can be disallowed without restricting access to the REST api.
For further documentation refer to the authorization.v1beta1 API objects and webhook.go.
6.4 - Well-Known Labels, Annotations and Taints
Kubernetes reserves all labels and annotations in the kubernetes.io namespace.
This document serves both as a reference to the values and as a coordination point for assigning values.
Labels, annotations and taints used on API objects
kubernetes.io/arch
Example: kubernetes.io/arch=amd64
Used on: Node
The Kubelet populates this with runtime.GOARCH
as defined by Go. This can be handy if you are mixing arm and x86 nodes.
kubernetes.io/os
Example: kubernetes.io/os=linux
Used on: Node
The Kubelet populates this with runtime.GOOS
as defined by Go. This can be handy if you are mixing operating systems in your cluster (for example: mixing Linux and Windows nodes).
kubernetes.io/metadata.name
Example: kubernetes.io/metadata.name=mynamespace
Used on: Namespaces
The Kubernetes API server (part of the control plane) sets this label on all namespaces. The label value is set to the name of the namespace. You can't change this label's value.
This is useful if you want to target a specific namespace with a label selector.
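For example (the namespace name is illustrative):
kubectl get namespaces -l kubernetes.io/metadata.name=mynamespace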
beta.kubernetes.io/arch (deprecated)
This label has been deprecated. Please use kubernetes.io/arch
instead.
beta.kubernetes.io/os (deprecated)
This label has been deprecated. Please use kubernetes.io/os
instead.
kubernetes.io/hostname
Example: kubernetes.io/hostname=ip-172-20-114-199.ec2.internal
Used on: Node
The Kubelet populates this label with the hostname. Note that the hostname can be changed from the "actual" hostname by passing the --hostname-override flag to the kubelet.
This label is also used as part of the topology hierarchy. See topology.kubernetes.io/zone for more information.
kubernetes.io/change-cause
Example: kubernetes.io/change-cause=kubectl edit --record deployment foo
Used on: All Objects
This annotation is a best guess at why something was changed. It is populated when adding --record to a kubectl command that may change an object.
kubernetes.io/description
Example: kubernetes.io/description: "Description of K8s object."
Used on: All Objects
This annotation is used for describing the specific behaviour of a given object.
kubernetes.io/enforce-mountable-secrets
Example: kubernetes.io/enforce-mountable-secrets: "true"
Used on: ServiceAccount
The value for this annotation must be true to take effect. This annotation indicates that pods running as this service account may only reference Secret API objects specified in the service account's secrets field.
controller.kubernetes.io/pod-deletion-cost
Example: controller.kubernetes.io/pod-deletion-cost=10
Used on: Pod
This annotation is used to set Pod Deletion Cost, which allows users to influence ReplicaSet downscaling order. The annotation parses into an int32 type.
beta.kubernetes.io/instance-type (deprecated)
This label has been deprecated. Please use node.kubernetes.io/instance-type instead.
node.kubernetes.io/instance-type
Example: node.kubernetes.io/instance-type=m3.medium
Used on: Node
The Kubelet populates this with the instance type as defined by the cloudprovider. This will be set only if you are using a cloudprovider. This setting is handy if you want to target certain workloads to certain instance types, but typically you want to rely on the Kubernetes scheduler to perform resource-based scheduling. You should aim to schedule based on properties rather than on instance types (for example: require a GPU, instead of requiring a g2.2xlarge).
failure-domain.beta.kubernetes.io/region (deprecated)
See topology.kubernetes.io/region.
failure-domain.beta.kubernetes.io/zone (deprecated)
See topology.kubernetes.io/zone.
statefulset.kubernetes.io/pod-name
Example:
statefulset.kubernetes.io/pod-name=mystatefulset-7
When a StatefulSet controller creates a Pod for the StatefulSet, the control plane sets this label on that Pod. The value of the label is the name of the Pod being created.
See Pod Name Label in the StatefulSet topic for more details.
topology.kubernetes.io/region
Example:
topology.kubernetes.io/region=us-east-1
See topology.kubernetes.io/zone.
topology.kubernetes.io/zone
Example:
topology.kubernetes.io/zone=us-east-1c
Used on: Node, PersistentVolume
On Node: The kubelet or the external cloud-controller-manager populates this with the information as provided by the cloudprovider. This will be set only if you are using a cloudprovider. However, you should consider setting this on nodes if it makes sense in your topology.
On PersistentVolume: topology-aware volume provisioners will automatically set node affinity constraints on PersistentVolumes.
A zone represents a logical failure domain. It is common for Kubernetes clusters to span multiple zones for increased availability. While the exact definition of a zone is left to infrastructure implementations, common properties of a zone include very low network latency within a zone, no-cost network traffic within a zone, and failure independence from other zones. For example, nodes within a zone might share a network switch, but nodes in different zones should not.
A region represents a larger domain, made up of one or more zones. It is uncommon for Kubernetes clusters to span multiple regions. While the exact definition of a zone or region is left to infrastructure implementations, common properties of a region include higher network latency between them than within them, non-zero cost for network traffic between them, and failure independence from other zones or regions. For example, nodes within a region might share power infrastructure (e.g. a UPS or generator), but nodes in different regions typically would not.
Kubernetes makes a few assumptions about the structure of zones and regions:
- regions and zones are hierarchical: zones are strict subsets of regions and no zone can be in 2 regions
- zone names are unique across regions; for example region "africa-east-1" might be comprised of zones "africa-east-1a" and "africa-east-1b"
It should be safe to assume that topology labels do not change. Even though labels are strictly mutable, consumers of them can assume that a given node is not going to be moved between zones without being destroyed and recreated.
Kubernetes can use this information in various ways. For example, the scheduler automatically tries to spread the Pods in a ReplicaSet across nodes in a single-zone cluster (to reduce the impact of node failures, see kubernetes.io/hostname). With multiple-zone clusters, this spreading behavior also applies to zones (to reduce the impact of zone failures). This is achieved via SelectorSpreadPriority.
SelectorSpreadPriority is a best effort placement. If the zones in your cluster are heterogeneous (for example: different numbers of nodes, different types of nodes, or different pod resource requirements), this placement might prevent equal spreading of your Pods across zones. If desired, you can use homogenous zones (same number and types of nodes) to reduce the probability of unequal spreading.
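You can also spread Pods across zones explicitly with a topology spread constraint keyed on this label; a minimal sketch of the relevant pod spec excerpt (the app: foo selector is hypothetical):
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: foo # hypothetical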
The scheduler (through the VolumeZonePredicate predicate) will also ensure that Pods that claim a given volume are only placed into the same zone as that volume. Volumes cannot be attached across zones.
If PersistentVolumeLabel does not support automatic labeling of your PersistentVolumes, you should consider adding the labels manually (or adding support for PersistentVolumeLabel). With PersistentVolumeLabel, the scheduler prevents Pods from mounting volumes in a different zone. If your infrastructure doesn't have this constraint, you don't need to add the zone labels to the volumes at all.
volume.beta.kubernetes.io/storage-provisioner (deprecated)
Example: volume.beta.kubernetes.io/storage-provisioner: k8s.io/minikube-hostpath
Used on: PersistentVolumeClaim
This annotation has been deprecated.
volume.beta.kubernetes.io/mount-options (deprecated)
Example : volume.beta.kubernetes.io/mount-options: "ro,soft"
Used on: PersistentVolume
A Kubernetes administrator can specify additional mount options for when a PersistentVolume is mounted on a node.
This annotation has been deprecated.
volume.kubernetes.io/storage-provisioner
Used on: PersistentVolumeClaim
This annotation is added to PersistentVolumeClaims that require dynamic provisioning.
node.kubernetes.io/windows-build
Example: node.kubernetes.io/windows-build=10.0.17763
Used on: Node
When the kubelet is running on Microsoft Windows, it automatically labels its node to record the version of Windows Server in use.
The label's value is in the format "MajorVersion.MinorVersion.BuildNumber".
service.kubernetes.io/headless
Example: service.kubernetes.io/headless=""
Used on: Service
The control plane adds this label to an Endpoints object when the owning Service is headless.
kubernetes.io/service-name
Example: kubernetes.io/service-name="nginx"
Used on: Service
Kubernetes uses this label to differentiate multiple Services. Used currently for ELB (Elastic Load Balancer) only.
endpointslice.kubernetes.io/managed-by
Example: endpointslice.kubernetes.io/managed-by="controller"
Used on: EndpointSlices
The label is used to indicate the controller or entity that manages an EndpointSlice. This label aims to enable different EndpointSlice objects to be managed by different controllers or entities within the same cluster.
endpointslice.kubernetes.io/skip-mirror
Example: endpointslice.kubernetes.io/skip-mirror="true"
Used on: Endpoints
The label can be set to "true" on an Endpoints resource to indicate that the EndpointSliceMirroring controller should not mirror this resource with EndpointSlices.
service.kubernetes.io/service-proxy-name
Example: service.kubernetes.io/service-proxy-name="foo-bar"
Used on: Service
This label is used to delegate control of a Service to a custom proxy; kube-proxy skips Services that carry it.
experimental.windows.kubernetes.io/isolation-type (deprecated)
Example: experimental.windows.kubernetes.io/isolation-type: "hyperv"
Used on: Pod
The annotation is used to run Windows containers with Hyper-V isolation. To use the Hyper-V isolation feature and create a Hyper-V isolated container, the kubelet should be started with the HyperVContainer feature gate enabled, and the Pod should include the annotation experimental.windows.kubernetes.io/isolation-type: hyperv.
ingressclass.kubernetes.io/is-default-class
Example: ingressclass.kubernetes.io/is-default-class: "true"
Used on: IngressClass
When a single IngressClass resource has this annotation set to "true", new Ingress resources without a class specified will be assigned this default class.
kubernetes.io/ingress.class (deprecated)
This annotation has been deprecated. Please use the spec.ingressClassName field on Ingress resources instead.
storageclass.kubernetes.io/is-default-class
Example: storageclass.kubernetes.io/is-default-class=true
Used on: StorageClass
When a single StorageClass resource has this annotation set to "true", new PersistentVolumeClaim resources without a class specified will be assigned this default class.
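For example, to mark an existing StorageClass as the default (the name standard is illustrative):
kubectl patch storageclass standard -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'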
alpha.kubernetes.io/provided-node-ip
Example: alpha.kubernetes.io/provided-node-ip: "10.0.0.1"
Used on: Node
The kubelet can set this annotation on a Node to denote its configured IPv4 address.
When the kubelet is started with the "external" cloud provider, it sets this annotation on the Node to denote an IP address set from the command line flag (--node-ip). This IP is verified with the cloud provider as valid by the cloud-controller-manager.
batch.kubernetes.io/job-completion-index
Example: batch.kubernetes.io/job-completion-index: "3"
Used on: Pod
The Job controller in the kube-controller-manager sets this annotation for Pods created with Indexed completion mode.
kubectl.kubernetes.io/default-container
Example: kubectl.kubernetes.io/default-container: "front-end-app"
Used on: Pod
The value of the annotation is the container name that is the default for this Pod. For example, kubectl logs or kubectl exec without the -c or --container flag will use this default container.
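For example, marking the main container of a two-container Pod as the default (names are illustrative):
apiVersion: v1
kind: Pod
metadata:
  name: web
  annotations:
    kubectl.kubernetes.io/default-container: front-end-app
spec:
  containers:
  - name: front-end-app      # kubectl logs/exec default to this container
    image: nginx
  - name: logging-sidecar
    image: busybox
    command: ["sleep", "3600"]
With this annotation in place, kubectl logs web streams logs from front-end-app without needing -c.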
endpoints.kubernetes.io/over-capacity
Example: endpoints.kubernetes.io/over-capacity:truncated
Used on: Endpoints
In Kubernetes v1.22 and later, the Endpoints controller adds this annotation to an Endpoints resource if it has more than 1000 endpoints. The annotation indicates that the Endpoints resource is over capacity and that the number of endpoints has been truncated to 1000.
batch.kubernetes.io/job-tracking
Example: batch.kubernetes.io/job-tracking: ""
Used on: Jobs
The presence of this annotation on a Job indicates that the control plane is tracking the Job status using finalizers. You should not manually add or remove this annotation.
scheduler.alpha.kubernetes.io/preferAvoidPods (deprecated)
Used on: Nodes
This annotation requires the NodePreferAvoidPods scheduling plugin to be enabled. The plugin is deprecated since Kubernetes 1.22. Use Taints and Tolerations instead.
The taints listed below are always used on Nodes.
node.kubernetes.io/not-ready
Example: node.kubernetes.io/not-ready:NoExecute
The node controller detects whether a node is ready by monitoring its health and adds or removes this taint accordingly.
node.kubernetes.io/unreachable
Example: node.kubernetes.io/unreachable:NoExecute
The node controller adds the taint to a node corresponding to the NodeCondition Ready being Unknown.
node.kubernetes.io/unschedulable
Example: node.kubernetes.io/unschedulable:NoSchedule
This taint is added to a node during initialization to avoid a race condition.
node.kubernetes.io/memory-pressure
Example: node.kubernetes.io/memory-pressure:NoSchedule
The kubelet detects memory pressure based on memory.available and allocatableMemory.available observed on a Node. The observed values are then compared to the corresponding thresholds that can be set on the kubelet to determine if the Node condition and taint should be added or removed.
node.kubernetes.io/disk-pressure
Example: node.kubernetes.io/disk-pressure:NoSchedule
The kubelet detects disk pressure based on imagefs.available, imagefs.inodesFree, nodefs.available and nodefs.inodesFree (Linux only) observed on a Node. The observed values are then compared to the corresponding thresholds that can be set on the kubelet to determine if the Node condition and taint should be added or removed.
node.kubernetes.io/network-unavailable
Example: node.kubernetes.io/network-unavailable:NoSchedule
This is initially set by the kubelet when the cloud provider used indicates a requirement for additional network configuration. Only when the route on the cloud is configured properly will the taint be removed by the cloud provider.
node.kubernetes.io/pid-pressure
Example: node.kubernetes.io/pid-pressure:NoSchedule
The kubelet compares the size of /proc/sys/kernel/pid_max with the number of PIDs consumed by Kubernetes on a node to derive the number of available PIDs, reported as the pid.available metric. The metric is then compared to the corresponding threshold that can be set on the kubelet to determine if the node condition and taint should be added or removed.
node.cloudprovider.kubernetes.io/uninitialized
Example: node.cloudprovider.kubernetes.io/uninitialized:NoSchedule
When the kubelet is started with the "external" cloud provider, this taint is set on a node to mark it as unusable until a controller from the cloud-controller-manager initializes the node, at which point the taint is removed.
node.cloudprovider.kubernetes.io/shutdown
Example: node.cloudprovider.kubernetes.io/shutdown:NoSchedule
If a Node is in a cloud provider specified shutdown state, the Node gets tainted accordingly with node.cloudprovider.kubernetes.io/shutdown and the taint effect NoSchedule.
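As a sketch, a Pod can tolerate one of these taints for a bounded time; here it stays bound for five minutes after its node becomes unreachable before being evicted:
spec:
  tolerations:
  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 300   # evict 5 minutes after the taint appears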
pod-security.kubernetes.io/enforce
Example: pod-security.kubernetes.io/enforce: baseline
Used on: Namespace
Value must be one of privileged, baseline, or restricted, which correspond to Pod Security Standard levels. Specifically, the enforce label prohibits the creation of any Pod in the labeled Namespace which does not meet the requirements outlined in the indicated level.
See Enforcing Pod Security at the Namespace Level for more information.
pod-security.kubernetes.io/enforce-version
Example: pod-security.kubernetes.io/enforce-version: 1.23
Used on: Namespace
Value must be latest or a valid Kubernetes version in the format v<MAJOR>.<MINOR>. This determines the version of the Pod Security Standard policies to apply when validating a submitted Pod.
See Enforcing Pod Security at the Namespace Level for more information.
pod-security.kubernetes.io/audit
Example: pod-security.kubernetes.io/audit: baseline
Used on: Namespace
Value must be one of privileged, baseline, or restricted, which correspond to Pod Security Standard levels. Specifically, the audit label does not prevent the creation of a Pod in the labeled Namespace which does not meet the requirements outlined in the indicated level, but adds an audit annotation to that Pod.
See Enforcing Pod Security at the Namespace Level for more information.
pod-security.kubernetes.io/audit-version
Example: pod-security.kubernetes.io/audit-version: 1.23
Used on: Namespace
Value must be latest or a valid Kubernetes version in the format v<MAJOR>.<MINOR>. This determines the version of the Pod Security Standard policies to apply when validating a submitted Pod.
See Enforcing Pod Security at the Namespace Level for more information.
pod-security.kubernetes.io/warn
Example: pod-security.kubernetes.io/warn: baseline
Used on: Namespace
Value must be one of privileged, baseline, or restricted, which correspond to Pod Security Standard levels. Specifically, the warn label does not prevent the creation of a Pod in the labeled Namespace which does not meet the requirements outlined in the indicated level, but returns a warning to the user after doing so. Note that warnings are also displayed when creating or updating objects that contain Pod templates, such as Deployments, Jobs, StatefulSets, etc.
See Enforcing Pod Security at the Namespace Level for more information.
pod-security.kubernetes.io/warn-version
Example: pod-security.kubernetes.io/warn-version: 1.23
Used on: Namespace
Value must be latest or a valid Kubernetes version in the format v<MAJOR>.<MINOR>. This determines the version of the Pod Security Standard policies to apply when validating a submitted Pod. Note that warnings are also displayed when creating or updating objects that contain Pod templates, such as Deployments, Jobs, StatefulSets, etc.
See Enforcing Pod Security at the Namespace Level for more information.
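Putting the pod-security labels together, a Namespace might enforce baseline while auditing and warning against restricted. A minimal sketch (the namespace name is illustrative):
apiVersion: v1
kind: Namespace
metadata:
  name: example-namespace
  labels:
    pod-security.kubernetes.io/enforce: baseline   # reject Pods that violate baseline
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/audit: restricted   # record violations of restricted
    pod-security.kubernetes.io/warn: restricted    # warn users about them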
seccomp.security.alpha.kubernetes.io/pod (deprecated)
This annotation has been deprecated since Kubernetes v1.19 and will become non-functional in v1.25.
To specify security settings for a Pod, include the securityContext field in the Pod specification. The securityContext field within a Pod's .spec defines pod-level security attributes. When you specify the security context for a Pod, the settings apply to all containers in that Pod.
container.seccomp.security.alpha.kubernetes.io/[NAME] (deprecated)
This annotation has been deprecated since Kubernetes v1.19 and will become non-functional in v1.25.
The tutorial Restrict a Container's Syscalls with seccomp takes you through the steps to apply a seccomp profile to a Pod or to one of its containers. That tutorial covers the supported mechanism for configuring seccomp in Kubernetes, based on setting securityContext within the Pod's .spec.
Annotations used for audit
authorization.k8s.io/decision
authorization.k8s.io/reason
pod-security.kubernetes.io/audit-violations
pod-security.kubernetes.io/enforce-policy
pod-security.kubernetes.io/exempt
See more details on the Audit Annotations page.
6.4.1 - Audit Annotations
This page serves as a reference for the audit annotations of the kubernetes.io namespace. These annotations apply to Event objects from the API group audit.k8s.io.
The annotations apply to audit events. Audit events are different from objects in the Event API (API group events.k8s.io).
pod-security.kubernetes.io/exempt
Example: pod-security.kubernetes.io/exempt: namespace
Value must be one of user, namespace, or runtimeClass, which correspond to Pod Security Exemption dimensions. This annotation indicates the dimension on which the exemption from PodSecurity enforcement was based.
pod-security.kubernetes.io/enforce-policy
Example: pod-security.kubernetes.io/enforce-policy: restricted:latest
Value must be privileged:<version>, baseline:<version>, or restricted:<version>, which correspond to Pod Security Standard levels accompanied by a version, which must be latest or a valid Kubernetes version in the format v<MAJOR>.<MINOR>. This annotation informs about the enforcement level that allowed or denied the pod during PodSecurity admission.
See Pod Security Standards for more information.
pod-security.kubernetes.io/audit-violations
Example: pod-security.kubernetes.io/audit-violations: would violate PodSecurity "restricted:latest": allowPrivilegeEscalation != false (container "example" must set securityContext.allowPrivilegeEscalation=false), ...
The value details an audit policy violation: it contains the Pod Security Standard level that was transgressed as well as the specific policies and fields that were violated during PodSecurity enforcement.
See Pod Security Standards for more information.
authorization.k8s.io/decision
Example: authorization.k8s.io/decision: "forbid"
This annotation indicates whether or not a request was authorized in Kubernetes audit logs.
See Auditing for more information.
authorization.k8s.io/reason
Example: authorization.k8s.io/reason: "Human-readable reason for the decision"
This annotation gives the reason for the decision in Kubernetes audit logs.
See Auditing for more information.
6.5 - Kubernetes API
Kubernetes' API is the application that serves Kubernetes functionality through a RESTful interface and stores the state of the cluster.
Kubernetes resources and "records of intent" are all stored as API objects, and modified via RESTful calls to the API. The API allows configuration to be managed in a declarative way. Users can interact with the Kubernetes API directly, or via tools like kubectl. The core Kubernetes API is flexible and can also be extended to support custom resources.
6.5.1 - Workload Resources
6.5.1.1 - Pod
apiVersion: v1
import "k8s.io/api/core/v1"
Pod
Pod is a collection of containers that can run on a host. This resource is created by clients and scheduled onto hosts.
-
apiVersion: v1
-
kind: Pod
-
metadata (ObjectMeta)
Standard object's metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
spec (PodSpec)
Specification of the desired behavior of the pod. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#spec-and-status
-
status (PodStatus)
Most recently observed status of the pod. This data may not be up to date. Populated by the system. Read-only. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#spec-and-status
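For orientation, a minimal Pod manifest showing these top-level fields; the image is illustrative:
apiVersion: v1
kind: Pod
metadata:
  name: nginx          # metadata: standard object metadata
spec:                  # spec: desired behavior of the Pod
  containers:
  - name: nginx
    image: nginx:1.21
(status is populated by the system and is read-only.)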
PodSpec
PodSpec is a description of a pod.
Containers
-
containers ([]Container), required
Patch strategy: merge on key
name
List of containers belonging to the pod. Containers cannot currently be added or removed. There must be at least one container in a Pod. Cannot be updated.
-
initContainers ([]Container)
Patch strategy: merge on key
name
List of initialization containers belonging to the pod. Init containers are executed in order prior to containers being started. If any init container fails, the pod is considered to have failed and is handled according to its restartPolicy. The name for an init container or normal container must be unique among all containers. Init containers may not have Lifecycle actions, Readiness probes, Liveness probes, or Startup probes. The resourceRequirements of an init container are taken into account during scheduling by finding the highest request/limit for each resource type, and then using the max of that value or the sum of the normal containers. Limits are applied to init containers in a similar fashion. Init containers cannot currently be added or removed. Cannot be updated. More info: https://kubernetes.io/docs/concepts/workloads/pods/init-containers/
-
imagePullSecrets ([]LocalObjectReference)
Patch strategy: merge on key
name
ImagePullSecrets is an optional list of references to secrets in the same namespace to use for pulling any of the images used by this PodSpec. If specified, these secrets will be passed to individual puller implementations for them to use. For example, in the case of docker, only DockerConfig type secrets are honored. More info: https://kubernetes.io/docs/concepts/containers/images#specifying-imagepullsecrets-on-a-pod
-
enableServiceLinks (boolean)
EnableServiceLinks indicates whether information about services should be injected into pod's environment variables, matching the syntax of Docker links. Optional: Defaults to true.
-
os (PodOS)
Specifies the OS of the containers in the pod. Some pod and container fields are restricted if this is set.
If the OS field is set to linux, the following fields must be unset: - securityContext.windowsOptions
If the OS field is set to windows, the following fields must be unset: - spec.hostPID - spec.hostIPC - spec.securityContext.seLinuxOptions - spec.securityContext.seccompProfile - spec.securityContext.fsGroup - spec.securityContext.fsGroupChangePolicy - spec.securityContext.sysctls - spec.shareProcessNamespace - spec.securityContext.runAsUser - spec.securityContext.runAsGroup - spec.securityContext.supplementalGroups - spec.containers[*].securityContext.seLinuxOptions - spec.containers[*].securityContext.seccompProfile - spec.containers[*].securityContext.capabilities - spec.containers[*].securityContext.readOnlyRootFilesystem - spec.containers[*].securityContext.privileged - spec.containers[*].securityContext.allowPrivilegeEscalation - spec.containers[*].securityContext.procMount - spec.containers[*].securityContext.runAsUser - spec.containers[*].securityContext.runAsGroup. This is an alpha field and requires the IdentifyPodOS feature gate.
PodOS defines the OS parameters of a pod.
-
os.name (string), required
Name is the name of the operating system. The currently supported values are linux and windows. Additional values may be defined in the future and can be one of: https://github.com/opencontainers/runtime-spec/blob/master/config.md#platform-specific-configuration Clients should expect to handle additional values and treat unrecognized values in this field as os: null
-
Volumes
-
volumes ([]Volume)
Patch strategies: retainKeys, merge on key
name
List of volumes that can be mounted by containers belonging to the pod. More info: https://kubernetes.io/docs/concepts/storage/volumes
Scheduling
-
nodeSelector (map[string]string)
NodeSelector is a selector which must be true for the pod to fit on a node. Selector which must match a node's labels for the pod to be scheduled on that node. More info: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/
-
nodeName (string)
NodeName is a request to schedule this pod onto a specific node. If it is non-empty, the scheduler simply schedules this pod onto that node, assuming that it fits resource requirements.
-
affinity (Affinity)
If specified, the pod's scheduling constraints
Affinity is a group of affinity scheduling rules.
-
affinity.nodeAffinity (NodeAffinity)
Describes node affinity scheduling rules for the pod.
-
affinity.podAffinity (PodAffinity)
Describes pod affinity scheduling rules (e.g. co-locate this pod in the same node, zone, etc. as some other pod(s)).
-
affinity.podAntiAffinity (PodAntiAffinity)
Describes pod anti-affinity scheduling rules (e.g. avoid putting this pod in the same node, zone, etc. as some other pod(s)).
-
-
tolerations ([]Toleration)
If specified, the pod's tolerations.
The pod this Toleration is attached to tolerates any taint that matches the triple <key,value,effect> using the matching operator.
-
tolerations.key (string)
Key is the taint key that the toleration applies to. Empty means match all taint keys. If the key is empty, operator must be Exists; this combination means to match all values and all keys.
-
tolerations.operator (string)
Operator represents a key's relationship to the value. Valid operators are Exists and Equal. Defaults to Equal. Exists is equivalent to wildcard for value, so that a pod can tolerate all taints of a particular category.
Possible enum values:
"Equal"
"Exists"
-
tolerations.value (string)
Value is the taint value the toleration matches to. If the operator is Exists, the value should be empty, otherwise just a regular string.
-
tolerations.effect (string)
Effect indicates the taint effect to match. Empty means match all taint effects. When specified, allowed values are NoSchedule, PreferNoSchedule and NoExecute.
Possible enum values:
"NoExecute"
Evict any already-running pods that do not tolerate the taint. Currently enforced by NodeController."NoSchedule"
Do not allow new pods to schedule onto the node unless they tolerate the taint, but allow all pods submitted to Kubelet without going through the scheduler to start, and allow all already-running pods to continue running. Enforced by the scheduler."PreferNoSchedule"
Like TaintEffectNoSchedule, but the scheduler tries not to schedule new pods onto the node, rather than prohibiting new pods from scheduling onto the node entirely. Enforced by the scheduler.
-
tolerations.tolerationSeconds (int64)
TolerationSeconds represents the period of time the toleration (which must be of effect NoExecute, otherwise this field is ignored) tolerates the taint. By default, it is not set, which means tolerate the taint forever (do not evict). Zero and negative values will be treated as 0 (evict immediately) by the system.
-
-
schedulerName (string)
If specified, the pod will be dispatched by specified scheduler. If not specified, the pod will be dispatched by default scheduler.
-
runtimeClassName (string)
RuntimeClassName refers to a RuntimeClass object in the node.k8s.io group, which should be used to run this pod. If no RuntimeClass resource matches the named class, the pod will not be run. If unset or empty, the "legacy" RuntimeClass will be used, which is an implicit class with an empty definition that uses the default runtime handler. More info: https://git.k8s.io/enhancements/keps/sig-node/585-runtime-class This is a beta feature as of Kubernetes v1.14.
-
priorityClassName (string)
If specified, indicates the pod's priority. "system-node-critical" and "system-cluster-critical" are two special keywords which indicate the highest priorities with the former being the highest priority. Any other name must be defined by creating a PriorityClass object with that name. If not specified, the pod priority will be default or zero if there is no default.
-
priority (int32)
The priority value. Various system components use this field to find the priority of the pod. When Priority Admission Controller is enabled, it prevents users from setting this field. The admission controller populates this field from PriorityClassName. The higher the value, the higher the priority.
-
topologySpreadConstraints ([]TopologySpreadConstraint)
Patch strategy: merge on key
topologyKey
Map: unique values on keys topologyKey, whenUnsatisfiable will be kept during a merge.
TopologySpreadConstraints describes how a group of pods ought to spread across topology domains. Scheduler will schedule pods in a way which abides by the constraints. All topologySpreadConstraints are ANDed.
TopologySpreadConstraint specifies how to spread matching pods among the given topology.
-
topologySpreadConstraints.maxSkew (int32), required
MaxSkew describes the degree to which pods may be unevenly distributed. When whenUnsatisfiable=DoNotSchedule, it is the maximum permitted difference between the number of matching pods in the target topology and the global minimum. For example, in a 3-zone cluster, MaxSkew is set to 1, and pods with the same labelSelector spread as 1/1/0: | zone1 | zone2 | zone3 | | P | P | | - if MaxSkew is 1, the incoming pod can only be scheduled to zone3 to become 1/1/1; scheduling it onto zone1 (or zone2) would make the ActualSkew (2-0) on zone1 (zone2) violate MaxSkew (1). If MaxSkew is 2, the incoming pod can be scheduled onto any zone. When whenUnsatisfiable=ScheduleAnyway, it is used to give higher precedence to topologies that satisfy it. It's a required field. Default value is 1 and 0 is not allowed.
-
topologySpreadConstraints.topologyKey (string), required
TopologyKey is the key of node labels. Nodes that have a label with this key and identical values are considered to be in the same topology. We consider each <key, value> as a "bucket", and try to put balanced number of pods into each bucket. It's a required field.
-
topologySpreadConstraints.whenUnsatisfiable (string), required
WhenUnsatisfiable indicates how to deal with a pod if it doesn't satisfy the spread constraint. - DoNotSchedule (default) tells the scheduler not to schedule it. - ScheduleAnyway tells the scheduler to schedule the pod in any location, but giving higher precedence to topologies that would help reduce the skew. A constraint is considered "Unsatisfiable" for an incoming pod if and only if every possible node assignment for that pod would violate "MaxSkew" on some topology. For example, in a 3-zone cluster, MaxSkew is set to 1, and pods with the same labelSelector spread as 3/1/1: | zone1 | zone2 | zone3 | | P P P | P | P | If WhenUnsatisfiable is set to DoNotSchedule, incoming pod can only be scheduled to zone2(zone3) to become 3/2/1(3/1/2) as ActualSkew(2-1) on zone2(zone3) satisfies MaxSkew(1). In other words, the cluster can still be imbalanced, but scheduler won't make it more imbalanced. It's a required field.
Possible enum values:
"DoNotSchedule"
instructs the scheduler not to schedule the pod when constraints are not satisfied."ScheduleAnyway"
instructs the scheduler to schedule the pod even if constraints are not satisfied.
-
topologySpreadConstraints.labelSelector (LabelSelector)
LabelSelector is used to find matching pods. Pods that match this label selector are counted to determine the number of pods in their corresponding topology domain.
-
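A sketch of a Pod spec combining several of the scheduling fields above; the labels, keys, and values are illustrative:
spec:
  nodeSelector:
    disktype: ssd              # only schedule onto nodes labeled disktype=ssd
  tolerations:
  - key: example.com/special   # tolerate a hypothetical custom taint
    operator: Exists
    effect: NoSchedule
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: web               # spread pods labeled app=web evenly across zones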
Lifecycle
-
restartPolicy (string)
Restart policy for all containers within the pod. One of Always, OnFailure, Never. Default to Always. More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#restart-policy
Possible enum values:
"Always"
"Never"
"OnFailure"
-
terminationGracePeriodSeconds (int64)
Optional duration in seconds the pod needs to terminate gracefully. May be decreased in delete request. Value must be non-negative integer. The value zero indicates stop immediately via the kill signal (no opportunity to shut down). If this value is nil, the default grace period will be used instead. The grace period is the duration in seconds after the processes running in the pod are sent a termination signal and the time when the processes are forcibly halted with a kill signal. Set this value longer than the expected cleanup time for your process. Defaults to 30 seconds.
-
activeDeadlineSeconds (int64)
Optional duration in seconds the pod may be active on the node relative to StartTime before the system will actively try to mark it failed and kill associated containers. Value must be a positive integer.
-
readinessGates ([]PodReadinessGate)
If specified, all readiness gates will be evaluated for pod readiness. A pod is ready when all its containers are ready AND all conditions specified in the readiness gates have status equal to "True" More info: https://git.k8s.io/enhancements/keps/sig-network/580-pod-readiness-gates
PodReadinessGate contains the reference to a pod condition
-
readinessGates.conditionType (string), required
ConditionType refers to a condition in the pod's condition list with matching type.
Possible enum values:
"ContainersReady"
indicates whether all containers in the pod are ready."Initialized"
means that all init containers in the pod have started successfully."PodScheduled"
represents status of the scheduling process for this pod."Ready"
means the pod is able to service requests and should be added to the load balancing pools of all matching services.
-
Hostname and Name resolution
-
hostname (string)
Specifies the hostname of the Pod. If not specified, the pod's hostname will be set to a system-defined value.
-
setHostnameAsFQDN (boolean)
If true the pod's hostname will be configured as the pod's FQDN, rather than the leaf name (the default). In Linux containers, this means setting the FQDN in the hostname field of the kernel (the nodename field of struct utsname). In Windows containers, this means setting the registry value of hostname for the registry key HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters to FQDN. If a pod does not have FQDN, this has no effect. Default to false.
-
subdomain (string)
If specified, the fully qualified Pod hostname will be "<hostname>.<subdomain>.<pod namespace>.svc.<cluster domain>". If not specified, the pod will not have a domainname at all.
-
hostAliases ([]HostAlias)
Patch strategy: merge on key
ip
HostAliases is an optional list of hosts and IPs that will be injected into the pod's hosts file if specified. This is only valid for non-hostNetwork pods.
HostAlias holds the mapping between IP and hostnames that will be injected as an entry in the pod's hosts file.
-
hostAliases.hostnames ([]string)
Hostnames for the above IP address.
-
hostAliases.ip (string)
IP address of the host file entry.
-
-
dnsConfig (PodDNSConfig)
Specifies the DNS parameters of a pod. Parameters specified here will be merged to the generated DNS configuration based on DNSPolicy.
PodDNSConfig defines the DNS parameters of a pod in addition to those generated from DNSPolicy.
-
dnsConfig.nameservers ([]string)
A list of DNS name server IP addresses. This will be appended to the base nameservers generated from DNSPolicy. Duplicated nameservers will be removed.
-
dnsConfig.options ([]PodDNSConfigOption)
A list of DNS resolver options. This will be merged with the base options generated from DNSPolicy. Duplicated entries will be removed. Resolution options given in Options will override those that appear in the base DNSPolicy.
PodDNSConfigOption defines DNS resolver options of a pod.
-
dnsConfig.options.name (string)
Required.
-
dnsConfig.options.value (string)
-
-
dnsConfig.searches ([]string)
A list of DNS search domains for host-name lookup. This will be appended to the base search paths generated from DNSPolicy. Duplicated search paths will be removed.
-
-
dnsPolicy (string)
Set DNS policy for the pod. Defaults to "ClusterFirst". Valid values are 'ClusterFirstWithHostNet', 'ClusterFirst', 'Default' or 'None'. DNS parameters given in DNSConfig will be merged with the policy selected with DNSPolicy. To have DNS options set along with hostNetwork, you have to specify DNS policy explicitly to 'ClusterFirstWithHostNet'.
Possible enum values:
"ClusterFirst"
indicates that the pod should use cluster DNS first unless hostNetwork is true, if it is available, then fall back on the default (as determined by kubelet) DNS settings."ClusterFirstWithHostNet"
indicates that the pod should use cluster DNS first, if it is available, then fall back on the default (as determined by kubelet) DNS settings."Default"
indicates that the pod should use the default (as determined by kubelet) DNS settings."None"
indicates that the pod should use empty DNS settings. DNS parameters such as nameservers and search paths should be defined via DNSConfig.
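For example, a Pod spec fragment that sets dnsPolicy to None and supplies the configuration itself; the nameserver address and search domain are placeholders:
spec:
  dnsPolicy: "None"
  dnsConfig:
    nameservers:
    - 192.0.2.1                # placeholder nameserver IP
    searches:
    - ns1.svc.cluster-domain.example
    options:
    - name: ndots
      value: "2"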
Hosts namespaces
-
hostNetwork (boolean)
Host networking requested for this pod. Use the host's network namespace. If this option is set, the ports that will be used must be specified. Default to false.
-
hostPID (boolean)
Use the host's pid namespace. Optional: Default to false.
-
hostIPC (boolean)
Use the host's ipc namespace. Optional: Default to false.
-
shareProcessNamespace (boolean)
Share a single process namespace between all of the containers in a pod. When this is set containers will be able to view and signal processes from other containers in the same pod, and the first process in each container will not be assigned PID 1. HostPID and ShareProcessNamespace cannot both be set. Optional: Default to false.
Service account
-
serviceAccountName (string)
ServiceAccountName is the name of the ServiceAccount to use to run this pod. More info: https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/
-
automountServiceAccountToken (boolean)
AutomountServiceAccountToken indicates whether a service account token should be automatically mounted.
Security context
-
securityContext (PodSecurityContext)
SecurityContext holds pod-level security attributes and common container settings. Optional: Defaults to empty. See type description for default values of each field.
PodSecurityContext holds pod-level security attributes and common container settings. Some fields are also present in container.securityContext. Field values of container.securityContext take precedence over field values of PodSecurityContext.
-
securityContext.runAsUser (int64)
The UID to run the entrypoint of the container process. Defaults to user specified in image metadata if unspecified. May also be set in SecurityContext. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence for that container. Note that this field cannot be set when spec.os.name is windows.
-
securityContext.runAsNonRoot (boolean)
Indicates that the container must run as a non-root user. If true, the Kubelet will validate the image at runtime to ensure that it does not run as UID 0 (root) and fail to start the container if it does. If unset or false, no such validation will be performed. May also be set in SecurityContext. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence.
-
securityContext.runAsGroup (int64)
The GID to run the entrypoint of the container process. Uses runtime default if unset. May also be set in SecurityContext. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence for that container. Note that this field cannot be set when spec.os.name is windows.
-
securityContext.supplementalGroups ([]int64)
A list of groups applied to the first process run in each container, in addition to the container's primary GID. If unspecified, no groups will be added to any container. Note that this field cannot be set when spec.os.name is windows.
-
securityContext.fsGroup (int64)
A special supplemental group that applies to all containers in a pod. Some volume types allow the Kubelet to change the ownership of that volume to be owned by the pod:
1. The owning GID will be the FSGroup
2. The setgid bit is set (new files created in the volume will be owned by FSGroup)
3. The permission bits are OR'd with rw-rw----
If unset, the Kubelet will not modify the ownership and permissions of any volume. Note that this field cannot be set when spec.os.name is windows.
-
securityContext.fsGroupChangePolicy (string)
fsGroupChangePolicy defines behavior of changing ownership and permission of the volume before being exposed inside Pod. This field will only apply to volume types which support fsGroup based ownership(and permissions). It will have no effect on ephemeral volume types such as: secret, configmaps and emptydir. Valid values are "OnRootMismatch" and "Always". If not specified, "Always" is used. Note that this field cannot be set when spec.os.name is windows.
-
securityContext.seccompProfile (SeccompProfile)
The seccomp options to use by the containers in this pod. Note that this field cannot be set when spec.os.name is windows.
SeccompProfile defines a pod/container's seccomp profile settings. Only one profile source may be set.
-
securityContext.seccompProfile.type (string), required
type indicates which kind of seccomp profile will be applied. Valid options are:
Localhost - a profile defined in a file on the node should be used. RuntimeDefault - the container runtime default profile should be used. Unconfined - no profile should be applied.
Possible enum values:
"Localhost"
indicates a profile defined in a file on the node should be used. The file's location relative to <kubelet-root-dir>/seccomp."RuntimeDefault"
represents the default container runtime seccomp profile."Unconfined"
indicates no seccomp profile is applied (A.K.A. unconfined).
-
securityContext.seccompProfile.localhostProfile (string)
localhostProfile indicates a profile defined in a file on the node should be used. The profile must be preconfigured on the node to work. Must be a descending path, relative to the kubelet's configured seccomp profile location. Must only be set if type is "Localhost".
-
-
securityContext.seLinuxOptions (SELinuxOptions)
The SELinux context to be applied to all containers. If unspecified, the container runtime will allocate a random SELinux context for each container. May also be set in SecurityContext. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence for that container. Note that this field cannot be set when spec.os.name is windows.
SELinuxOptions are the labels to be applied to the container
-
securityContext.seLinuxOptions.level (string)
Level is SELinux level label that applies to the container.
-
securityContext.seLinuxOptions.role (string)
Role is a SELinux role label that applies to the container.
-
securityContext.seLinuxOptions.type (string)
Type is a SELinux type label that applies to the container.
-
securityContext.seLinuxOptions.user (string)
User is a SELinux user label that applies to the container.
-
-
securityContext.sysctls ([]Sysctl)
Sysctls hold a list of namespaced sysctls used for the pod. Pods with unsupported sysctls (by the container runtime) might fail to launch. Note that this field cannot be set when spec.os.name is windows.
Sysctl defines a kernel parameter to be set
-
securityContext.sysctls.name (string), required
Name of a property to set
-
securityContext.sysctls.value (string), required
Value of a property to set
-
-
securityContext.windowsOptions (WindowsSecurityContextOptions)
The Windows specific settings applied to all containers. If unspecified, the options within a container's SecurityContext will be used. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence. Note that this field cannot be set when spec.os.name is linux.
WindowsSecurityContextOptions contain Windows-specific options and credentials.
-
securityContext.windowsOptions.gmsaCredentialSpec (string)
GMSACredentialSpec is where the GMSA admission webhook (https://github.com/kubernetes-sigs/windows-gmsa) inlines the contents of the GMSA credential spec named by the GMSACredentialSpecName field.
-
securityContext.windowsOptions.gmsaCredentialSpecName (string)
GMSACredentialSpecName is the name of the GMSA credential spec to use.
-
securityContext.windowsOptions.hostProcess (boolean)
HostProcess determines if a container should be run as a 'Host Process' container. This field is alpha-level and will only be honored by components that enable the WindowsHostProcessContainers feature flag. Setting this field without the feature flag will result in errors when validating the Pod. All of a Pod's containers must have the same effective HostProcess value (it is not allowed to have a mix of HostProcess containers and non-HostProcess containers). In addition, if HostProcess is true then HostNetwork must also be set to true.
-
securityContext.windowsOptions.runAsUserName (string)
The UserName in Windows to run the entrypoint of the container process. Defaults to the user specified in image metadata if unspecified. May also be set in PodSecurityContext. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence.
-
-
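A sketch of a pod-level securityContext touching several of the fields above; the UIDs and GIDs are illustrative:
spec:
  securityContext:
    runAsUser: 1000            # container entrypoints run as UID 1000
    runAsGroup: 3000           # and GID 3000
    fsGroup: 2000              # supported volumes are made group-owned by GID 2000
    runAsNonRoot: true         # kubelet refuses to start containers running as root
    seccompProfile:
      type: RuntimeDefault     # use the container runtime's default seccomp profile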
Beta level
-
ephemeralContainers ([]EphemeralContainer)
Patch strategy: merge on key
name
List of ephemeral containers run in this pod. Ephemeral containers may be run in an existing pod to perform user-initiated actions such as debugging. This list cannot be specified when creating a pod, and it cannot be modified by updating the pod spec. In order to add an ephemeral container to an existing pod, use the pod's ephemeralcontainers subresource. This field is beta-level and available on clusters that haven't disabled the EphemeralContainers feature gate.
-
preemptionPolicy (string)
PreemptionPolicy is the Policy for preempting pods with lower priority. One of Never, PreemptLowerPriority. Defaults to PreemptLowerPriority if unset. This field is beta-level, gated by the NonPreemptingPriority feature-gate.
-
overhead (map[string]Quantity)
Overhead represents the resource overhead associated with running a pod for a given RuntimeClass. This field will be autopopulated at admission time by the RuntimeClass admission controller. If the RuntimeClass admission controller is enabled, overhead must not be set in Pod create requests. The RuntimeClass admission controller will reject Pod create requests which have the overhead already set. If RuntimeClass is configured and selected in the PodSpec, Overhead will be set to the value defined in the corresponding RuntimeClass, otherwise it will remain unset and treated as zero. More info: https://git.k8s.io/enhancements/keps/sig-node/688-pod-overhead/README.md This field is beta-level as of Kubernetes v1.18, and is only honored by servers that enable the PodOverhead feature.
Deprecated
-
serviceAccount (string)
DeprecatedServiceAccount is a deprecated alias for ServiceAccountName. Deprecated: Use serviceAccountName instead.
Container
A single application container that you want to run within a pod.
-
name (string), required
Name of the container specified as a DNS_LABEL. Each container in a pod must have a unique name (DNS_LABEL). Cannot be updated.
Image
-
image (string)
Docker image name. More info: https://kubernetes.io/docs/concepts/containers/images This field is optional to allow higher level config management to default or override container images in workload controllers like Deployments and StatefulSets.
-
imagePullPolicy (string)
Image pull policy. One of Always, Never, IfNotPresent. Defaults to Always if :latest tag is specified, or IfNotPresent otherwise. Cannot be updated. More info: https://kubernetes.io/docs/concepts/containers/images#updating-images
Possible enum values:
"Always"
means that kubelet always attempts to pull the latest image. Container will fail If the pull fails."IfNotPresent"
means that kubelet pulls if the image isn't present on disk. Container will fail if the image isn't present and the pull fails."Never"
means that kubelet never pulls an image, but only uses a local image. Container will fail if the image isn't present
Entrypoint
-
command ([]string)
Entrypoint array. Not executed within a shell. The docker image's ENTRYPOINT is used if this is not provided. Variable references $(VAR_NAME) are expanded using the container's environment. If a variable cannot be resolved, the reference in the input string will be unchanged. Double $$ are reduced to a single $, which allows for escaping the $(VAR_NAME) syntax: i.e. "$$(VAR_NAME)" will produce the string literal "$(VAR_NAME)". Escaped references will never be expanded, regardless of whether the variable exists or not. Cannot be updated. More info: https://kubernetes.io/docs/tasks/inject-data-application/define-command-argument-container/#running-a-command-in-a-shell
-
args ([]string)
Arguments to the entrypoint. The docker image's CMD is used if this is not provided. Variable references $(VAR_NAME) are expanded using the container's environment. If a variable cannot be resolved, the reference in the input string will be unchanged. Double $$ are reduced to a single $, which allows for escaping the $(VAR_NAME) syntax: i.e. "$$(VAR_NAME)" will produce the string literal "$(VAR_NAME)". Escaped references will never be expanded, regardless of whether the variable exists or not. Cannot be updated. More info: https://kubernetes.io/docs/tasks/inject-data-application/define-command-argument-container/#running-a-command-in-a-shell
-
workingDir (string)
Container's working directory. If not specified, the container runtime's default will be used, which might be configured in the container image. Cannot be updated.
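As a sketch of the expansion rules above, note how $(VAR) expands while $$(VAR) escapes to a literal; the container name and image are illustrative:
containers:
- name: demo
  image: busybox
  env:
  - name: MESSAGE
    value: "hello"
  command: ["/bin/echo"]
  args: ["$(MESSAGE)", "$$(MESSAGE)"]   # prints: hello $(MESSAGE)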
Ports
-
ports ([]ContainerPort)
Patch strategy: merge on key
containerPort
Map: unique values on keys containerPort, protocol will be kept during a merge.
List of ports to expose from the container. Exposing a port here gives the system additional information about the network connections a container uses, but is primarily informational. Not specifying a port here DOES NOT prevent that port from being exposed. Any port which is listening on the default "0.0.0.0" address inside a container will be accessible from the network. Cannot be updated.
ContainerPort represents a network port in a single container.
-
ports.containerPort (int32), required
Number of port to expose on the pod's IP address. This must be a valid port number, 0 < x < 65536.
-
ports.hostIP (string)
What host IP to bind the external port to.
-
ports.hostPort (int32)
Number of port to expose on the host. If specified, this must be a valid port number, 0 < x < 65536. If HostNetwork is specified, this must match ContainerPort. Most containers do not need this.
-
ports.name (string)
If specified, this must be an IANA_SVC_NAME and unique within the pod. Each named port in a pod must have a unique name. Name for the port that can be referred to by services.
-
ports.protocol (string)
Protocol for port. Must be UDP, TCP, or SCTP. Defaults to "TCP".
Possible enum values:
"SCTP"
is the SCTP protocol."TCP"
is the TCP protocol."UDP"
is the UDP protocol.
-
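For example, a named container port that a Service can reference by name (names are illustrative):
containers:
- name: web
  image: nginx
  ports:
  - name: http               # Services can target this port as "http"
    containerPort: 80
    protocol: TCP            # TCP is the default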
Environment variables
-
env ([]EnvVar)
Patch strategy: merge on key
name
List of environment variables to set in the container. Cannot be updated.
EnvVar represents an environment variable present in a Container.
-
env.name (string), required
Name of the environment variable. Must be a C_IDENTIFIER.
-
env.value (string)
Variable references $(VAR_NAME) are expanded using the previously defined environment variables in the container and any service environment variables. If a variable cannot be resolved, the reference in the input string will be unchanged. Double $$ are reduced to a single $, which allows for escaping the $(VAR_NAME) syntax: i.e. "$$(VAR_NAME)" will produce the string literal "$(VAR_NAME)". Escaped references will never be expanded, regardless of whether the variable exists or not. Defaults to "".
-
env.valueFrom (EnvVarSource)
Source for the environment variable's value. Cannot be used if value is not empty.
EnvVarSource represents a source for the value of an EnvVar.
-
env.valueFrom.configMapKeyRef (ConfigMapKeySelector)
Selects a key of a ConfigMap.
Selects a key from a ConfigMap.
-
env.valueFrom.configMapKeyRef.key (string), required
The key to select.
-
env.valueFrom.configMapKeyRef.name (string)
Name of the referent. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names
-
env.valueFrom.configMapKeyRef.optional (boolean)
Specify whether the ConfigMap or its key must be defined
-
-
env.valueFrom.fieldRef (ObjectFieldSelector)
Selects a field of the pod: supports metadata.name, metadata.namespace, metadata.labels['<KEY>'], metadata.annotations['<KEY>'], spec.nodeName, spec.serviceAccountName, status.hostIP, status.podIP, status.podIPs.
-
env.valueFrom.resourceFieldRef (ResourceFieldSelector)
Selects a resource of the container: only resources limits and requests (limits.cpu, limits.memory, limits.ephemeral-storage, requests.cpu, requests.memory and requests.ephemeral-storage) are currently supported.
-
env.valueFrom.secretKeyRef (SecretKeySelector)
Selects a key of a secret in the pod's namespace
SecretKeySelector selects a key of a Secret.
-
env.valueFrom.secretKeyRef.key (string), required
The key of the secret to select from. Must be a valid secret key.
-
env.valueFrom.secretKeyRef.name (string)
Name of the referent. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names
-
env.valueFrom.secretKeyRef.optional (boolean)
Specify whether the Secret or its key must be defined
-
-
-
-
envFrom ([]EnvFromSource)
List of sources to populate environment variables in the container. The keys defined within a source must be a C_IDENTIFIER. All invalid keys will be reported as an event when the container is starting. When a key exists in multiple sources, the value associated with the last source will take precedence. Values defined by an Env with a duplicate key will take precedence. Cannot be updated.
EnvFromSource represents the source of a set of ConfigMaps
-
envFrom.configMapRef (ConfigMapEnvSource)
The ConfigMap to select from
ConfigMapEnvSource selects a ConfigMap to populate the environment variables with. The contents of the target ConfigMap's Data field will represent the key-value pairs as environment variables.
-
envFrom.configMapRef.name (string)
Name of the referent. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names
-
envFrom.configMapRef.optional (boolean)
Specify whether the ConfigMap must be defined
-
-
envFrom.prefix (string)
An optional identifier to prepend to each key in the ConfigMap. Must be a C_IDENTIFIER.
-
envFrom.secretRef (SecretEnvSource)
The Secret to select from
SecretEnvSource selects a Secret to populate the environment variables with. The contents of the target Secret's Data field will represent the key-value pairs as environment variables.
-
envFrom.secretRef.name (string)
Name of the referent. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names
-
envFrom.secretRef.optional (boolean)
Specify whether the Secret must be defined
-
-
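A sketch combining the environment sources above; the ConfigMap and Secret names are placeholders:
env:
- name: LOG_LEVEL
  valueFrom:
    configMapKeyRef:           # read one key from a ConfigMap
      name: app-config
      key: log-level
- name: DB_PASSWORD
  valueFrom:
    secretKeyRef:              # read one key from a Secret
      name: db-secret
      key: password
- name: POD_NAME
  valueFrom:
    fieldRef:                  # downward API: the Pod's own name
      fieldPath: metadata.name
envFrom:
- prefix: CONF_                # import every key from app-config, prefixed
  configMapRef:
    name: app-config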
Volumes
-
volumeMounts ([]VolumeMount)
Patch strategy: merge on key
mountPath
Pod volumes to mount into the container's filesystem. Cannot be updated.
VolumeMount describes a mounting of a Volume within a container.
-
volumeMounts.mountPath (string), required
Path within the container at which the volume should be mounted. Must not contain ':'.
-
volumeMounts.name (string), required
This must match the Name of a Volume.
-
volumeMounts.mountPropagation (string)
mountPropagation determines how mounts are propagated from the host to container and the other way around. When not set, MountPropagationNone is used. This field is beta in 1.10.
-
volumeMounts.readOnly (boolean)
Mounted read-only if true, read-write otherwise (false or unspecified). Defaults to false.
-
volumeMounts.subPath (string)
Path within the volume from which the container's volume should be mounted. Defaults to "" (volume's root).
-
volumeMounts.subPathExpr (string)
Expanded path within the volume from which the container's volume should be mounted. Behaves similarly to SubPath but environment variable references $(VAR_NAME) are expanded using the container's environment. Defaults to "" (volume's root). SubPathExpr and SubPath are mutually exclusive.
-
-
volumeDevices ([]VolumeDevice)
Patch strategy: merge on key
devicePath
volumeDevices is the list of block devices to be used by the container.
volumeDevice describes a mapping of a raw block device within a container.
-
volumeDevices.devicePath (string), required
devicePath is the path inside of the container that the device will be mapped to.
-
volumeDevices.name (string), required
name must match the name of a persistentVolumeClaim in the pod
-
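For example, mounting a ConfigMap-backed volume read-only at a subpath; the names and paths are illustrative:
spec:
  containers:
  - name: app
    image: nginx
    volumeMounts:
    - name: config
      mountPath: /etc/app      # where the volume appears in the container
      readOnly: true
      subPath: app.conf        # mount a single file from the volume
  volumes:
  - name: config
    configMap:
      name: app-config         # placeholder ConfigMap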
Resources
-
resources (ResourceRequirements)
Compute Resources required by this container. Cannot be updated. More info: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
ResourceRequirements describes the compute resource requirements.
-
resources.limits (map[string]Quantity)
Limits describes the maximum amount of compute resources allowed. More info: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
-
resources.requests (map[string]Quantity)
Requests describes the minimum amount of compute resources required. If Requests is omitted for a container, it defaults to Limits if that is explicitly specified, otherwise to an implementation-defined value. More info: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
-
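A typical requests/limits fragment; the quantities are illustrative:
resources:
  requests:                    # what the scheduler reserves for the container
    cpu: "250m"
    memory: "64Mi"
  limits:                      # hard caps enforced at runtime
    cpu: "500m"
    memory: "128Mi"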
Lifecycle
-
lifecycle (Lifecycle)
Actions that the management system should take in response to container lifecycle events. Cannot be updated.
Lifecycle describes actions that the management system should take in response to container lifecycle events. For the PostStart and PreStop lifecycle handlers, management of the container blocks until the action is complete, unless the container process fails, in which case the handler is aborted.
-
lifecycle.postStart (LifecycleHandler)
PostStart is called immediately after a container is created. If the handler fails, the container is terminated and restarted according to its restart policy. Other management of the container blocks until the hook completes. More info: https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/#container-hooks
-
lifecycle.preStop (LifecycleHandler)
PreStop is called immediately before a container is terminated due to an API request or management event such as liveness/startup probe failure, preemption, resource contention, etc. The handler is not called if the container crashes or exits. The Pod's termination grace period countdown begins before the PreStop hook is executed. Regardless of the outcome of the handler, the container will eventually terminate within the Pod's termination grace period (unless delayed by finalizers). Other management of the container blocks until the hook completes or until the termination grace period is reached. More info: https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/#container-hooks
-
-
terminationMessagePath (string)
Optional: Path at which the file to which the container's termination message will be written is mounted into the container's filesystem. Message written is intended to be brief final status, such as an assertion failure message. Will be truncated by the node if greater than 4096 bytes. The total message length across all containers will be limited to 12kb. Defaults to /dev/termination-log. Cannot be updated.
-
terminationMessagePolicy (string)
Indicate how the termination message should be populated. File will use the contents of terminationMessagePath to populate the container status message on both success and failure. FallbackToLogsOnError will use the last chunk of container log output if the termination message file is empty and the container exited with an error. The log output is limited to 2048 bytes or 80 lines, whichever is smaller. Defaults to File. Cannot be updated.
Possible enum values:
"FallbackToLogsOnError"
will read the most recent contents of the container logs for the container status message when the container exits with an error and the terminationMessagePath has no contents."File"
is the default behavior and will set the container status message to the contents of the container's terminationMessagePath when the container exits.
-
livenessProbe (Probe)
Periodic probe of container liveness. Container will be restarted if the probe fails. Cannot be updated. More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle#container-probes
-
readinessProbe (Probe)
Periodic probe of container service readiness. Container will be removed from service endpoints if the probe fails. Cannot be updated. More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle#container-probes
-
startupProbe (Probe)
StartupProbe indicates that the Pod has successfully initialized. If specified, no other probes are executed until this completes successfully. If this probe fails, the Pod will be restarted, just as if the livenessProbe failed. This can be used to provide different probe parameters at the beginning of a Pod's lifecycle, when it might take a long time to load data or warm a cache, than during steady-state operation. This cannot be updated. More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle#container-probes
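A sketch wiring a startup probe ahead of a liveness probe, with a preStop hook; the image, endpoint, port, and timings are illustrative:
containers:
- name: app
  image: registry.example/app:1.0        # placeholder image
  lifecycle:
    preStop:
      exec:
        command: ["/bin/sh", "-c", "sleep 5"]   # give connections time to drain
  startupProbe:
    httpGet:
      path: /healthz
      port: 8080
    failureThreshold: 30       # allow up to 30 * 10s = 5 minutes to start
    periodSeconds: 10
  livenessProbe:               # runs only after the startup probe succeeds
    httpGet:
      path: /healthz
      port: 8080
    periodSeconds: 10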
Security Context
-
securityContext (SecurityContext)
SecurityContext defines the security options the container should be run with. If set, the fields of SecurityContext override the equivalent fields of PodSecurityContext. More info: https://kubernetes.io/docs/tasks/configure-pod-container/security-context/
SecurityContext holds security configuration that will be applied to a container. Some fields are present in both SecurityContext and PodSecurityContext. When both are set, the values in SecurityContext take precedence.
-
securityContext.runAsUser (int64)
The UID to run the entrypoint of the container process. Defaults to user specified in image metadata if unspecified. May also be set in PodSecurityContext. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence. Note that this field cannot be set when spec.os.name is windows.
-
securityContext.runAsNonRoot (boolean)
Indicates that the container must run as a non-root user. If true, the Kubelet will validate the image at runtime to ensure that it does not run as UID 0 (root) and fail to start the container if it does. If unset or false, no such validation will be performed. May also be set in PodSecurityContext. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence.
-
securityContext.runAsGroup (int64)
The GID to run the entrypoint of the container process. Uses runtime default if unset. May also be set in PodSecurityContext. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence. Note that this field cannot be set when spec.os.name is windows.
-
securityContext.readOnlyRootFilesystem (boolean)
Whether this container has a read-only root filesystem. Default is false. Note that this field cannot be set when spec.os.name is windows.
-
securityContext.procMount (string)
procMount denotes the type of proc mount to use for the containers. The default is DefaultProcMount which uses the container runtime defaults for readonly paths and masked paths. This requires the ProcMountType feature flag to be enabled. Note that this field cannot be set when spec.os.name is windows.
-
securityContext.privileged (boolean)
Run container in privileged mode. Processes in privileged containers are essentially equivalent to root on the host. Defaults to false. Note that this field cannot be set when spec.os.name is windows.
-
securityContext.allowPrivilegeEscalation (boolean)
AllowPrivilegeEscalation controls whether a process can gain more privileges than its parent process. This bool directly controls if the no_new_privs flag will be set on the container process. AllowPrivilegeEscalation is true always when the container is: 1) run as Privileged 2) has CAP_SYS_ADMIN Note that this field cannot be set when spec.os.name is windows.
-
securityContext.capabilities (Capabilities)
The capabilities to add/drop when running containers. Defaults to the default set of capabilities granted by the container runtime. Note that this field cannot be set when spec.os.name is windows.
Adds and removes POSIX capabilities from running containers.
-
securityContext.capabilities.add ([]string)
Added capabilities
-
securityContext.capabilities.drop ([]string)
Removed capabilities
-
-
securityContext.seccompProfile (SeccompProfile)
The seccomp options to use by this container. If seccomp options are provided at both the pod & container level, the container options override the pod options. Note that this field cannot be set when spec.os.name is windows.
SeccompProfile defines a pod/container's seccomp profile settings. Only one profile source may be set.
-
securityContext.seccompProfile.type (string), required
type indicates which kind of seccomp profile will be applied. Valid options are:
- Localhost - a profile defined in a file on the node should be used.
- RuntimeDefault - the container runtime default profile should be used.
- Unconfined - no profile should be applied.
Possible enum values:
- "Localhost" indicates a profile defined in a file on the node should be used. The file's location is relative to <kubelet-root-dir>/seccomp.
- "RuntimeDefault" represents the default container runtime seccomp profile.
- "Unconfined" indicates no seccomp profile is applied (a.k.a. unconfined).
-
securityContext.seccompProfile.localhostProfile (string)
localhostProfile indicates a profile defined in a file on the node should be used. The profile must be preconfigured on the node to work. Must be a descending path, relative to the kubelet's configured seccomp profile location. Must only be set if type is "Localhost".
-
-
securityContext.seLinuxOptions (SELinuxOptions)
The SELinux context to be applied to the container. If unspecified, the container runtime will allocate a random SELinux context for each container. May also be set in PodSecurityContext. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence. Note that this field cannot be set when spec.os.name is windows.
SELinuxOptions are the labels to be applied to the container
-
securityContext.seLinuxOptions.level (string)
Level is the SELinux level label that applies to the container.
-
securityContext.seLinuxOptions.role (string)
Role is a SELinux role label that applies to the container.
-
securityContext.seLinuxOptions.type (string)
Type is a SELinux type label that applies to the container.
-
securityContext.seLinuxOptions.user (string)
User is a SELinux user label that applies to the container.
-
-
securityContext.windowsOptions (WindowsSecurityContextOptions)
The Windows specific settings applied to all containers. If unspecified, the options from the PodSecurityContext will be used. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence. Note that this field cannot be set when spec.os.name is linux.
WindowsSecurityContextOptions contain Windows-specific options and credentials.
-
securityContext.windowsOptions.gmsaCredentialSpec (string)
GMSACredentialSpec is where the GMSA admission webhook (https://github.com/kubernetes-sigs/windows-gmsa) inlines the contents of the GMSA credential spec named by the GMSACredentialSpecName field.
-
securityContext.windowsOptions.gmsaCredentialSpecName (string)
GMSACredentialSpecName is the name of the GMSA credential spec to use.
-
securityContext.windowsOptions.hostProcess (boolean)
HostProcess determines if a container should be run as a 'Host Process' container. This field is alpha-level and will only be honored by components that enable the WindowsHostProcessContainers feature flag. Setting this field without the feature flag will result in errors when validating the Pod. All of a Pod's containers must have the same effective HostProcess value (it is not allowed to have a mix of HostProcess containers and non-HostProcess containers). In addition, if HostProcess is true then HostNetwork must also be set to true.
-
securityContext.windowsOptions.runAsUserName (string)
The UserName in Windows to run the entrypoint of the container process. Defaults to the user specified in image metadata if unspecified. May also be set in PodSecurityContext. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence.
-
-
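Taken together, these fields are commonly combined into a restrictive container-level securityContext. The following is a minimal sketch of a fragment of a Pod spec; the container name, image, and UID/GID values are illustrative assumptions:

containers:
- name: app                           # hypothetical container
  image: registry.example/app:1.0     # hypothetical image
  securityContext:
    runAsNonRoot: true                # kubelet refuses to start the container as UID 0
    runAsUser: 10001                  # illustrative non-root UID
    runAsGroup: 10001                 # illustrative GID
    readOnlyRootFilesystem: true
    allowPrivilegeEscalation: false   # sets no_new_privs on the container process
    capabilities:
      drop: ["ALL"]                   # drop all POSIX capabilities
    seccompProfile:
      type: RuntimeDefault            # use the container runtime's default profile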
Debugging
-
stdin (boolean)
Whether this container should allocate a buffer for stdin in the container runtime. If this is not set, reads from stdin in the container will always result in EOF. Default is false.
-
stdinOnce (boolean)
Whether the container runtime should close the stdin channel after it has been opened by a single attach. When stdin is true, the stdin stream will remain open across multiple attach sessions. If stdinOnce is set to true, stdin is opened on container start, is empty until the first client attaches to stdin, and then remains open and accepts data until the client disconnects, at which time stdin is closed and remains closed until the container is restarted. If this flag is false, a container process that reads from stdin will never receive an EOF. Default is false.
-
tty (boolean)
Whether this container should allocate a TTY for itself, also requires 'stdin' to be true. Default is false.
EphemeralContainer
An EphemeralContainer is a temporary container that you may add to an existing Pod for user-initiated activities such as debugging. Ephemeral containers have no resource or scheduling guarantees, and they will not be restarted when they exit or when a Pod is removed or restarted. The kubelet may evict a Pod if an ephemeral container causes the Pod to exceed its resource allocation.
To add an ephemeral container, use the ephemeralcontainers subresource of an existing Pod. Ephemeral containers may not be removed or restarted.
This is a beta feature available on clusters that haven't disabled the EphemeralContainers feature gate.
-
name (string), required
Name of the ephemeral container specified as a DNS_LABEL. This name must be unique among all containers, init containers and ephemeral containers.
-
targetContainerName (string)
If set, the name of the container from PodSpec that this ephemeral container targets. The ephemeral container will be run in the namespaces (IPC, PID, etc) of this container. If not set then the ephemeral container uses the namespaces configured in the Pod spec.
The container runtime must implement support for this feature. If the runtime does not support namespace targeting then the result of setting this field is undefined.
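For example, kubectl debug creates an ephemeral container through the ephemeralcontainers subresource; here the pod name my-pod and the target container name app are illustrative:

kubectl debug -it my-pod --image=busybox:1.36 --target=app

With --target set, the debug container runs in the process namespace of the app container, provided the container runtime supports namespace targeting.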
Image
-
image (string)
Docker image name. More info: https://kubernetes.io/docs/concepts/containers/images
-
imagePullPolicy (string)
Image pull policy. One of Always, Never, IfNotPresent. Defaults to Always if :latest tag is specified, or IfNotPresent otherwise. Cannot be updated. More info: https://kubernetes.io/docs/concepts/containers/images#updating-images
Possible enum values:
- "Always" means that the kubelet always attempts to pull the latest image. The container will fail if the pull fails.
- "IfNotPresent" means that the kubelet pulls the image only if it isn't already present on disk. The container will fail if the image isn't present and the pull fails.
- "Never" means that the kubelet never pulls an image and only uses a local image. The container will fail if the image isn't present.
Entrypoint
-
command ([]string)
Entrypoint array. Not executed within a shell. The docker image's ENTRYPOINT is used if this is not provided. Variable references $(VAR_NAME) are expanded using the container's environment. If a variable cannot be resolved, the reference in the input string will be unchanged. Double $$ are reduced to a single $, which allows for escaping the $(VAR_NAME) syntax: i.e. "$$(VAR_NAME)" will produce the string literal "$(VAR_NAME)". Escaped references will never be expanded, regardless of whether the variable exists or not. Cannot be updated. More info: https://kubernetes.io/docs/tasks/inject-data-application/define-command-argument-container/#running-a-command-in-a-shell
-
args ([]string)
Arguments to the entrypoint. The docker image's CMD is used if this is not provided. Variable references $(VAR_NAME) are expanded using the container's environment. If a variable cannot be resolved, the reference in the input string will be unchanged. Double $$ are reduced to a single $, which allows for escaping the $(VAR_NAME) syntax: i.e. "$$(VAR_NAME)" will produce the string literal "$(VAR_NAME)". Escaped references will never be expanded, regardless of whether the variable exists or not. Cannot be updated. More info: https://kubernetes.io/docs/tasks/inject-data-application/define-command-argument-container/#running-a-command-in-a-shell
-
workingDir (string)
Container's working directory. If not specified, the container runtime's default will be used, which might be configured in the container image. Cannot be updated.
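A minimal sketch of the expansion rules described above; the variable name and image are illustrative assumptions:

containers:
- name: demo                  # hypothetical container
  image: busybox:1.36
  env:
  - name: GREETING
    value: hello
  command: ["/bin/echo"]
  # Expands to two arguments: "hello" and the literal string "$(GREETING)",
  # because $$ escapes the $(VAR_NAME) syntax.
  args: ["$(GREETING)", "$$(GREETING)"]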
Environment variables
-
env ([]EnvVar)
Patch strategy: merge on key name
List of environment variables to set in the container. Cannot be updated.
EnvVar represents an environment variable present in a Container.
-
env.name (string), required
Name of the environment variable. Must be a C_IDENTIFIER.
-
env.value (string)
Variable references $(VAR_NAME) are expanded using the previously defined environment variables in the container and any service environment variables. If a variable cannot be resolved, the reference in the input string will be unchanged. Double $$ are reduced to a single $, which allows for escaping the $(VAR_NAME) syntax: i.e. "$$(VAR_NAME)" will produce the string literal "$(VAR_NAME)". Escaped references will never be expanded, regardless of whether the variable exists or not. Defaults to "".
-
env.valueFrom (EnvVarSource)
Source for the environment variable's value. Cannot be used if value is not empty.
EnvVarSource represents a source for the value of an EnvVar.
-
env.valueFrom.configMapKeyRef (ConfigMapKeySelector)
Selects a key of a ConfigMap.
Selects a key from a ConfigMap.
-
env.valueFrom.configMapKeyRef.key (string), required
The key to select.
-
env.valueFrom.configMapKeyRef.name (string)
Name of the referent. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names
-
env.valueFrom.configMapKeyRef.optional (boolean)
Specify whether the ConfigMap or its key must be defined
-
-
env.valueFrom.fieldRef (ObjectFieldSelector)
Selects a field of the pod: supports metadata.name, metadata.namespace, metadata.labels['<KEY>'], metadata.annotations['<KEY>'], spec.nodeName, spec.serviceAccountName, status.hostIP, status.podIP, status.podIPs.
-
env.valueFrom.resourceFieldRef (ResourceFieldSelector)
Selects a resource of the container: only resources limits and requests (limits.cpu, limits.memory, limits.ephemeral-storage, requests.cpu, requests.memory and requests.ephemeral-storage) are currently supported.
-
env.valueFrom.secretKeyRef (SecretKeySelector)
Selects a key of a secret in the pod's namespace
SecretKeySelector selects a key of a Secret.
-
env.valueFrom.secretKeyRef.key (string), required
The key of the secret to select from. Must be a valid secret key.
-
env.valueFrom.secretKeyRef.name (string)
Name of the referent. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names
-
env.valueFrom.secretKeyRef.optional (boolean)
Specify whether the Secret or its key must be defined
-
-
-
-
envFrom ([]EnvFromSource)
List of sources to populate environment variables in the container. Each key defined within a source must be a C_IDENTIFIER. All invalid keys will be reported as an event when the container is starting. When a key exists in multiple sources, the value associated with the last source takes precedence. Values defined by an Env entry with a duplicate key take precedence over these. Cannot be updated.
EnvFromSource represents the source of a set of ConfigMaps
-
envFrom.configMapRef (ConfigMapEnvSource)
The ConfigMap to select from
ConfigMapEnvSource selects a ConfigMap to populate the environment variables with.
The contents of the target ConfigMap's Data field will represent the key-value pairs as environment variables.
-
envFrom.configMapRef.name (string)
Name of the referent. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names
-
envFrom.configMapRef.optional (boolean)
Specify whether the ConfigMap must be defined
-
-
envFrom.prefix (string)
An optional identifier to prepend to each key in the ConfigMap. Must be a C_IDENTIFIER.
-
envFrom.secretRef (SecretEnvSource)
The Secret to select from
SecretEnvSource selects a Secret to populate the environment variables with.
The contents of the target Secret's Data field will represent the key-value pairs as environment variables.
-
envFrom.secretRef.name (string)
Name of the referent. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names
-
envFrom.secretRef.optional (boolean)
Specify whether the Secret must be defined
-
-
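The sources above can be mixed. In this minimal sketch of a Pod spec fragment (the ConfigMap, Secret, and key names are illustrative assumptions), an explicit env entry with a duplicate key would take precedence over keys imported via envFrom:

containers:
- name: app
  image: registry.example/app:1.0     # hypothetical image
  env:
  - name: DB_HOST                     # plain value
    value: db.example.internal
  - name: DB_PASSWORD                 # from a Secret key
    valueFrom:
      secretKeyRef:
        name: db-credentials          # hypothetical Secret
        key: password
  - name: POD_NAMESPACE               # from the pod's own metadata
    valueFrom:
      fieldRef:
        fieldPath: metadata.namespace
  envFrom:
  - prefix: CONF_                     # every imported key becomes CONF_<key>
    configMapRef:
      name: app-config                # hypothetical ConfigMap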
Volumes
-
volumeMounts ([]VolumeMount)
Patch strategy: merge on key mountPath
Pod volumes to mount into the container's filesystem. Subpath mounts are not allowed for ephemeral containers. Cannot be updated.
VolumeMount describes a mounting of a Volume within a container.
-
volumeMounts.mountPath (string), required
Path within the container at which the volume should be mounted. Must not contain ':'.
-
volumeMounts.name (string), required
This must match the Name of a Volume.
-
volumeMounts.mountPropagation (string)
mountPropagation determines how mounts are propagated from the host to container and the other way around. When not set, MountPropagationNone is used. This field is beta in 1.10.
-
volumeMounts.readOnly (boolean)
Mounted read-only if true, read-write otherwise (false or unspecified). Defaults to false.
-
volumeMounts.subPath (string)
Path within the volume from which the container's volume should be mounted. Defaults to "" (volume's root).
-
volumeMounts.subPathExpr (string)
Expanded path within the volume from which the container's volume should be mounted. Behaves similarly to SubPath but environment variable references $(VAR_NAME) are expanded using the container's environment. Defaults to "" (volume's root). SubPathExpr and SubPath are mutually exclusive.
-
-
volumeDevices ([]VolumeDevice)
Patch strategy: merge on key devicePath
volumeDevices is the list of block devices to be used by the container.
volumeDevice describes a mapping of a raw block device within a container.
-
volumeDevices.devicePath (string), required
devicePath is the path inside of the container that the device will be mapped to.
-
volumeDevices.name (string), required
name must match the name of a persistentVolumeClaim in the pod
-
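A minimal sketch of a volume mounted read-only at a subpath; the volume, path, and ConfigMap names are illustrative assumptions:

containers:
- name: app
  image: registry.example/app:1.0     # hypothetical image
  volumeMounts:
  - name: config                      # must match a volume name below
    mountPath: /etc/app               # must not contain ':'
    readOnly: true
    subPath: production               # mount only this path within the volume
volumes:
- name: config
  configMap:
    name: app-config                  # hypothetical ConfigMap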
Lifecycle
-
terminationMessagePath (string)
Optional: Path at which the file to which the container's termination message will be written is mounted into the container's filesystem. Message written is intended to be brief final status, such as an assertion failure message. Will be truncated by the node if greater than 4096 bytes. The total message length across all containers will be limited to 12kb. Defaults to /dev/termination-log. Cannot be updated.
-
terminationMessagePolicy (string)
Indicate how the termination message should be populated. File will use the contents of terminationMessagePath to populate the container status message on both success and failure. FallbackToLogsOnError will use the last chunk of container log output if the termination message file is empty and the container exited with an error. The log output is limited to 2048 bytes or 80 lines, whichever is smaller. Defaults to File. Cannot be updated.
Possible enum values:
- "FallbackToLogsOnError" will read the most recent contents of the container logs for the container status message when the container exits with an error and the terminationMessagePath has no contents.
- "File" is the default behavior and will set the container status message to the contents of the container's terminationMessagePath when the container exits.
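For example, a container that does not write a termination message file can fall back to its log tail. A minimal sketch, with an illustrative image:

containers:
- name: app
  image: registry.example/app:1.0                   # hypothetical image
  terminationMessagePath: /dev/termination-log      # the default path
  terminationMessagePolicy: FallbackToLogsOnError   # use the last log chunk if the file is empty on error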
Debugging
-
stdin (boolean)
Whether this container should allocate a buffer for stdin in the container runtime. If this is not set, reads from stdin in the container will always result in EOF. Default is false.
-
stdinOnce (boolean)
Whether the container runtime should close the stdin channel after it has been opened by a single attach. When stdin is true, the stdin stream will remain open across multiple attach sessions. If stdinOnce is set to true, stdin is opened on container start, is empty until the first client attaches to stdin, and then remains open and accepts data until the client disconnects, at which time stdin is closed and remains closed until the container is restarted. If this flag is false, a container process that reads from stdin will never receive an EOF. Default is false.
-
tty (boolean)
Whether this container should allocate a TTY for itself, also requires 'stdin' to be true. Default is false.
Security context
-
securityContext (SecurityContext)
Optional: SecurityContext defines the security options the ephemeral container should be run with. If set, the fields of SecurityContext override the equivalent fields of PodSecurityContext.
SecurityContext holds security configuration that will be applied to a container. Some fields are present in both SecurityContext and PodSecurityContext. When both are set, the values in SecurityContext take precedence.
-
securityContext.runAsUser (int64)
The UID to run the entrypoint of the container process. Defaults to user specified in image metadata if unspecified. May also be set in PodSecurityContext. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence. Note that this field cannot be set when spec.os.name is windows.
-
securityContext.runAsNonRoot (boolean)
Indicates that the container must run as a non-root user. If true, the Kubelet will validate the image at runtime to ensure that it does not run as UID 0 (root) and fail to start the container if it does. If unset or false, no such validation will be performed. May also be set in PodSecurityContext. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence.
-
securityContext.runAsGroup (int64)
The GID to run the entrypoint of the container process. Uses runtime default if unset. May also be set in PodSecurityContext. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence. Note that this field cannot be set when spec.os.name is windows.
-
securityContext.readOnlyRootFilesystem (boolean)
Whether this container has a read-only root filesystem. Default is false. Note that this field cannot be set when spec.os.name is windows.
-
securityContext.procMount (string)
procMount denotes the type of proc mount to use for the containers. The default is DefaultProcMount which uses the container runtime defaults for readonly paths and masked paths. This requires the ProcMountType feature flag to be enabled. Note that this field cannot be set when spec.os.name is windows.
-
securityContext.privileged (boolean)
Run container in privileged mode. Processes in privileged containers are essentially equivalent to root on the host. Defaults to false. Note that this field cannot be set when spec.os.name is windows.
-
securityContext.allowPrivilegeEscalation (boolean)
AllowPrivilegeEscalation controls whether a process can gain more privileges than its parent process. This bool directly controls whether the no_new_privs flag will be set on the container process. AllowPrivilegeEscalation is always true when the container: 1) is run as privileged, or 2) has CAP_SYS_ADMIN. Note that this field cannot be set when spec.os.name is windows.
-
securityContext.capabilities (Capabilities)
The capabilities to add/drop when running containers. Defaults to the default set of capabilities granted by the container runtime. Note that this field cannot be set when spec.os.name is windows.
Adds and removes POSIX capabilities from running containers.
-
securityContext.capabilities.add ([]string)
Added capabilities
-
securityContext.capabilities.drop ([]string)
Removed capabilities
-
-
securityContext.seccompProfile (SeccompProfile)
The seccomp options to use by this container. If seccomp options are provided at both the pod & container level, the container options override the pod options. Note that this field cannot be set when spec.os.name is windows.
SeccompProfile defines a pod/container's seccomp profile settings. Only one profile source may be set.
-
securityContext.seccompProfile.type (string), required
type indicates which kind of seccomp profile will be applied. Valid options are:
- Localhost - a profile defined in a file on the node should be used.
- RuntimeDefault - the container runtime default profile should be used.
- Unconfined - no profile should be applied.
Possible enum values:
- "Localhost" indicates a profile defined in a file on the node should be used. The file's location is relative to <kubelet-root-dir>/seccomp.
- "RuntimeDefault" represents the default container runtime seccomp profile.
- "Unconfined" indicates no seccomp profile is applied (a.k.a. unconfined).
-
securityContext.seccompProfile.localhostProfile (string)
localhostProfile indicates a profile defined in a file on the node should be used. The profile must be preconfigured on the node to work. Must be a descending path, relative to the kubelet's configured seccomp profile location. Must only be set if type is "Localhost".
-
-
securityContext.seLinuxOptions (SELinuxOptions)
The SELinux context to be applied to the container. If unspecified, the container runtime will allocate a random SELinux context for each container. May also be set in PodSecurityContext. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence. Note that this field cannot be set when spec.os.name is windows.
SELinuxOptions are the labels to be applied to the container
-
securityContext.seLinuxOptions.level (string)
Level is the SELinux level label that applies to the container.
-
securityContext.seLinuxOptions.role (string)
Role is a SELinux role label that applies to the container.
-
securityContext.seLinuxOptions.type (string)
Type is a SELinux type label that applies to the container.
-
securityContext.seLinuxOptions.user (string)
User is a SELinux user label that applies to the container.
-
-
securityContext.windowsOptions (WindowsSecurityContextOptions)
The Windows specific settings applied to all containers. If unspecified, the options from the PodSecurityContext will be used. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence. Note that this field cannot be set when spec.os.name is linux.
WindowsSecurityContextOptions contain Windows-specific options and credentials.
-
securityContext.windowsOptions.gmsaCredentialSpec (string)
GMSACredentialSpec is where the GMSA admission webhook (https://github.com/kubernetes-sigs/windows-gmsa) inlines the contents of the GMSA credential spec named by the GMSACredentialSpecName field.
-
securityContext.windowsOptions.gmsaCredentialSpecName (string)
GMSACredentialSpecName is the name of the GMSA credential spec to use.
-
securityContext.windowsOptions.hostProcess (boolean)
HostProcess determines if a container should be run as a 'Host Process' container. This field is alpha-level and will only be honored by components that enable the WindowsHostProcessContainers feature flag. Setting this field without the feature flag will result in errors when validating the Pod. All of a Pod's containers must have the same effective HostProcess value (it is not allowed to have a mix of HostProcess containers and non-HostProcess containers). In addition, if HostProcess is true then HostNetwork must also be set to true.
-
securityContext.windowsOptions.runAsUserName (string)
The UserName in Windows to run the entrypoint of the container process. Defaults to the user specified in image metadata if unspecified. May also be set in PodSecurityContext. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence.
-
-
Not allowed
-
ports ([]ContainerPort)
Patch strategy: merge on key containerPort
Map: unique values on keys containerPort, protocol will be kept during a merge
Ports are not allowed for ephemeral containers.
ContainerPort represents a network port in a single container.
-
ports.containerPort (int32), required
Number of port to expose on the pod's IP address. This must be a valid port number, 0 < x < 65536.
-
ports.hostIP (string)
What host IP to bind the external port to.
-
ports.hostPort (int32)
Number of port to expose on the host. If specified, this must be a valid port number, 0 < x < 65536. If HostNetwork is specified, this must match ContainerPort. Most containers do not need this.
-
ports.name (string)
If specified, this must be an IANA_SVC_NAME and unique within the pod. Each named port in a pod must have a unique name. Name for the port that can be referred to by services.
-
ports.protocol (string)
Protocol for port. Must be UDP, TCP, or SCTP. Defaults to "TCP".
Possible enum values:
- "SCTP" is the SCTP protocol.
- "TCP" is the TCP protocol.
- "UDP" is the UDP protocol.
-
-
resources (ResourceRequirements)
Resources are not allowed for ephemeral containers. Ephemeral containers use spare resources already allocated to the pod.
ResourceRequirements describes the compute resource requirements.
-
resources.limits (map[string]Quantity)
Limits describes the maximum amount of compute resources allowed. More info: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
-
resources.requests (map[string]Quantity)
Requests describes the minimum amount of compute resources required. If Requests is omitted for a container, it defaults to Limits if that is explicitly specified, otherwise to an implementation-defined value. More info: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
-
-
lifecycle (Lifecycle)
Lifecycle is not allowed for ephemeral containers.
Lifecycle describes actions that the management system should take in response to container lifecycle events. For the PostStart and PreStop lifecycle handlers, management of the container blocks until the action is complete, unless the container process fails, in which case the handler is aborted.
-
lifecycle.postStart (LifecycleHandler)
PostStart is called immediately after a container is created. If the handler fails, the container is terminated and restarted according to its restart policy. Other management of the container blocks until the hook completes. More info: https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/#container-hooks
-
lifecycle.preStop (LifecycleHandler)
PreStop is called immediately before a container is terminated due to an API request or management event such as liveness/startup probe failure, preemption, resource contention, etc. The handler is not called if the container crashes or exits. The Pod's termination grace period countdown begins before the PreStop hook is executed. Regardless of the outcome of the handler, the container will eventually terminate within the Pod's termination grace period (unless delayed by finalizers). Other management of the container blocks until the hook completes or until the termination grace period is reached. More info: https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/#container-hooks
-
-
livenessProbe (Probe)
Probes are not allowed for ephemeral containers.
-
readinessProbe (Probe)
Probes are not allowed for ephemeral containers.
-
startupProbe (Probe)
Probes are not allowed for ephemeral containers.
LifecycleHandler
LifecycleHandler defines a specific action that should be taken in a lifecycle hook. Exactly one of the fields, other than TCPSocket, must be specified.
-
exec (ExecAction)
Exec specifies the action to take.
ExecAction describes a "run in container" action.
-
exec.command ([]string)
Command is the command line to execute inside the container, the working directory for the command is root ('/') in the container's filesystem. The command is simply exec'd, it is not run inside a shell, so traditional shell instructions ('|', etc) won't work. To use a shell, you need to explicitly call out to that shell. Exit status of 0 is treated as live/healthy and non-zero is unhealthy.
-
-
httpGet (HTTPGetAction)
HTTPGet specifies the http request to perform.
HTTPGetAction describes an action based on HTTP Get requests.
-
httpGet.port (IntOrString), required
Name or number of the port to access on the container. Number must be in the range 1 to 65535. Name must be an IANA_SVC_NAME.
IntOrString is a type that can hold an int32 or a string. When used in JSON or YAML marshalling and unmarshalling, it produces or consumes the inner type. This allows you to have, for example, a JSON field that can accept a name or number.
-
httpGet.host (string)
Host name to connect to, defaults to the pod IP. You probably want to set "Host" in httpHeaders instead.
-
httpGet.httpHeaders ([]HTTPHeader)
Custom headers to set in the request. HTTP allows repeated headers.
HTTPHeader describes a custom header to be used in HTTP probes
-
httpGet.httpHeaders.name (string), required
The header field name
-
httpGet.httpHeaders.value (string), required
The header field value
-
-
httpGet.path (string)
Path to access on the HTTP server.
-
httpGet.scheme (string)
Scheme to use for connecting to the host. Defaults to HTTP.
Possible enum values:
- "HTTP" means that the scheme used will be http://
- "HTTPS" means that the scheme used will be https://
-
-
tcpSocket (TCPSocketAction)
Deprecated. TCPSocket is NOT supported as a LifecycleHandler and is kept only for backward compatibility. This field is not validated, and lifecycle hooks will fail at runtime when a TCP handler is specified.
TCPSocketAction describes an action based on opening a socket
-
tcpSocket.port (IntOrString), required
Number or name of the port to access on the container. Number must be in the range 1 to 65535. Name must be an IANA_SVC_NAME.
IntOrString is a type that can hold an int32 or a string. When used in JSON or YAML marshalling and unmarshalling, it produces or consumes the inner type. This allows you to have, for example, a JSON field that can accept a name or number.
-
tcpSocket.host (string)
Optional: Host name to connect to, defaults to the pod IP.
-
NodeAffinity
Node affinity is a group of node affinity scheduling rules.
-
preferredDuringSchedulingIgnoredDuringExecution ([]PreferredSchedulingTerm)
The scheduler will prefer to schedule pods to nodes that satisfy the affinity expressions specified by this field, but it may choose a node that violates one or more of the expressions. The node that is most preferred is the one with the greatest sum of weights, i.e. for each node that meets all of the scheduling requirements (resource request, requiredDuringScheduling affinity expressions, etc.), compute a sum by iterating through the elements of this field and adding "weight" to the sum if the node matches the corresponding matchExpressions; the node(s) with the highest sum are the most preferred.
An empty preferred scheduling term matches all objects with implicit weight 0 (i.e. it's a no-op). A null preferred scheduling term matches no objects (i.e. is also a no-op).
-
preferredDuringSchedulingIgnoredDuringExecution.preference (NodeSelectorTerm), required
A node selector term, associated with the corresponding weight.
A null or empty node selector term matches no objects. The requirements of them are ANDed. The TopologySelectorTerm type implements a subset of the NodeSelectorTerm.
-
preferredDuringSchedulingIgnoredDuringExecution.preference.matchExpressions ([]NodeSelectorRequirement)
A list of node selector requirements by node's labels.
-
preferredDuringSchedulingIgnoredDuringExecution.preference.matchFields ([]NodeSelectorRequirement)
A list of node selector requirements by node's fields.
-
-
preferredDuringSchedulingIgnoredDuringExecution.weight (int32), required
Weight associated with matching the corresponding nodeSelectorTerm, in the range 1-100.
-
-
requiredDuringSchedulingIgnoredDuringExecution (NodeSelector)
If the affinity requirements specified by this field are not met at scheduling time, the pod will not be scheduled onto the node. If the affinity requirements specified by this field cease to be met at some point during pod execution (e.g. due to an update), the system may or may not try to eventually evict the pod from its node.
A node selector represents the union of the results of one or more label queries over a set of nodes; that is, it represents the OR of the selectors represented by the node selector terms.
-
requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms ([]NodeSelectorTerm), required
Required. A list of node selector terms. The terms are ORed.
A null or empty node selector term matches no objects. The requirements of them are ANDed. The TopologySelectorTerm type implements a subset of the NodeSelectorTerm.
-
requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms.matchExpressions ([]NodeSelectorRequirement)
A list of node selector requirements by node's labels.
-
requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms.matchFields ([]NodeSelectorRequirement)
A list of node selector requirements by node's fields.
-
-
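Combining both forms, a pod might require nodes with one label and merely prefer another. A minimal sketch of a Pod spec fragment; the zone value is an illustrative assumption, while kubernetes.io/arch and topology.kubernetes.io/zone are well-known node labels:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:                       # terms are ORed
      - matchExpressions:                      # requirements within a term are ANDed
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 50                               # 1-100; per-node weight sums pick the most preferred node
      preference:
        matchExpressions:
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["zone-a"]                   # illustrative zone name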
PodAffinity
Pod affinity is a group of inter pod affinity scheduling rules.
-
preferredDuringSchedulingIgnoredDuringExecution ([]WeightedPodAffinityTerm)
The scheduler will prefer to schedule pods to nodes that satisfy the affinity expressions specified by this field, but it may choose a node that violates one or more of the expressions. The node that is most preferred is the one with the greatest sum of weights, i.e. for each node that meets all of the scheduling requirements (resource request, requiredDuringScheduling affinity expressions, etc.), compute a sum by iterating through the elements of this field and adding "weight" to the sum if the node has pods which match the corresponding podAffinityTerm; the node(s) with the highest sum are the most preferred.
The weights of all of the matched WeightedPodAffinityTerm fields are added per-node to find the most preferred node(s)
-
preferredDuringSchedulingIgnoredDuringExecution.podAffinityTerm (PodAffinityTerm), required
Required. A pod affinity term, associated with the corresponding weight.
Defines a set of pods (namely those matching the labelSelector relative to the given namespace(s)) that this pod should be co-located (affinity) or not co-located (anti-affinity) with, where co-located is defined as running on a node whose value of the label with key <topologyKey> matches that of any node on which a pod of the set of pods is running.
-
preferredDuringSchedulingIgnoredDuringExecution.podAffinityTerm.topologyKey (string), required
This pod should be co-located (affinity) or not co-located (anti-affinity) with the pods matching the labelSelector in the specified namespaces, where co-located is defined as running on a node whose value of the label with key topologyKey matches that of any node on which any of the selected pods is running. Empty topologyKey is not allowed.
-
preferredDuringSchedulingIgnoredDuringExecution.podAffinityTerm.labelSelector (LabelSelector)
A label query over a set of resources, in this case pods.
-
preferredDuringSchedulingIgnoredDuringExecution.podAffinityTerm.namespaceSelector (LabelSelector)
A label query over the set of namespaces that the term applies to. The term is applied to the union of the namespaces selected by this field and the ones listed in the namespaces field. null selector and null or empty namespaces list means "this pod's namespace". An empty selector ({}) matches all namespaces. This field is beta-level and is only honored when PodAffinityNamespaceSelector feature is enabled.
-
preferredDuringSchedulingIgnoredDuringExecution.podAffinityTerm.namespaces ([]string)
namespaces specifies a static list of namespace names that the term applies to. The term is applied to the union of the namespaces listed in this field and the ones selected by namespaceSelector. null or empty namespaces list and null namespaceSelector means "this pod's namespace"
-
-
preferredDuringSchedulingIgnoredDuringExecution.weight (int32), required
weight associated with matching the corresponding podAffinityTerm, in the range 1-100.
-
-
requiredDuringSchedulingIgnoredDuringExecution ([]PodAffinityTerm)
If the affinity requirements specified by this field are not met at scheduling time, the pod will not be scheduled onto the node. If the affinity requirements specified by this field cease to be met at some point during pod execution (e.g. due to a pod label update), the system may or may not try to eventually evict the pod from its node. When there are multiple elements, the lists of nodes corresponding to each podAffinityTerm are intersected, i.e. all terms must be satisfied.
Defines a set of pods (namely those matching the labelSelector relative to the given namespace(s)) that this pod should be co-located (affinity) or not co-located (anti-affinity) with, where co-located is defined as running on a node whose value of the label with key <topologyKey> matches that of any node on which a pod of the set of pods is running.
-
requiredDuringSchedulingIgnoredDuringExecution.topologyKey (string), required
This pod should be co-located (affinity) or not co-located (anti-affinity) with the pods matching the labelSelector in the specified namespaces, where co-located is defined as running on a node whose value of the label with key topologyKey matches that of any node on which any of the selected pods is running. Empty topologyKey is not allowed.
-
requiredDuringSchedulingIgnoredDuringExecution.labelSelector (LabelSelector)
A label query over a set of resources, in this case pods.
-
requiredDuringSchedulingIgnoredDuringExecution.namespaceSelector (LabelSelector)
A label query over the set of namespaces that the term applies to. The term is applied to the union of the namespaces selected by this field and the ones listed in the namespaces field. null selector and null or empty namespaces list means "this pod's namespace". An empty selector ({}) matches all namespaces. This field is beta-level and is only honored when PodAffinityNamespaceSelector feature is enabled.
-
requiredDuringSchedulingIgnoredDuringExecution.namespaces ([]string)
namespaces specifies a static list of namespace names that the term applies to. The term is applied to the union of the namespaces listed in this field and the ones selected by namespaceSelector. null or empty namespaces list and null namespaceSelector means "this pod's namespace"
-
PodAntiAffinity
Pod anti affinity is a group of inter pod anti affinity scheduling rules.
-
preferredDuringSchedulingIgnoredDuringExecution ([]WeightedPodAffinityTerm)
The scheduler will prefer to schedule pods to nodes that satisfy the anti-affinity expressions specified by this field, but it may choose a node that violates one or more of the expressions. The node that is most preferred is the one with the greatest sum of weights, i.e. for each node that meets all of the scheduling requirements (resource request, requiredDuringScheduling anti-affinity expressions, etc.), compute a sum by iterating through the elements of this field and adding "weight" to the sum if the node has pods which match the corresponding podAffinityTerm; the node(s) with the highest sum are the most preferred.
The weights of all of the matched WeightedPodAffinityTerm fields are added per-node to find the most preferred node(s)
-
preferredDuringSchedulingIgnoredDuringExecution.podAffinityTerm (PodAffinityTerm), required
Required. A pod affinity term, associated with the corresponding weight.
Defines a set of pods (namely those matching the labelSelector relative to the given namespace(s)) that this pod should be co-located (affinity) or not co-located (anti-affinity) with, where co-located is defined as running on a node whose value of the label with key <topologyKey> matches that of any node on which a pod of the set of pods is running.
-
preferredDuringSchedulingIgnoredDuringExecution.podAffinityTerm.topologyKey (string), required
This pod should be co-located (affinity) or not co-located (anti-affinity) with the pods matching the labelSelector in the specified namespaces, where co-located is defined as running on a node whose value of the label with key topologyKey matches that of any node on which any of the selected pods is running. Empty topologyKey is not allowed.
-
preferredDuringSchedulingIgnoredDuringExecution.podAffinityTerm.labelSelector (LabelSelector)
A label query over a set of resources, in this case pods.
-
preferredDuringSchedulingIgnoredDuringExecution.podAffinityTerm.namespaceSelector (LabelSelector)
A label query over the set of namespaces that the term applies to. The term is applied to the union of the namespaces selected by this field and the ones listed in the namespaces field. null selector and null or empty namespaces list means "this pod's namespace". An empty selector ({}) matches all namespaces. This field is beta-level and is only honored when PodAffinityNamespaceSelector feature is enabled.
-
preferredDuringSchedulingIgnoredDuringExecution.podAffinityTerm.namespaces ([]string)
namespaces specifies a static list of namespace names that the term applies to. The term is applied to the union of the namespaces listed in this field and the ones selected by namespaceSelector. null or empty namespaces list and null namespaceSelector means "this pod's namespace"
-
-
preferredDuringSchedulingIgnoredDuringExecution.weight (int32), required
weight associated with matching the corresponding podAffinityTerm, in the range 1-100.
-
-
requiredDuringSchedulingIgnoredDuringExecution ([]PodAffinityTerm)
If the anti-affinity requirements specified by this field are not met at scheduling time, the pod will not be scheduled onto the node. If the anti-affinity requirements specified by this field cease to be met at some point during pod execution (e.g. due to a pod label update), the system may or may not try to eventually evict the pod from its node. When there are multiple elements, the lists of nodes corresponding to each podAffinityTerm are intersected, i.e. all terms must be satisfied.
Defines a set of pods (namely those matching the labelSelector relative to the given namespace(s)) that this pod should be co-located (affinity) or not co-located (anti-affinity) with, where co-located is defined as running on a node whose value of the label with key <topologyKey> matches that of any node on which a pod of the set of pods is running.
-
requiredDuringSchedulingIgnoredDuringExecution.topologyKey (string), required
This pod should be co-located (affinity) or not co-located (anti-affinity) with the pods matching the labelSelector in the specified namespaces, where co-located is defined as running on a node whose value of the label with key topologyKey matches that of any node on which any of the selected pods is running. Empty topologyKey is not allowed.
-
requiredDuringSchedulingIgnoredDuringExecution.labelSelector (LabelSelector)
A label query over a set of resources, in this case pods.
-
requiredDuringSchedulingIgnoredDuringExecution.namespaceSelector (LabelSelector)
A label query over the set of namespaces that the term applies to. The term is applied to the union of the namespaces selected by this field and the ones listed in the namespaces field. null selector and null or empty namespaces list means "this pod's namespace". An empty selector ({}) matches all namespaces. This field is beta-level and is only honored when PodAffinityNamespaceSelector feature is enabled.
-
requiredDuringSchedulingIgnoredDuringExecution.namespaces ([]string)
namespaces specifies a static list of namespace names that the term applies to. The term is applied to the union of the namespaces listed in this field and the ones selected by namespaceSelector. null or empty namespaces list and null namespaceSelector means "this pod's namespace"
-
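A common use of these fields is spreading replicas of the same app across nodes. A minimal sketch of a Pod spec fragment; the app label is an illustrative assumption, and kubernetes.io/hostname is a well-known node label:

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: web                           # hypothetical label on the peer pods
      topologyKey: kubernetes.io/hostname    # "co-located" means "on the same node"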
Probe
Probe describes a health check to be performed against a container to determine whether it is alive or ready to receive traffic.
-
exec (ExecAction)
Exec specifies the action to take.
ExecAction describes a "run in container" action.
-
exec.command ([]string)
Command is the command line to execute inside the container, the working directory for the command is root ('/') in the container's filesystem. The command is simply exec'd, it is not run inside a shell, so traditional shell instructions ('|', etc) won't work. To use a shell, you need to explicitly call out to that shell. Exit status of 0 is treated as live/healthy and non-zero is unhealthy.
-
-
httpGet (HTTPGetAction)
HTTPGet specifies the http request to perform.
HTTPGetAction describes an action based on HTTP Get requests.
-
httpGet.port (IntOrString), required
Name or number of the port to access on the container. Number must be in the range 1 to 65535. Name must be an IANA_SVC_NAME.
IntOrString is a type that can hold an int32 or a string. When used in JSON or YAML marshalling and unmarshalling, it produces or consumes the inner type. This allows you to have, for example, a JSON field that can accept a name or number.
-
httpGet.host (string)
Host name to connect to, defaults to the pod IP. You probably want to set "Host" in httpHeaders instead.
-
httpGet.httpHeaders ([]HTTPHeader)
Custom headers to set in the request. HTTP allows repeated headers.
HTTPHeader describes a custom header to be used in HTTP probes
-
httpGet.httpHeaders.name (string), required
The header field name
-
httpGet.httpHeaders.value (string), required
The header field value
-
-
httpGet.path (string)
Path to access on the HTTP server.
-
httpGet.scheme (string)
Scheme to use for connecting to the host. Defaults to HTTP.
Possible enum values:
- "HTTP" means that the scheme used will be http://
- "HTTPS" means that the scheme used will be https://
-
-
tcpSocket (TCPSocketAction)
TCPSocket specifies an action involving a TCP port.
TCPSocketAction describes an action based on opening a socket
-
tcpSocket.port (IntOrString), required
Number or name of the port to access on the container. Number must be in the range 1 to 65535. Name must be an IANA_SVC_NAME.
IntOrString is a type that can hold an int32 or a string. When used in JSON or YAML marshalling and unmarshalling, it produces or consumes the inner type. This allows you to have, for example, a JSON field that can accept a name or number.
-
tcpSocket.host (string)
Optional: Host name to connect to, defaults to the pod IP.
-
-
initialDelaySeconds (int32)
Number of seconds after the container has started before liveness probes are initiated. More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle#container-probes
-
terminationGracePeriodSeconds (int64)
Optional duration in seconds the pod needs to terminate gracefully upon probe failure. The grace period is the duration in seconds between when the processes running in the pod are sent a termination signal and when the processes are forcibly halted with a kill signal. Set this value longer than the expected cleanup time for your process. If this value is nil, the pod's terminationGracePeriodSeconds will be used. Otherwise, this value overrides the value provided by the pod spec. The value must be a non-negative integer; zero indicates stop immediately via the kill signal (no opportunity to shut down). This is a beta field and requires enabling the ProbeTerminationGracePeriod feature gate. Minimum value is 1. spec.terminationGracePeriodSeconds is used if unset.
-
periodSeconds (int32)
How often (in seconds) to perform the probe. Defaults to 10 seconds. Minimum value is 1.
-
timeoutSeconds (int32)
Number of seconds after which the probe times out. Defaults to 1 second. Minimum value is 1. More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle#container-probes
-
failureThreshold (int32)
Minimum consecutive failures for the probe to be considered failed after having succeeded. Defaults to 3. Minimum value is 1.
-
successThreshold (int32)
Minimum consecutive successes for the probe to be considered successful after having failed. Defaults to 1. Must be 1 for liveness and startup. Minimum value is 1.
-
grpc (GRPCAction)
GRPC specifies an action involving a GRPC port. This is an alpha field and requires enabling GRPCContainerProbe feature gate.
-
grpc.port (int32), required
Port number of the gRPC service. Number must be in the range 1 to 65535.
-
grpc.service (string)
Service is the name of the service to place in the gRPC HealthCheckRequest (see https://github.com/grpc/grpc/blob/master/doc/health-checking.md).
If this is not specified, the default behavior is defined by gRPC.
-
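Putting the action and timing fields together, a readiness probe might look like this minimal sketch; the command and timing values are illustrative assumptions:

readinessProbe:
  exec:
    command: ["cat", "/tmp/ready"]   # exit 0 = ready; exec'd directly, not run in a shell
  initialDelaySeconds: 5
  periodSeconds: 10
  timeoutSeconds: 2
  failureThreshold: 3                # 3 consecutive failures -> container marked not ready
  successThreshold: 1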
PodStatus
PodStatus represents information about the status of a pod. Status may trail the actual state of a system, especially if the node that hosts the pod cannot contact the control plane.
-
nominatedNodeName (string)
nominatedNodeName is set only when this pod preempts other pods on the node, but it cannot be scheduled right away as preemption victims receive their graceful termination periods. This field does not guarantee that the pod will be scheduled on this node. Scheduler may decide to place the pod elsewhere if other nodes become available sooner. Scheduler may also decide to give the resources on this node to a higher priority pod that is created after preemption. As a result, this field may be different than PodSpec.nodeName when the pod is scheduled.
-
hostIP (string)
IP address of the host to which the pod is assigned. Empty if not yet scheduled.
-
startTime (Time)
RFC 3339 date and time at which the object was acknowledged by the Kubelet. This is before the Kubelet pulled the container image(s) for the pod.
Time is a wrapper around time.Time which supports correct marshaling to YAML and JSON. Wrappers are provided for many of the factory methods that the time package offers.
-
phase (string)
The phase of a Pod is a simple, high-level summary of where the Pod is in its lifecycle. The conditions array, the reason and message fields, and the individual container status arrays contain more detail about the pod's status. There are five possible phase values:
- Pending: The pod has been accepted by the Kubernetes system, but one or more of the container images has not been created. This includes time before being scheduled as well as time spent downloading images over the network, which could take a while.
- Running: The pod has been bound to a node, and all of the containers have been created. At least one container is still running, or is in the process of starting or restarting.
- Succeeded: All containers in the pod have terminated in success, and will not be restarted.
- Failed: All containers in the pod have terminated, and at least one container has terminated in failure. The container either exited with non-zero status or was terminated by the system.
- Unknown: For some reason the state of the pod could not be obtained, typically due to an error in communicating with the host of the pod.
More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle#pod-phase
Possible enum values:
- "Failed" means that all containers in the pod have terminated, and at least one container has terminated in a failure (exited with a non-zero exit code or was stopped by the system).
- "Pending" means the pod has been accepted by the system, but one or more of the containers has not been started. This includes time before being bound to a node, as well as time spent pulling images onto the host.
- "Running" means the pod has been bound to a node and all of the containers have been started. At least one container is still running or is in the process of being restarted.
- "Succeeded" means that all containers in the pod have voluntarily terminated with a container exit code of 0, and the system is not going to restart any of these containers.
- "Unknown" means that for some reason the state of the pod could not be obtained, typically due to an error in communicating with the host of the pod. Deprecated: it has not been set since 2015 (74da3b14b0c0f658b3bb8d2def5094686d0e9095).
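These status fields can be read directly from the API object, for example with kubectl's jsonpath output (my-pod is an illustrative pod name):

kubectl get pod my-pod -o jsonpath='{.status.phase}'
kubectl get pod my-pod -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'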
-
message (string)
A human readable message indicating details about why the pod is in this condition.
-
reason (string)
A brief CamelCase message indicating details about why the pod is in this state. e.g. 'Evicted'
-
podIP (string)
IP address allocated to the pod. Routable at least within the cluster. Empty if not yet allocated.
-
podIPs ([]PodIP)
Patch strategy: merge on key ip
podIPs holds the IP addresses allocated to the pod. If this field is specified, the 0th entry must match the podIP field. Pods may be allocated at most 1 value for each of IPv4 and IPv6. This list is empty if no IPs have been allocated yet.
IP address information for entries in the (plural) PodIPs field. Each entry includes: IP: An IP address allocated to the pod. Routable at least within the cluster.
-
podIPs.ip (string)
ip is an IP address (IPv4 or IPv6) assigned to the pod
-
-
conditions ([]PodCondition)
Patch strategy: merge on key type
Current service state of pod. More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle#pod-conditions
PodCondition contains details for the current condition of this pod.
-
conditions.status (string), required
Status is the status of the condition. Can be True, False, Unknown. More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle#pod-conditions
-
conditions.type (string), required
Type is the type of the condition. More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle#pod-conditions
Possible enum values:
- "ContainersReady" indicates whether all containers in the pod are ready.
- "Initialized" means that all init containers in the pod have started successfully.
- "PodScheduled" represents the status of the scheduling process for this pod.
- "Ready" means the pod is able to service requests and should be added to the load balancing pools of all matching services.
-
conditions.lastProbeTime (Time)
Last time we probed the condition.
Time is a wrapper around time.Time which supports correct marshaling to YAML and JSON. Wrappers are provided for many of the factory methods that the time package offers.
-
conditions.lastTransitionTime (Time)
Last time the condition transitioned from one status to another.
Time is a wrapper around time.Time which supports correct marshaling to YAML and JSON. Wrappers are provided for many of the factory methods that the time package offers.
-
conditions.message (string)
Human-readable message indicating details about last transition.
-
conditions.reason (string)
Unique, one-word, CamelCase reason for the condition's last transition.
-
-
qosClass (string)
The Quality of Service (QOS) classification assigned to the pod based on resource requirements. See the PodQOSClass type for available QOS classes. More info: https://git.k8s.io/community/contributors/design-proposals/node/resource-qos.md
Possible enum values:
- "BestEffort" is the BestEffort qos class.
- "Burstable" is the Burstable qos class.
- "Guaranteed" is the Guaranteed qos class.
-
initContainerStatuses ([]ContainerStatus)
The list has one entry per init container in the manifest. The most recent successful init container will have ready = true; the most recently started container will have startTime set. More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle#pod-and-container-status
ContainerStatus contains details for the current status of this container.
-
initContainerStatuses.name (string), required
This must be a DNS_LABEL. Each container in a pod must have a unique name. Cannot be updated.
-
initContainerStatuses.image (string), required
The image the container is running. More info: https://kubernetes.io/docs/concepts/containers/images.
-
initContainerStatuses.imageID (string), required
ImageID of the container's image.
-
initContainerStatuses.containerID (string)
Container's ID in the format 'docker://<container_id>'.
-
initContainerStatuses.state (ContainerState)
Details about the container's current condition.
ContainerState holds a possible state of container. Only one of its members may be specified. If none of them is specified, the default one is ContainerStateWaiting.
-
initContainerStatuses.state.running (ContainerStateRunning)
Details about a running container
-
initContainerStatuses.state.terminated (ContainerStateTerminated)
Details about a terminated container
ContainerStateTerminated is a terminated state of a container.
-
initContainerStatuses.state.terminated.containerID (string)
Container's ID in the format 'docker://<container_id>'
-
initContainerStatuses.state.terminated.exitCode (int32), required
Exit status from the last termination of the container
-
initContainerStatuses.state.terminated.startedAt (Time)
Time at which previous execution of the container started
Time is a wrapper around time.Time which supports correct marshaling to YAML and JSON. Wrappers are provided for many of the factory methods that the time package offers.
-
initContainerStatuses.state.terminated.finishedAt (Time)
Time at which the container last terminated
Time is a wrapper around time.Time which supports correct marshaling to YAML and JSON. Wrappers are provided for many of the factory methods that the time package offers.
-
initContainerStatuses.state.terminated.message (string)
Message regarding the last termination of the container
-
initContainerStatuses.state.terminated.reason (string)
(brief) reason from the last termination of the container
-
initContainerStatuses.state.terminated.signal (int32)
Signal from the last termination of the container
-
-
initContainerStatuses.state.waiting (ContainerStateWaiting)
Details about a waiting container
ContainerStateWaiting is a waiting state of a container.
-
initContainerStatuses.state.waiting.message (string)
Message regarding why the container is not yet running.
-
initContainerStatuses.state.waiting.reason (string)
(brief) reason the container is not yet running.
-
-
-
initContainerStatuses.lastState (ContainerState)
Details about the container's last termination condition.
ContainerState holds a possible state of container. Only one of its members may be specified. If none of them is specified, the default one is ContainerStateWaiting.
-
initContainerStatuses.lastState.running (ContainerStateRunning)
Details about a running container
-
initContainerStatuses.lastState.terminated (ContainerStateTerminated)
Details about a terminated container
ContainerStateTerminated is a terminated state of a container.
-
initContainerStatuses.lastState.terminated.containerID (string)
Container's ID in the format 'docker://<container_id>'
-
initContainerStatuses.lastState.terminated.exitCode (int32), required
Exit status from the last termination of the container
-
initContainerStatuses.lastState.terminated.startedAt (Time)
Time at which previous execution of the container started
Time is a wrapper around time.Time which supports correct marshaling to YAML and JSON. Wrappers are provided for many of the factory methods that the time package offers.
-
initContainerStatuses.lastState.terminated.finishedAt (Time)
Time at which the container last terminated
Time is a wrapper around time.Time which supports correct marshaling to YAML and JSON. Wrappers are provided for many of the factory methods that the time package offers.
-
initContainerStatuses.lastState.terminated.message (string)
Message regarding the last termination of the container
-
initContainerStatuses.lastState.terminated.reason (string)
(brief) reason from the last termination of the container
-
initContainerStatuses.lastState.terminated.signal (int32)
Signal from the last termination of the container
-
-
initContainerStatuses.lastState.waiting (ContainerStateWaiting)
Details about a waiting container
ContainerStateWaiting is a waiting state of a container.
-
initContainerStatuses.lastState.waiting.message (string)
Message regarding why the container is not yet running.
-
initContainerStatuses.lastState.waiting.reason (string)
(brief) reason the container is not yet running.
-
-
-
initContainerStatuses.ready (boolean), required
Specifies whether the container has passed its readiness probe.
-
initContainerStatuses.restartCount (int32), required
The number of times the container has been restarted.
-
initContainerStatuses.started (boolean)
Specifies whether the container has passed its startup probe. Initialized as false, becomes true after startupProbe is considered successful. Resets to false when the container is restarted, or if kubelet loses state temporarily. Is always true when no startupProbe is defined.
-
-
containerStatuses ([]ContainerStatus)
The list has one entry per container in the manifest. Each entry is currently the output of docker inspect; a client-go sketch that reads these status fields appears after this field list. More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle#pod-and-container-status
ContainerStatus contains details for the current status of this container.
-
containerStatuses.name (string), required
This must be a DNS_LABEL. Each container in a pod must have a unique name. Cannot be updated.
-
containerStatuses.image (string), required
The image the container is running. More info: https://kubernetes.io/docs/concepts/containers/images.
-
containerStatuses.imageID (string), required
ImageID of the container's image.
-
containerStatuses.containerID (string)
Container's ID in the format 'docker://<container_id>'.
-
containerStatuses.state (ContainerState)
Details about the container's current condition.
ContainerState holds a possible state of container. Only one of its members may be specified. If none of them is specified, the default one is ContainerStateWaiting.
-
containerStatuses.state.running (ContainerStateRunning)
Details about a running container
-
containerStatuses.state.terminated (ContainerStateTerminated)
Details about a terminated container
ContainerStateTerminated is a terminated state of a container.
-
containerStatuses.state.terminated.containerID (string)
Container's ID in the format 'docker://<container_id>'
-
containerStatuses.state.terminated.exitCode (int32), required
Exit status from the last termination of the container
-
containerStatuses.state.terminated.startedAt (Time)
Time at which previous execution of the container started
Time is a wrapper around time.Time which supports correct marshaling to YAML and JSON. Wrappers are provided for many of the factory methods that the time package offers.
-
containerStatuses.state.terminated.finishedAt (Time)
Time at which the container last terminated
Time is a wrapper around time.Time which supports correct marshaling to YAML and JSON. Wrappers are provided for many of the factory methods that the time package offers.
-
containerStatuses.state.terminated.message (string)
Message regarding the last termination of the container
-
containerStatuses.state.terminated.reason (string)
(brief) reason from the last termination of the container
-
containerStatuses.state.terminated.signal (int32)
Signal from the last termination of the container
-
-
containerStatuses.state.waiting (ContainerStateWaiting)
Details about a waiting container
ContainerStateWaiting is a waiting state of a container.
-
containerStatuses.state.waiting.message (string)
Message regarding why the container is not yet running.
-
containerStatuses.state.waiting.reason (string)
(brief) reason the container is not yet running.
-
-
-
containerStatuses.lastState (ContainerState)
Details about the container's last termination condition.
ContainerState holds a possible state of container. Only one of its members may be specified. If none of them is specified, the default one is ContainerStateWaiting.
-
containerStatuses.lastState.running (ContainerStateRunning)
Details about a running container
-
containerStatuses.lastState.terminated (ContainerStateTerminated)
Details about a terminated container
ContainerStateTerminated is a terminated state of a container.
-
containerStatuses.lastState.terminated.containerID (string)
Container's ID in the format 'docker://<container_id>'
-
containerStatuses.lastState.terminated.exitCode (int32), required
Exit status from the last termination of the container
-
containerStatuses.lastState.terminated.startedAt (Time)
Time at which previous execution of the container started
Time is a wrapper around time.Time which supports correct marshaling to YAML and JSON. Wrappers are provided for many of the factory methods that the time package offers.
-
containerStatuses.lastState.terminated.finishedAt (Time)
Time at which the container last terminated
Time is a wrapper around time.Time which supports correct marshaling to YAML and JSON. Wrappers are provided for many of the factory methods that the time package offers.
-
containerStatuses.lastState.terminated.message (string)
Message regarding the last termination of the container
-
containerStatuses.lastState.terminated.reason (string)
(brief) reason from the last termination of the container
-
containerStatuses.lastState.terminated.signal (int32)
Signal from the last termination of the container
-
-
containerStatuses.lastState.waiting (ContainerStateWaiting)
Details about a waiting container
ContainerStateWaiting is a waiting state of a container.
-
containerStatuses.lastState.waiting.message (string)
Message regarding why the container is not yet running.
-
containerStatuses.lastState.waiting.reason (string)
(brief) reason the container is not yet running.
-
-
-
containerStatuses.ready (boolean), required
Specifies whether the container has passed its readiness probe.
-
containerStatuses.restartCount (int32), required
The number of times the container has been restarted.
-
containerStatuses.started (boolean)
Specifies whether the container has passed its startup probe. Initialized as false, becomes true after startupProbe is considered successful. Resets to false when the container is restarted, or if kubelet loses state temporarily. Is always true when no startupProbe is defined.
-
-
ephemeralContainerStatuses ([]ContainerStatus)
Status for any ephemeral containers that have run in this pod. This field is beta-level and available on clusters that haven't disabled the EphemeralContainers feature gate.
ContainerStatus contains details for the current status of this container.
-
ephemeralContainerStatuses.name (string), required
This must be a DNS_LABEL. Each container in a pod must have a unique name. Cannot be updated.
-
ephemeralContainerStatuses.image (string), required
The image the container is running. More info: https://kubernetes.io/docs/concepts/containers/images.
-
ephemeralContainerStatuses.imageID (string), required
ImageID of the container's image.
-
ephemeralContainerStatuses.containerID (string)
Container's ID in the format 'docker://<container_id>'.
-
ephemeralContainerStatuses.state (ContainerState)
Details about the container's current condition.
ContainerState holds a possible state of container. Only one of its members may be specified. If none of them is specified, the default one is ContainerStateWaiting.
-
ephemeralContainerStatuses.state.running (ContainerStateRunning)
Details about a running container
-
ephemeralContainerStatuses.state.terminated (ContainerStateTerminated)
Details about a terminated container
ContainerStateTerminated is a terminated state of a container.
-
ephemeralContainerStatuses.state.terminated.containerID (string)
Container's ID in the format 'docker://<container_id>'
-
ephemeralContainerStatuses.state.terminated.exitCode (int32), required
Exit status from the last termination of the container
-
ephemeralContainerStatuses.state.terminated.startedAt (Time)
Time at which previous execution of the container started
Time is a wrapper around time.Time which supports correct marshaling to YAML and JSON. Wrappers are provided for many of the factory methods that the time package offers.
-
ephemeralContainerStatuses.state.terminated.finishedAt (Time)
Time at which the container last terminated
Time is a wrapper around time.Time which supports correct marshaling to YAML and JSON. Wrappers are provided for many of the factory methods that the time package offers.
-
ephemeralContainerStatuses.state.terminated.message (string)
Message regarding the last termination of the container
-
ephemeralContainerStatuses.state.terminated.reason (string)
(brief) reason from the last termination of the container
-
ephemeralContainerStatuses.state.terminated.signal (int32)
Signal from the last termination of the container
-
-
ephemeralContainerStatuses.state.waiting (ContainerStateWaiting)
Details about a waiting container
ContainerStateWaiting is a waiting state of a container.
-
ephemeralContainerStatuses.state.waiting.message (string)
Message regarding why the container is not yet running.
-
ephemeralContainerStatuses.state.waiting.reason (string)
(brief) reason the container is not yet running.
-
-
-
ephemeralContainerStatuses.lastState (ContainerState)
Details about the container's last termination condition.
ContainerState holds a possible state of container. Only one of its members may be specified. If none of them is specified, the default one is ContainerStateWaiting.
-
ephemeralContainerStatuses.lastState.running (ContainerStateRunning)
Details about a running container
-
ephemeralContainerStatuses.lastState.terminated (ContainerStateTerminated)
Details about a terminated container
ContainerStateTerminated is a terminated state of a container.
-
ephemeralContainerStatuses.lastState.terminated.containerID (string)
Container's ID in the format 'docker://<container_id>'
-
ephemeralContainerStatuses.lastState.terminated.exitCode (int32), required
Exit status from the last termination of the container
-
ephemeralContainerStatuses.lastState.terminated.startedAt (Time)
Time at which previous execution of the container started
Time is a wrapper around time.Time which supports correct marshaling to YAML and JSON. Wrappers are provided for many of the factory methods that the time package offers.
-
ephemeralContainerStatuses.lastState.terminated.finishedAt (Time)
Time at which the container last terminated
Time is a wrapper around time.Time which supports correct marshaling to YAML and JSON. Wrappers are provided for many of the factory methods that the time package offers.
-
ephemeralContainerStatuses.lastState.terminated.message (string)
Message regarding the last termination of the container
-
ephemeralContainerStatuses.lastState.terminated.reason (string)
(brief) reason from the last termination of the container
-
ephemeralContainerStatuses.lastState.terminated.signal (int32)
Signal from the last termination of the container
-
-
ephemeralContainerStatuses.lastState.waiting (ContainerStateWaiting)
Details about a waiting container
ContainerStateWaiting is a waiting state of a container.
-
ephemeralContainerStatuses.lastState.waiting.message (string)
Message regarding why the container is not yet running.
-
ephemeralContainerStatuses.lastState.waiting.reason (string)
(brief) reason the container is not yet running.
-
-
-
ephemeralContainerStatuses.ready (boolean), required
Specifies whether the container has passed its readiness probe.
-
ephemeralContainerStatuses.restartCount (int32), required
The number of times the container has been restarted.
-
ephemeralContainerStatuses.started (boolean)
Specifies whether the container has passed its startup probe. Initialized as false, becomes true after startupProbe is considered successful. Resets to false when the container is restarted, or if kubelet loses state temporarily. Is always true when no startupProbe is defined.
-
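Taken together, the status fields above are what clients typically inspect after fetching a Pod. The following is a minimal client-go sketch, not part of the API reference itself; the kubeconfig path, namespace, and pod name are illustrative assumptions. It reads the Ready condition, the QoS class, and per-container restart counts:

package main

import (
	"context"
	"fmt"
	"path/filepath"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	// Assumption: kubeconfig at the default location.
	kubeconfig := filepath.Join(homedir.HomeDir(), ".kube", "config")
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// "default" and "my-pod" are placeholder names.
	pod, err := clientset.CoreV1().Pods("default").Get(context.TODO(), "my-pod", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// conditions: find the Ready condition by its type.
	for _, c := range pod.Status.Conditions {
		if c.Type == corev1.PodReady {
			fmt.Printf("Ready=%s reason=%q lastTransition=%s\n", c.Status, c.Reason, c.LastTransitionTime)
		}
	}

	// qosClass: BestEffort, Burstable, or Guaranteed.
	fmt.Println("qosClass:", pod.Status.QOSClass)

	// containerStatuses: one entry per container in the manifest.
	for _, cs := range pod.Status.ContainerStatuses {
		fmt.Printf("container %s: ready=%t restarts=%d\n", cs.Name, cs.Ready, cs.RestartCount)
	}
}

The later sketches on this page assume a clientset built exactly like this one.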
PodList
PodList is a list of Pods.
-
items ([]Pod), required
List of pods. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md
-
apiVersion (string)
APIVersion defines the versioned schema of this representation of an object. Servers should convert recognized schemas to the latest internal value, and may reject unrecognized values. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources
-
kind (string)
Kind is a string value representing the REST resource this object represents. Servers may infer this from the endpoint the client submits requests to. Cannot be updated. In CamelCase. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds
-
metadata (ListMeta)
Standard list metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds
Operations
get
read the specified Pod
HTTP Request
GET /api/v1/namespaces/{namespace}/pods/{name}
Parameters
-
name (in path): string, required
name of the Pod
-
namespace (in path): string, required
-
pretty (in query): string
Response
200 (Pod): OK
401: Unauthorized
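In client-go this endpoint is clientset.CoreV1().Pods(namespace).Get(...). The same request can also be issued through the raw REST client, which makes the path mapping explicit. A sketch, with the clientset built as in the Pod status example above (imports assumed: context, corev1 "k8s.io/api/core/v1", "k8s.io/client-go/kubernetes"):

func getPodRaw(ctx context.Context, clientset *kubernetes.Clientset, ns, name string) (*corev1.Pod, error) {
	// Issues GET /api/v1/namespaces/{namespace}/pods/{name} and decodes the response into a Pod.
	pod := &corev1.Pod{}
	err := clientset.CoreV1().RESTClient().
		Get().
		Namespace(ns).
		Resource("pods").
		Name(name).
		Do(ctx).
		Into(pod)
	return pod, err
}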
get
read ephemeralcontainers of the specified Pod
HTTP Request
GET /api/v1/namespaces/{namespace}/pods/{name}/ephemeralcontainers
Parameters
-
name (in path): string, required
name of the Pod
-
namespace (in path): string, required
-
pretty (in query): string
Response
200 (Pod): OK
401: Unauthorized
get
read log of the specified Pod
HTTP Request
GET /api/v1/namespaces/{namespace}/pods/{name}/log
Parameters
-
name (in path): string, required
name of the Pod
-
namespace (in path): string, required
-
container (in query): string
The container for which to stream logs. Defaults to only container if there is one container in the pod.
-
follow (in query): boolean
Follow the log stream of the pod. Defaults to false.
-
insecureSkipTLSVerifyBackend (in query): boolean
insecureSkipTLSVerifyBackend indicates that the apiserver should not confirm the validity of the serving certificate of the backend it is connecting to. This will make the HTTPS connection between the apiserver and the backend insecure. This means the apiserver cannot verify the log data it is receiving came from the real kubelet. If the kubelet is configured to verify the apiserver's TLS credentials, the connection to the real kubelet is still not vulnerable to a man-in-the-middle attack (that is, an attacker could not intercept the actual log data coming from the real kubelet).
-
limitBytes (in query): integer
If set, the number of bytes to read from the server before terminating the log output. This may not display a complete final line of logging, and may return slightly more or slightly less than the specified limit.
-
pretty (in query): string
-
previous (in query): boolean
Return previous terminated container logs. Defaults to false.
-
sinceSeconds (in query): integer
A relative time in seconds before the current time from which to show logs. If this value precedes the time a pod was started, only logs since the pod start will be returned. If this value is in the future, no logs will be returned. Only one of sinceSeconds or sinceTime may be specified.
-
tailLines (in query): integer
If set, the number of lines from the end of the logs to show. If not specified, logs are shown from the creation of the container, or from sinceSeconds or sinceTime if one of those is set.
-
timestamps (in query): boolean
If true, add an RFC3339 or RFC3339Nano timestamp at the beginning of every line of log output. Defaults to false.
Response
200 (string): OK
401: Unauthorized
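Most of the query parameters above map one-to-one onto PodLogOptions in client-go. A minimal sketch that streams the last 100 lines with timestamps; the container name "app" is a placeholder, and the clientset is built as in the earlier example (imports assumed: context, io, os, corev1 "k8s.io/api/core/v1", "k8s.io/client-go/kubernetes"):

func streamPodLogs(ctx context.Context, clientset *kubernetes.Clientset, ns, name string) error {
	tail := int64(100)
	opts := &corev1.PodLogOptions{
		Container:  "app", // the container query parameter; placeholder name
		Follow:     true,  // follow=true keeps the stream open
		TailLines:  &tail, // tailLines=100
		Timestamps: true,  // timestamps=true adds RFC3339 prefixes
	}
	stream, err := clientset.CoreV1().Pods(ns).GetLogs(name, opts).Stream(ctx)
	if err != nil {
		return err
	}
	defer stream.Close()
	// Copy the log stream to stdout until the server closes it.
	_, err = io.Copy(os.Stdout, stream)
	return err
}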
get
read status of the specified Pod
HTTP Request
GET /api/v1/namespaces/{namespace}/pods/{name}/status
Parameters
-
name (in path): string, required
name of the Pod
-
namespace (in path): string, required
-
pretty (in query): string
Response
200 (Pod): OK
401: Unauthorized
list
list or watch objects of kind Pod
HTTP Request
GET /api/v1/namespaces/{namespace}/pods
Parameters
-
namespace (in path): string, required
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (PodList): OK
401: Unauthorized
list
list or watch objects of kind Pod
HTTP Request
GET /api/v1/pods
Parameters
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (PodList): OK
401: Unauthorized
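Both list endpoints above are reached through the same typed client; the namespace argument decides which one. A sketch, with placeholder selectors and the clientset from the earlier example (imports assumed: context, fmt, metav1 "k8s.io/apimachinery/pkg/apis/meta/v1", "k8s.io/client-go/kubernetes"):

func listAndWatchPods(ctx context.Context, clientset *kubernetes.Clientset) error {
	// Namespaced list; LabelSelector and FieldSelector map directly to the
	// labelSelector and fieldSelector query parameters.
	pods, err := clientset.CoreV1().Pods("default").List(ctx, metav1.ListOptions{
		LabelSelector: "app=web", // placeholder selector
		FieldSelector: "status.phase=Running",
		Limit:         50,
	})
	if err != nil {
		return err
	}
	fmt.Println("matching pods:", len(pods.Items))

	// An empty namespace issues GET /api/v1/pods, i.e. lists across all
	// namespaces. watch=true is exposed as a separate Watch call.
	w, err := clientset.CoreV1().Pods("").Watch(ctx, metav1.ListOptions{LabelSelector: "app=web"})
	if err != nil {
		return err
	}
	defer w.Stop()
	for ev := range w.ResultChan() {
		fmt.Println("event:", ev.Type) // e.g. ADDED, MODIFIED, DELETED
	}
	return nil
}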
create
create a Pod
HTTP Request
POST /api/v1/namespaces/{namespace}/pods
Parameters
-
namespace (in path): string, required
-
body: Pod, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (Pod): OK
201 (Pod): Created
202 (Pod): Accepted
401: Unauthorized
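A hedged creation sketch using the typed client; the pod name, image, and field manager are placeholders, and dryRun is shown because it is easy to miss that it takes a list value (imports assumed: context, corev1 "k8s.io/api/core/v1", metav1 "k8s.io/apimachinery/pkg/apis/meta/v1", "k8s.io/client-go/kubernetes"):

func createPod(ctx context.Context, clientset *kubernetes.Clientset) error {
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "demo", Namespace: "default"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{Name: "web", Image: "nginx:1.25"}},
		},
	}
	// DryRun maps to the dryRun query parameter; FieldManager to fieldManager.
	_, err := clientset.CoreV1().Pods("default").Create(ctx, pod, metav1.CreateOptions{
		DryRun:       []string{metav1.DryRunAll}, // validate and admit, but do not persist
		FieldManager: "example-manager",          // placeholder manager name
	})
	return err
}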
update
replace the specified Pod
HTTP Request
PUT /api/v1/namespaces/{namespace}/pods/{name}
Parameters
-
name (in path): string, required
name of the Pod
-
namespace (in path): string, required
-
body: Pod, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (Pod): OK
201 (Pod): Created
401: Unauthorized
update
replace ephemeralcontainers of the specified Pod
HTTP Request
PUT /api/v1/namespaces/{namespace}/pods/{name}/ephemeralcontainers
Parameters
-
name (in path): string, required
name of the Pod
-
namespace (in path): string, required
-
body: Pod, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (Pod): OK
201 (Pod): Created
401: Unauthorized
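Recent client-go versions expose this subresource through UpdateEphemeralContainers on the typed Pods client; this is roughly what kubectl debug does under the hood. A sketch with a placeholder container name and image (imports assumed: context, corev1 "k8s.io/api/core/v1", metav1 "k8s.io/apimachinery/pkg/apis/meta/v1", "k8s.io/client-go/kubernetes"):

func addDebugContainer(ctx context.Context, clientset *kubernetes.Clientset, ns, name string) error {
	pod, err := clientset.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	// Ephemeral containers are appended to the existing list; once added they
	// may not be changed or removed.
	pod.Spec.EphemeralContainers = append(pod.Spec.EphemeralContainers, corev1.EphemeralContainer{
		EphemeralContainerCommon: corev1.EphemeralContainerCommon{
			Name:  "debugger",     // placeholder name
			Image: "busybox:1.36", // placeholder image
			Stdin: true,
			TTY:   true,
		},
	})
	// PUT /api/v1/namespaces/{ns}/pods/{name}/ephemeralcontainers with the full Pod as body.
	_, err = clientset.CoreV1().Pods(ns).UpdateEphemeralContainers(ctx, name, pod, metav1.UpdateOptions{})
	return err
}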
update
replace status of the specified Pod
HTTP Request
PUT /api/v1/namespaces/{namespace}/pods/{name}/status
Parameters
-
name (in path): string, required
name of the Pod
-
namespace (in path): string, required
-
body: Pod, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (Pod): OK
201 (Pod): Created
401: Unauthorized
patch
partially update the specified Pod
HTTP Request
PATCH /api/v1/namespaces/{namespace}/pods/{name}
Parameters
-
name (in path): string, required
name of the Pod
-
namespace (in path): string, required
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (Pod): OK
201 (Pod): Created
401: Unauthorized
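A minimal patch sketch: the raw body becomes the request body, and PatchOptions carry the query parameters above. The label being added and the field manager are placeholders (imports assumed: context, metav1 "k8s.io/apimachinery/pkg/apis/meta/v1", "k8s.io/apimachinery/pkg/types", "k8s.io/client-go/kubernetes"):

func labelPod(ctx context.Context, clientset *kubernetes.Clientset, ns, name string) error {
	// A strategic-merge patch that adds one label to the pod's metadata.
	patch := []byte(`{"metadata":{"labels":{"debug":"true"}}}`)
	_, err := clientset.CoreV1().Pods(ns).Patch(ctx, name, types.StrategicMergePatchType, patch, metav1.PatchOptions{
		FieldManager: "example-manager", // placeholder manager name
	})
	return err
}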
patch
partially update ephemeralcontainers of the specified Pod
HTTP Request
PATCH /api/v1/namespaces/{namespace}/pods/{name}/ephemeralcontainers
Parameters
-
name (in path): string, required
name of the Pod
-
namespace (in path): string, required
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (Pod): OK
201 (Pod): Created
401: Unauthorized
patch
partially update status of the specified Pod
HTTP Request
PATCH /api/v1/namespaces/{namespace}/pods/{name}/status
Parameters
-
name (in path): string, required
name of the Pod
-
namespace (in path): string, required
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (Pod): OK
201 (Pod): Created
401: Unauthorized
delete
delete a Pod
HTTP Request
DELETE /api/v1/namespaces/{namespace}/pods/{name}
Parameters
-
name (in path): string, required
name of the Pod
-
namespace (in path): string, required
-
body: DeleteOptions
-
dryRun (in query): string
-
gracePeriodSeconds (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
Response
200 (Pod): OK
202 (Pod): Accepted
401: Unauthorized
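A deletion sketch showing how the two most commonly tuned query parameters surface in client-go; the 30-second grace period is an illustrative choice (imports assumed: context, metav1 "k8s.io/apimachinery/pkg/apis/meta/v1", "k8s.io/client-go/kubernetes"):

func deletePod(ctx context.Context, clientset *kubernetes.Clientset, ns, name string) error {
	grace := int64(30)
	policy := metav1.DeletePropagationBackground
	// GracePeriodSeconds maps to the gracePeriodSeconds query parameter,
	// PropagationPolicy to propagationPolicy.
	return clientset.CoreV1().Pods(ns).Delete(ctx, name, metav1.DeleteOptions{
		GracePeriodSeconds: &grace,
		PropagationPolicy:  &policy,
	})
}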
deletecollection
delete collection of Pod
HTTP Request
DELETE /api/v1/namespaces/{namespace}/pods
Parameters
-
namespace (in path): string, required
-
body: DeleteOptions
-
continue (in query): string
-
dryRun (in query): string
-
fieldSelector (in query): string
-
gracePeriodSeconds (in query): integer
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
Response
200 (Status): OK
401: Unauthorized
6.5.1.2 - PodTemplate
apiVersion: v1
import "k8s.io/api/core/v1"
PodTemplate
PodTemplate describes a template for creating copies of a predefined pod.
-
apiVersion: v1
-
kind: PodTemplate
-
metadata (ObjectMeta)
Standard object's metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
template (PodTemplateSpec)
Template defines the pods that will be created from this pod template. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#spec-and-status
PodTemplateSpec
PodTemplateSpec describes the data a pod should have when created from a template
-
metadata (ObjectMeta)
Standard object's metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
spec (PodSpec)
Specification of the desired behavior of the pod. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#spec-and-status
PodTemplateList
PodTemplateList is a list of PodTemplates.
-
apiVersion: v1
-
kind: PodTemplateList
-
metadata (ListMeta)
Standard list metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds
-
items ([]PodTemplate), required
List of pod templates
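A sketch of constructing and creating a PodTemplate with client-go; names, labels, and the image are placeholders, and the clientset is built as in the Pod status example (imports assumed: context, corev1 "k8s.io/api/core/v1", metav1 "k8s.io/apimachinery/pkg/apis/meta/v1", "k8s.io/client-go/kubernetes"):

func createPodTemplate(ctx context.Context, clientset *kubernetes.Clientset) error {
	tpl := &corev1.PodTemplate{
		ObjectMeta: metav1.ObjectMeta{Name: "web-template", Namespace: "default"},
		// Template carries the metadata and spec that pods created from this
		// template will receive.
		Template: corev1.PodTemplateSpec{
			ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{"app": "web"}},
			Spec: corev1.PodSpec{
				Containers: []corev1.Container{{Name: "web", Image: "nginx:1.25"}},
			},
		},
	}
	_, err := clientset.CoreV1().PodTemplates("default").Create(ctx, tpl, metav1.CreateOptions{})
	return err
}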
Operations
get
read the specified PodTemplate
HTTP Request
GET /api/v1/namespaces/{namespace}/podtemplates/{name}
Parameters
-
name (in path): string, required
name of the PodTemplate
-
namespace (in path): string, required
-
pretty (in query): string
Response
200 (PodTemplate): OK
401: Unauthorized
list
list or watch objects of kind PodTemplate
HTTP Request
GET /api/v1/namespaces/{namespace}/podtemplates
Parameters
-
namespace (in path): string, required
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (PodTemplateList): OK
401: Unauthorized
list
list or watch objects of kind PodTemplate
HTTP Request
GET /api/v1/podtemplates
Parameters
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (PodTemplateList): OK
401: Unauthorized
create
create a PodTemplate
HTTP Request
POST /api/v1/namespaces/{namespace}/podtemplates
Parameters
-
namespace (in path): string, required
-
body: PodTemplate, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (PodTemplate): OK
201 (PodTemplate): Created
202 (PodTemplate): Accepted
401: Unauthorized
update
replace the specified PodTemplate
HTTP Request
PUT /api/v1/namespaces/{namespace}/podtemplates/{name}
Parameters
-
name (in path): string, required
name of the PodTemplate
-
namespace (in path): string, required
-
body: PodTemplate, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (PodTemplate): OK
201 (PodTemplate): Created
401: Unauthorized
patch
partially update the specified PodTemplate
HTTP Request
PATCH /api/v1/namespaces/{namespace}/podtemplates/{name}
Parameters
-
name (in path): string, required
name of the PodTemplate
-
namespace (in path): string, required
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (PodTemplate): OK
201 (PodTemplate): Created
401: Unauthorized
delete
delete a PodTemplate
HTTP Request
DELETE /api/v1/namespaces/{namespace}/podtemplates/{name}
Parameters
-
name (in path): string, required
name of the PodTemplate
-
namespace (in path): string, required
-
body: DeleteOptions
-
dryRun (in query): string
-
gracePeriodSeconds (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
Response
200 (PodTemplate): OK
202 (PodTemplate): Accepted
401: Unauthorized
deletecollection
delete collection of PodTemplate
HTTP Request
DELETE /api/v1/namespaces/{namespace}/podtemplates
Parameters
-
namespace (in path): string, required
-
body: DeleteOptions
-
continue (in query): string
-
dryRun (in query): string
-
fieldSelector (in query): string
-
gracePeriodSeconds (in query): integer
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
Response
200 (Status): OK
401: Unauthorized
6.5.1.3 - ReplicationController
apiVersion: v1
import "k8s.io/api/core/v1"
ReplicationController
ReplicationController represents the configuration of a replication controller.
-
apiVersion: v1
-
kind: ReplicationController
-
metadata (ObjectMeta)
If the Labels of a ReplicationController are empty, they are defaulted to be the same as the Pod(s) that the replication controller manages. Standard object's metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
spec (ReplicationControllerSpec)
Spec defines the specification of the desired behavior of the replication controller. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#spec-and-status
-
status (ReplicationControllerStatus)
Status is the most recently observed status of the replication controller. This data may be out of date by some window of time. Populated by the system. Read-only. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#spec-and-status
ReplicationControllerSpec
ReplicationControllerSpec is the specification of a replication controller.
-
selector (map[string]string)
Selector is a label query over pods that should match the Replicas count. If Selector is empty, it is defaulted to the labels present on the Pod template. Label keys and values that must match in order to be controlled by this replication controller, if empty defaulted to labels on Pod template. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#label-selectors
-
template (PodTemplateSpec)
Template is the object that describes the pod that will be created if insufficient replicas are detected. This takes precedence over a TemplateRef. More info: https://kubernetes.io/docs/concepts/workloads/controllers/replicationcontroller#pod-template
-
replicas (int32)
Replicas is the number of desired replicas. This is a pointer to distinguish between explicit zero and unspecified. Defaults to 1. More info: https://kubernetes.io/docs/concepts/workloads/controllers/replicationcontroller#what-is-a-replicationcontroller
-
minReadySeconds (int32)
Minimum number of seconds for which a newly created pod should be ready without any of its container crashing, for it to be considered available. Defaults to 0 (pod will be considered available as soon as it is ready)
ReplicationControllerStatus
ReplicationControllerStatus represents the current status of a replication controller.
-
replicas (int32), required
Replicas is the most recently observed number of replicas. More info: https://kubernetes.io/docs/concepts/workloads/controllers/replicationcontroller#what-is-a-replicationcontroller
-
availableReplicas (int32)
The number of available replicas (ready for at least minReadySeconds) for this replication controller.
-
readyReplicas (int32)
The number of ready replicas for this replication controller.
-
fullyLabeledReplicas (int32)
The number of pods that have labels matching the labels of the pod template of the replication controller.
-
conditions ([]ReplicationControllerCondition)
Patch strategy: merge on key "type"
Represents the latest available observations of a replication controller's current state.
ReplicationControllerCondition describes the state of a replication controller at a certain point.
-
conditions.status (string), required
Status of the condition, one of True, False, Unknown.
-
conditions.type (string), required
Type of replication controller condition.
-
conditions.lastTransitionTime (Time)
The last time the condition transitioned from one status to another.
Time is a wrapper around time.Time which supports correct marshaling to YAML and JSON. Wrappers are provided for many of the factory methods that the time package offers.
-
conditions.message (string)
A human readable message indicating details about the transition.
-
conditions.reason (string)
The reason for the condition's last transition.
-
-
observedGeneration (int64)
ObservedGeneration reflects the generation of the most recently observed replication controller.
ReplicationControllerList
ReplicationControllerList is a collection of replication controllers.
-
apiVersion: v1
-
kind: ReplicationControllerList
-
metadata (ListMeta)
Standard list metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds
-
items ([]ReplicationController), required
List of replication controllers. More info: https://kubernetes.io/docs/concepts/workloads/controllers/replicationcontroller
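The spec fields above come together as in this hedged construction sketch; names, labels, and the image are placeholders (imports assumed: context, corev1 "k8s.io/api/core/v1", metav1 "k8s.io/apimachinery/pkg/apis/meta/v1", "k8s.io/client-go/kubernetes"):

func createRC(ctx context.Context, clientset *kubernetes.Clientset) error {
	replicas := int32(3)
	rc := &corev1.ReplicationController{
		ObjectMeta: metav1.ObjectMeta{Name: "web-rc", Namespace: "default"},
		Spec: corev1.ReplicationControllerSpec{
			Replicas: &replicas, // pointer distinguishes explicit zero from unset
			Selector: map[string]string{"app": "web"},
			// The template labels must satisfy the selector; if the selector
			// were empty it would default to these labels.
			Template: &corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{"app": "web"}},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{Name: "web", Image: "nginx:1.25"}},
				},
			},
		},
	}
	_, err := clientset.CoreV1().ReplicationControllers("default").Create(ctx, rc, metav1.CreateOptions{})
	return err
}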
Operations
get
read the specified ReplicationController
HTTP Request
GET /api/v1/namespaces/{namespace}/replicationcontrollers/{name}
Parameters
-
name (in path): string, required
name of the ReplicationController
-
namespace (in path): string, required
-
pretty (in query): string
Response
200 (ReplicationController): OK
401: Unauthorized
get
read status of the specified ReplicationController
HTTP Request
GET /api/v1/namespaces/{namespace}/replicationcontrollers/{name}/status
Parameters
-
name (in path): string, required
name of the ReplicationController
-
namespace (in path): string, required
-
pretty (in query): string
Response
200 (ReplicationController): OK
401: Unauthorized
list
list or watch objects of kind ReplicationController
HTTP Request
GET /api/v1/namespaces/{namespace}/replicationcontrollers
Parameters
-
namespace (in path): string, required
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (ReplicationControllerList): OK
401: Unauthorized
list
list or watch objects of kind ReplicationController
HTTP Request
GET /api/v1/replicationcontrollers
Parameters
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (ReplicationControllerList): OK
401: Unauthorized
create
create a ReplicationController
HTTP Request
POST /api/v1/namespaces/{namespace}/replicationcontrollers
Parameters
-
namespace (in path): string, required
-
body: ReplicationController, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (ReplicationController): OK
201 (ReplicationController): Created
202 (ReplicationController): Accepted
401: Unauthorized
update
replace the specified ReplicationController
HTTP Request
PUT /api/v1/namespaces/{namespace}/replicationcontrollers/{name}
Parameters
-
name (in path): string, required
name of the ReplicationController
-
namespace (in path): string, required
-
body: ReplicationController, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (ReplicationController): OK
201 (ReplicationController): Created
401: Unauthorized
update
replace status of the specified ReplicationController
HTTP Request
PUT /api/v1/namespaces/{namespace}/replicationcontrollers/{name}/status
Parameters
-
name (in path): string, required
name of the ReplicationController
-
namespace (in path): string, required
-
body: ReplicationController, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (ReplicationController): OK
201 (ReplicationController): Created
401: Unauthorized
patch
partially update the specified ReplicationController
HTTP Request
PATCH /api/v1/namespaces/{namespace}/replicationcontrollers/{name}
Parameters
-
name (in path): string, required
name of the ReplicationController
-
namespace (in path): string, required
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (ReplicationController): OK
201 (ReplicationController): Created
401: Unauthorized
patch
partially update status of the specified ReplicationController
HTTP Request
PATCH /api/v1/namespaces/{namespace}/replicationcontrollers/{name}/status
Parameters
-
name (in path): string, required
name of the ReplicationController
-
namespace (in path): string, required
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (ReplicationController): OK
201 (ReplicationController): Created
401: Unauthorized
delete
delete a ReplicationController
HTTP Request
DELETE /api/v1/namespaces/{namespace}/replicationcontrollers/{name}
Parameters
-
name (in path): string, required
name of the ReplicationController
-
namespace (in path): string, required
-
body: DeleteOptions
-
dryRun (in query): string
-
gracePeriodSeconds (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
Response
200 (Status): OK
202 (Status): Accepted
401: Unauthorized
deletecollection
delete collection of ReplicationController
HTTP Request
DELETE /api/v1/namespaces/{namespace}/replicationcontrollers
Parameters
-
namespace (in path): string, required
-
body: DeleteOptions
-
continue (in query): string
-
dryRun (in query): string
-
fieldSelector (in query): string
-
gracePeriodSeconds (in query): integer
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
Response
200 (Status): OK
401: Unauthorized
6.5.1.4 - ReplicaSet
apiVersion: apps/v1
import "k8s.io/api/apps/v1"
ReplicaSet
ReplicaSet ensures that a specified number of pod replicas are running at any given time.
-
apiVersion: apps/v1
-
kind: ReplicaSet
-
metadata (ObjectMeta)
If the Labels of a ReplicaSet are empty, they are defaulted to be the same as the Pod(s) that the ReplicaSet manages. Standard object's metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
spec (ReplicaSetSpec)
Spec defines the specification of the desired behavior of the ReplicaSet. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#spec-and-status
-
status (ReplicaSetStatus)
Status is the most recently observed status of the ReplicaSet. This data may be out of date by some window of time. Populated by the system. Read-only. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#spec-and-status
ReplicaSetSpec
ReplicaSetSpec is the specification of a ReplicaSet.
-
selector (LabelSelector), required
Selector is a label query over pods that should match the replica count. Label keys and values that must match in order to be controlled by this replica set. It must match the pod template's labels. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#label-selectors
-
template (PodTemplateSpec)
Template is the object that describes the pod that will be created if insufficient replicas are detected. More info: https://kubernetes.io/docs/concepts/workloads/controllers/replicationcontroller#pod-template
-
replicas (int32)
Replicas is the number of desired replicas. This is a pointer to distinguish between explicit zero and unspecified. Defaults to 1. More info: https://kubernetes.io/docs/concepts/workloads/controllers/replicationcontroller/#what-is-a-replicationcontroller
-
minReadySeconds (int32)
Minimum number of seconds for which a newly created pod should be ready without any of its container crashing, for it to be considered available. Defaults to 0 (pod will be considered available as soon as it is ready)
ReplicaSetStatus
ReplicaSetStatus represents the current status of a ReplicaSet.
-
replicas (int32), required
Replicas is the most recently observed number of replicas. More info: https://kubernetes.io/docs/concepts/workloads/controllers/replicationcontroller/#what-is-a-replicationcontroller
-
availableReplicas (int32)
The number of available replicas (ready for at least minReadySeconds) for this replica set.
-
readyReplicas (int32)
readyReplicas is the number of pods targeted by this ReplicaSet with a Ready Condition.
-
fullyLabeledReplicas (int32)
The number of pods that have labels matching the labels of the pod template of the replicaset.
-
conditions ([]ReplicaSetCondition)
Patch strategy: merge on key "type"
Represents the latest available observations of a replica set's current state.
ReplicaSetCondition describes the state of a replica set at a certain point.
-
conditions.status (string), required
Status of the condition, one of True, False, Unknown.
-
conditions.type (string), required
Type of replica set condition.
-
conditions.lastTransitionTime (Time)
The last time the condition transitioned from one status to another.
Time is a wrapper around time.Time which supports correct marshaling to YAML and JSON. Wrappers are provided for many of the factory methods that the time package offers.
-
conditions.message (string)
A human readable message indicating details about the transition.
-
conditions.reason (string)
The reason for the condition's last transition.
-
-
observedGeneration (int64)
ObservedGeneration reflects the generation of the most recently observed ReplicaSet.
ReplicaSetList
ReplicaSetList is a collection of ReplicaSets.
-
apiVersion: apps/v1
-
kind: ReplicaSetList
-
metadata (ListMeta)
Standard list metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds
-
items ([]ReplicaSet), required
List of ReplicaSets. More info: https://kubernetes.io/docs/concepts/workloads/controllers/replicationcontroller
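A construction sketch mirroring the ReplicationController one above, highlighting that a ReplicaSet's selector is a full, required LabelSelector rather than a plain label map; names and the image are placeholders (imports assumed: context, appsv1 "k8s.io/api/apps/v1", corev1 "k8s.io/api/core/v1", metav1 "k8s.io/apimachinery/pkg/apis/meta/v1", "k8s.io/client-go/kubernetes"):

func createReplicaSet(ctx context.Context, clientset *kubernetes.Clientset) error {
	replicas := int32(3)
	rs := &appsv1.ReplicaSet{
		ObjectMeta: metav1.ObjectMeta{Name: "web-rs", Namespace: "default"},
		Spec: appsv1.ReplicaSetSpec{
			Replicas: &replicas,
			// Required, and it must match the pod template's labels.
			Selector: &metav1.LabelSelector{MatchLabels: map[string]string{"app": "web"}},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{"app": "web"}},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{Name: "web", Image: "nginx:1.25"}},
				},
			},
		},
	}
	_, err := clientset.AppsV1().ReplicaSets("default").Create(ctx, rs, metav1.CreateOptions{})
	return err
}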
Operations
get
read the specified ReplicaSet
HTTP Request
GET /apis/apps/v1/namespaces/{namespace}/replicasets/{name}
Parameters
-
name (in path): string, required
name of the ReplicaSet
-
namespace (in path): string, required
-
pretty (in query): string
Response
200 (ReplicaSet): OK
401: Unauthorized
get
read status of the specified ReplicaSet
HTTP Request
GET /apis/apps/v1/namespaces/{namespace}/replicasets/{name}/status
Parameters
-
name (in path): string, required
name of the ReplicaSet
-
namespace (in path): string, required
-
pretty (in query): string
Response
200 (ReplicaSet): OK
401: Unauthorized
list
list or watch objects of kind ReplicaSet
HTTP Request
GET /apis/apps/v1/namespaces/{namespace}/replicasets
Parameters
-
namespace (in path): string, required
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (ReplicaSetList): OK
401: Unauthorized
list
list or watch objects of kind ReplicaSet
HTTP Request
GET /apis/apps/v1/replicasets
Parameters
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (ReplicaSetList): OK
401: Unauthorized
create
create a ReplicaSet
HTTP Request
POST /apis/apps/v1/namespaces/{namespace}/replicasets
Parameters
-
namespace (in path): string, required
-
body: ReplicaSet, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (ReplicaSet): OK
201 (ReplicaSet): Created
202 (ReplicaSet): Accepted
401: Unauthorized
update
replace the specified ReplicaSet
HTTP Request
PUT /apis/apps/v1/namespaces/{namespace}/replicasets/{name}
Parameters
-
name (in path): string, required
name of the ReplicaSet
-
namespace (in path): string, required
-
body: ReplicaSet, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (ReplicaSet): OK
201 (ReplicaSet): Created
401: Unauthorized
update
replace status of the specified ReplicaSet
HTTP Request
PUT /apis/apps/v1/namespaces/{namespace}/replicasets/{name}/status
Parameters
-
name (in path): string, required
name of the ReplicaSet
-
namespace (in path): string, required
-
body: ReplicaSet, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (ReplicaSet): OK
201 (ReplicaSet): Created
401: Unauthorized
patch
partially update the specified ReplicaSet
HTTP Request
PATCH /apis/apps/v1/namespaces/{namespace}/replicasets/{name}
Parameters
-
name (in path): string, required
name of the ReplicaSet
-
namespace (in path): string, required
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (ReplicaSet): OK
201 (ReplicaSet): Created
401: Unauthorized
patch
partially update status of the specified ReplicaSet
HTTP Request
PATCH /apis/apps/v1/namespaces/{namespace}/replicasets/{name}/status
Parameters
-
name (in path): string, required
name of the ReplicaSet
-
namespace (in path): string, required
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (ReplicaSet): OK
201 (ReplicaSet): Created
401: Unauthorized
delete
delete a ReplicaSet
HTTP Request
DELETE /apis/apps/v1/namespaces/{namespace}/replicasets/{name}
Parameters
-
name (in path): string, required
name of the ReplicaSet
-
namespace (in path): string, required
-
body: DeleteOptions
-
dryRun (in query): string
-
gracePeriodSeconds (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
Response
200 (Status): OK
202 (Status): Accepted
401: Unauthorized
deletecollection
delete collection of ReplicaSet
HTTP Request
DELETE /apis/apps/v1/namespaces/{namespace}/replicasets
Parameters
-
namespace (in path): string, required
-
body: DeleteOptions
-
continue (in query): string
-
dryRun (in query): string
-
fieldSelector (in query): string
-
gracePeriodSeconds (in query): integer
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
Response
200 (Status): OK
401: Unauthorized
6.5.1.5 - Deployment
apiVersion: apps/v1
import "k8s.io/api/apps/v1"
Deployment
Deployment enables declarative updates for Pods and ReplicaSets.
-
apiVersion: apps/v1
-
kind: Deployment
-
metadata (ObjectMeta)
Standard object's metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
spec (DeploymentSpec)
Specification of the desired behavior of the Deployment.
-
status (DeploymentStatus)
Most recently observed status of the Deployment.
DeploymentSpec
DeploymentSpec is the specification of the desired behavior of the Deployment.
-
selector (LabelSelector), required
Label selector for pods. Existing ReplicaSets whose pods are selected by this will be the ones affected by this deployment. It must match the pod template's labels.
-
template (PodTemplateSpec), required
Template describes the pods that will be created.
-
replicas (int32)
Number of desired pods. This is a pointer to distinguish between explicit zero and not specified. Defaults to 1.
-
minReadySeconds (int32)
Minimum number of seconds for which a newly created pod should be ready without any of its container crashing, for it to be considered available. Defaults to 0 (pod will be considered available as soon as it is ready)
-
strategy (DeploymentStrategy)
Patch strategy: retainKeys
The deployment strategy to use to replace existing pods with new ones.
DeploymentStrategy describes how to replace existing pods with new ones.
-
strategy.type (string)
Type of deployment. Can be "Recreate" or "RollingUpdate". Default is RollingUpdate.
Possible enum values:
"Recreate": kill all existing pods before creating new ones.
"RollingUpdate": replace the old ReplicaSets with a new one using a rolling update, i.e. gradually scale down the old ReplicaSets and scale up the new one.
-
strategy.rollingUpdate (RollingUpdateDeployment)
Rolling update config params. Present only if DeploymentStrategyType = RollingUpdate.
Spec to control the desired behavior of rolling update.
-
strategy.rollingUpdate.maxSurge (IntOrString)
The maximum number of pods that can be scheduled above the desired number of pods. Value can be an absolute number (ex: 5) or a percentage of desired pods (ex: 10%). This cannot be 0 if MaxUnavailable is 0. Absolute number is calculated from percentage by rounding up. Defaults to 25%. Example: when this is set to 30%, the new ReplicaSet can be scaled up immediately when the rolling update starts, such that the total number of old and new pods does not exceed 130% of desired pods. Once old pods have been killed, the new ReplicaSet can be scaled up further, ensuring that the total number of pods running at any time during the update is at most 130% of desired pods.
IntOrString is a type that can hold an int32 or a string. When used in JSON or YAML marshalling and unmarshalling, it produces or consumes the inner type. This allows you to have, for example, a JSON field that can accept a name or number.
-
strategy.rollingUpdate.maxUnavailable (IntOrString)
The maximum number of pods that can be unavailable during the update. Value can be an absolute number (ex: 5) or a percentage of desired pods (ex: 10%). Absolute number is calculated from percentage by rounding down. This cannot be 0 if MaxSurge is 0. Defaults to 25%. Example: when this is set to 30%, the old ReplicaSet can be scaled down to 70% of desired pods immediately when the rolling update starts. Once new pods are ready, the old ReplicaSet can be scaled down further, followed by scaling up the new ReplicaSet, ensuring that the total number of pods available at all times during the update is at least 70% of desired pods. (A client-go sketch setting these strategy fields follows the DeploymentSpec field list.)
IntOrString is a type that can hold an int32 or a string. When used in JSON or YAML marshalling and unmarshalling, it produces or consumes the inner type. This allows you to have, for example, a JSON field that can accept a name or number.
-
-
-
revisionHistoryLimit (int32)
The number of old ReplicaSets to retain to allow rollback. This is a pointer to distinguish between explicit zero and not specified. Defaults to 10.
-
progressDeadlineSeconds (int32)
The maximum time in seconds for a deployment to make progress before it is considered to be failed. The deployment controller will continue to process failed deployments and a condition with a ProgressDeadlineExceeded reason will be surfaced in the deployment status. Note that progress will not be estimated during the time a deployment is paused. Defaults to 600s.
-
paused (boolean)
Indicates that the deployment is paused.
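The sketch referenced above: a Deployment whose rolling-update strategy sets maxSurge as a percentage and maxUnavailable as an absolute number, which is where the IntOrString type matters. Names, labels, the image, and the replica count are placeholders (imports assumed: context, appsv1 "k8s.io/api/apps/v1", corev1 "k8s.io/api/core/v1", metav1 "k8s.io/apimachinery/pkg/apis/meta/v1", "k8s.io/apimachinery/pkg/util/intstr", "k8s.io/client-go/kubernetes"):

func createDeployment(ctx context.Context, clientset *kubernetes.Clientset) error {
	replicas := int32(10)
	maxSurge := intstr.FromString("30%") // IntOrString accepts a percentage string...
	maxUnavailable := intstr.FromInt(0)  // ...or an absolute number
	deploy := &appsv1.Deployment{
		ObjectMeta: metav1.ObjectMeta{Name: "web", Namespace: "default"},
		Spec: appsv1.DeploymentSpec{
			Replicas: &replicas,
			Selector: &metav1.LabelSelector{MatchLabels: map[string]string{"app": "web"}},
			Strategy: appsv1.DeploymentStrategy{
				Type: appsv1.RollingUpdateDeploymentStrategyType,
				// maxUnavailable=0 is allowed here only because maxSurge is nonzero.
				RollingUpdate: &appsv1.RollingUpdateDeployment{
					MaxSurge:       &maxSurge,
					MaxUnavailable: &maxUnavailable,
				},
			},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{"app": "web"}},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{Name: "web", Image: "nginx:1.25"}},
				},
			},
		},
	}
	_, err := clientset.AppsV1().Deployments("default").Create(ctx, deploy, metav1.CreateOptions{})
	return err
}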
DeploymentStatus
DeploymentStatus is the most recently observed status of the Deployment.
-
replicas (int32)
Total number of non-terminated pods targeted by this deployment (their labels match the selector).
-
availableReplicas (int32)
Total number of available pods (ready for at least minReadySeconds) targeted by this deployment.
-
readyReplicas (int32)
readyReplicas is the number of pods targeted by this Deployment with a Ready Condition.
-
unavailableReplicas (int32)
Total number of unavailable pods targeted by this deployment. This is the total number of pods that are still required for the deployment to have 100% available capacity. They may either be pods that are running but not yet available or pods that still have not been created.
-
updatedReplicas (int32)
Total number of non-terminated pods targeted by this deployment that have the desired template spec.
-
collisionCount (int32)
Count of hash collisions for the Deployment. The Deployment controller uses this field as a collision avoidance mechanism when it needs to create the name for the newest ReplicaSet.
-
conditions ([]DeploymentCondition)
Patch strategy: merge on key "type"
Represents the latest available observations of a deployment's current state.
DeploymentCondition describes the state of a deployment at a certain point.
-
conditions.status (string), required
Status of the condition, one of True, False, Unknown.
-
conditions.type (string), required
Type of deployment condition.
-
conditions.lastTransitionTime (Time)
Last time the condition transitioned from one status to another.
Time is a wrapper around time.Time which supports correct marshaling to YAML and JSON. Wrappers are provided for many of the factory methods that the time package offers.
-
conditions.lastUpdateTime (Time)
The last time this condition was updated.
Time is a wrapper around time.Time which supports correct marshaling to YAML and JSON. Wrappers are provided for many of the factory methods that the time package offers.
-
conditions.message (string)
A human readable message indicating details about the transition.
-
conditions.reason (string)
The reason for the condition's last transition.
-
-
observedGeneration (int64)
The generation observed by the deployment controller.
DeploymentList
DeploymentList is a list of Deployments.
-
apiVersion: apps/v1
-
kind: DeploymentList
-
metadata (ListMeta)
Standard list metadata.
-
items ([]Deployment), required
Items is the list of Deployments.
Operations
get
read the specified Deployment
HTTP Request
GET /apis/apps/v1/namespaces/{namespace}/deployments/{name}
Parameters
-
name (in path): string, required
name of the Deployment
-
namespace (in path): string, required
-
pretty (in query): string
Response
200 (Deployment): OK
401: Unauthorized
get
read status of the specified Deployment
HTTP Request
GET /apis/apps/v1/namespaces/{namespace}/deployments/{name}/status
Parameters
-
name (in path): string, required
name of the Deployment
-
namespace (in path): string, required
-
pretty (in query): string
Response
200 (Deployment): OK
401: Unauthorized
list
list or watch objects of kind Deployment
HTTP Request
GET /apis/apps/v1/namespaces/{namespace}/deployments
Parameters
-
namespace (in path): string, required
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (DeploymentList): OK
401: Unauthorized
list
list or watch objects of kind Deployment
HTTP Request
GET /apis/apps/v1/deployments
Parameters
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (DeploymentList): OK
401: Unauthorized
create
create a Deployment
HTTP Request
POST /apis/apps/v1/namespaces/{namespace}/deployments
Parameters
-
namespace (in path): string, required
-
body: Deployment, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (Deployment): OK
201 (Deployment): Created
202 (Deployment): Accepted
401: Unauthorized
update
replace the specified Deployment
HTTP Request
PUT /apis/apps/v1/namespaces/{namespace}/deployments/{name}
Parameters
-
name (in path): string, required
name of the Deployment
-
namespace (in path): string, required
-
body: Deployment, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (Deployment): OK
201 (Deployment): Created
401: Unauthorized
update
replace status of the specified Deployment
HTTP Request
PUT /apis/apps/v1/namespaces/{namespace}/deployments/{name}/status
Parameters
-
name (in path): string, required
name of the Deployment
-
namespace (in path): string, required
-
body: Deployment, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (Deployment): OK
201 (Deployment): Created
401: Unauthorized
patch
partially update the specified Deployment
HTTP Request
PATCH /apis/apps/v1/namespaces/{namespace}/deployments/{name}
Parameters
-
name (in path): string, required
name of the Deployment
-
namespace (in path): string, required
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (Deployment): OK
201 (Deployment): Created
401: Unauthorized
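A strategic merge patch body contains only the fields to change; the server merges it into the stored object. A minimal client-go sketch (the namespace, name, and replica count are placeholders):
package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// scaleViaPatch sends the PATCH shown above with a strategic-merge body;
// only spec.replicas is modified.
func scaleViaPatch(ctx context.Context, cs kubernetes.Interface, ns, name string) error {
	patch := []byte(`{"spec":{"replicas":5}}`)
	_, err := cs.AppsV1().Deployments(ns).Patch(ctx, name, types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	return err
}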
patch
partially update status of the specified Deployment
HTTP Request
PATCH /apis/apps/v1/namespaces/{namespace}/deployments/{name}/status
Parameters
-
name (in path): string, required
name of the Deployment
-
namespace (in path): string, required
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (Deployment): OK
201 (Deployment): Created
401: Unauthorized
delete
delete a Deployment
HTTP Request
DELETE /apis/apps/v1/namespaces/{namespace}/deployments/{name}
Parameters
-
name (in path): string, required
name of the Deployment
-
namespace (in path): string, required
-
body: DeleteOptions
-
dryRun (in query): string
-
gracePeriodSeconds (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
Response
200 (Status): OK
202 (Status): Accepted
401: Unauthorized
deletecollection
delete collection of Deployment
HTTP Request
DELETE /apis/apps/v1/namespaces/{namespace}/deployments
Parameters
-
namespace (in path): string, required
-
body: DeleteOptions
-
continue (in query): string
-
dryRun (in query): string
-
fieldSelector (in query): string
-
gracePeriodSeconds (in query): integer
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
Response
200 (Status): OK
401: Unauthorized
6.5.1.6 - StatefulSet
apiVersion: apps/v1
import "k8s.io/api/apps/v1"
StatefulSet
StatefulSet represents a set of pods with consistent identities. Identities are defined as:
- Network: A single stable DNS and hostname.
- Storage: As many VolumeClaims as requested. The StatefulSet guarantees that a given network identity will always map to the same storage identity.
-
apiVersion: apps/v1
-
kind: StatefulSet
-
metadata (ObjectMeta)
Standard object's metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
spec (StatefulSetSpec)
Spec defines the desired identities of pods in this set.
-
status (StatefulSetStatus)
Status is the current status of Pods in this StatefulSet. This data may be out of date by some window of time.
StatefulSetSpec
A StatefulSetSpec is the specification of a StatefulSet.
-
serviceName (string), required
serviceName is the name of the service that governs this StatefulSet. This service must exist before the StatefulSet, and is responsible for the network identity of the set. Pods get DNS/hostnames that follow the pattern: pod-specific-string.serviceName.default.svc.cluster.local where "pod-specific-string" is managed by the StatefulSet controller.
-
selector (LabelSelector), required
selector is a label query over pods that should match the replica count. It must match the pod template's labels. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#label-selectors
-
template (PodTemplateSpec), required
template is the object that describes the pod that will be created if insufficient replicas are detected. Each pod stamped out by the StatefulSet will fulfill this Template, but have a unique identity from the rest of the StatefulSet.
-
replicas (int32)
replicas is the desired number of replicas of the given Template. These are replicas in the sense that they are instantiations of the same Template, but individual replicas also have a consistent identity. If unspecified, defaults to 1.
-
updateStrategy (StatefulSetUpdateStrategy)
updateStrategy indicates the StatefulSetUpdateStrategy that will be employed to update Pods in the StatefulSet when a revision is made to Template.
StatefulSetUpdateStrategy indicates the strategy that the StatefulSet controller will use to perform updates. It includes any additional parameters necessary to perform the update for the indicated strategy.
-
updateStrategy.type (string)
Type indicates the type of the StatefulSetUpdateStrategy. Default is RollingUpdate.
Possible enum values:
"OnDelete"
triggers the legacy behavior. Version tracking and ordered rolling restarts are disabled. Pods are recreated from the StatefulSetSpec when they are manually deleted. When a scale operation is performed with this strategy, new Pods will be created from the specification version indicated by the StatefulSet's currentRevision.
"RollingUpdate"
indicates that update will be applied to all Pods in the StatefulSet with respect to the StatefulSet ordering constraints. When a scale operation is performed with this strategy, new Pods will be created from the specification version indicated by the StatefulSet's updateRevision.
-
updateStrategy.rollingUpdate (RollingUpdateStatefulSetStrategy)
RollingUpdate is used to communicate parameters when Type is RollingUpdateStatefulSetStrategyType.
RollingUpdateStatefulSetStrategy is used to communicate parameter for RollingUpdateStatefulSetStrategyType.
-
updateStrategy.rollingUpdate.partition (int32)
Partition indicates the ordinal at which the StatefulSet should be partitioned. Default value is 0.
-
-
-
podManagementPolicy (string)
podManagementPolicy controls how pods are created during initial scale up, when replacing pods on nodes, or when scaling down. The default policy is OrderedReady, where pods are created in increasing order (pod-0, then pod-1, etc) and the controller will wait until each pod is ready before continuing. When scaling down, the pods are removed in the opposite order. The alternative policy is Parallel, which will create pods in parallel to match the desired scale without waiting, and on scale down will delete all pods at once.
Possible enum values:
"OrderedReady"
will create pods in strictly increasing order on scale up and strictly decreasing order on scale down, progressing only when the previous pod is ready or terminated. At most one pod will be changed at any time.
"Parallel"
will create and delete pods as soon as the stateful set replica count is changed, and will not wait for pods to be ready or complete termination.
-
revisionHistoryLimit (int32)
revisionHistoryLimit is the maximum number of revisions that will be maintained in the StatefulSet's revision history. The revision history consists of all revisions not represented by a currently applied StatefulSetSpec version. The default value is 10.
-
volumeClaimTemplates ([]PersistentVolumeClaim)
volumeClaimTemplates is a list of claims that pods are allowed to reference. The StatefulSet controller is responsible for mapping network identities to claims in a way that maintains the identity of a pod. Every claim in this list must have at least one matching (by name) volumeMount in one container in the template. A claim in this list takes precedence over any volumes in the template with the same name.
-
minReadySeconds (int32)
Minimum number of seconds for which a newly created pod should be ready without any of its containers crashing for it to be considered available. Defaults to 0 (pod will be considered available as soon as it is ready). This is an alpha field and requires enabling the StatefulSetMinReadySeconds feature gate.
-
persistentVolumeClaimRetentionPolicy (StatefulSetPersistentVolumeClaimRetentionPolicy)
persistentVolumeClaimRetentionPolicy describes the lifecycle of persistent volume claims created from volumeClaimTemplates. By default, all persistent volume claims are created as needed and retained until manually deleted. This policy allows the lifecycle to be altered, for example by deleting persistent volume claims when their stateful set is deleted, or when their pod is scaled down. This requires the StatefulSetAutoDeletePVC feature gate to be enabled, which is alpha.
StatefulSetPersistentVolumeClaimRetentionPolicy describes the policy used for PVCs created from the StatefulSet VolumeClaimTemplates.
-
persistentVolumeClaimRetentionPolicy.whenDeleted (string)
WhenDeleted specifies what happens to PVCs created from StatefulSet VolumeClaimTemplates when the StatefulSet is deleted. The default policy of Retain causes PVCs to not be affected by StatefulSet deletion. The Delete policy causes those PVCs to be deleted.
-
persistentVolumeClaimRetentionPolicy.whenScaled (string)
WhenScaled specifies what happens to PVCs created from StatefulSet VolumeClaimTemplates when the StatefulSet is scaled down. The default policy of Retain causes PVCs to not be affected by a scaledown. The Delete policy causes the associated PVCs for any excess pods above the replica count to be deleted.
-
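To make the required spec fields concrete, here is a sketch that assembles a minimal StatefulSet from the fields above: serviceName, selector, and template (replicas defaults to 1 when omitted). Every name, label, and image is a placeholder:
package sketch

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// minimalStatefulSet wires together the three required spec fields.
func minimalStatefulSet() *appsv1.StatefulSet {
	labels := map[string]string{"app": "web"}
	return &appsv1.StatefulSet{
		ObjectMeta: metav1.ObjectMeta{Name: "web"},
		Spec: appsv1.StatefulSetSpec{
			ServiceName: "web-headless", // must exist before the StatefulSet
			Selector:    &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels}, // must match the selector
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{Name: "web", Image: "nginx:1.21"}},
				},
			},
		},
	}
}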
StatefulSetStatus
StatefulSetStatus represents the current state of a StatefulSet.
-
replicas (int32), required
replicas is the number of Pods created by the StatefulSet controller.
-
readyReplicas (int32)
readyReplicas is the number of pods created for this StatefulSet with a Ready Condition.
-
currentReplicas (int32)
currentReplicas is the number of Pods created by the StatefulSet controller from the StatefulSet version indicated by currentRevision.
-
updatedReplicas (int32)
updatedReplicas is the number of Pods created by the StatefulSet controller from the StatefulSet version indicated by updateRevision.
-
availableReplicas (int32), required
Total number of available pods (ready for at least minReadySeconds) targeted by this statefulset. This is a beta field and enabled/disabled by StatefulSetMinReadySeconds feature gate.
-
collisionCount (int32)
collisionCount is the count of hash collisions for the StatefulSet. The StatefulSet controller uses this field as a collision avoidance mechanism when it needs to create the name for the newest ControllerRevision.
-
conditions ([]StatefulSetCondition)
Patch strategy: merge on key type
Represents the latest available observations of a statefulset's current state.
StatefulSetCondition describes the state of a statefulset at a certain point.
-
conditions.status (string), required
Status of the condition, one of True, False, Unknown.
-
conditions.type (string), required
Type of statefulset condition.
-
conditions.lastTransitionTime (Time)
Last time the condition transitioned from one status to another.
Time is a wrapper around time.Time which supports correct marshaling to YAML and JSON. Wrappers are provided for many of the factory methods that the time package offers.
-
conditions.message (string)
A human readable message indicating details about the transition.
-
conditions.reason (string)
The reason for the condition's last transition.
-
-
currentRevision (string)
currentRevision, if not empty, indicates the version of the StatefulSet used to generate Pods in the sequence [0,currentReplicas).
-
updateRevision (string)
updateRevision, if not empty, indicates the version of the StatefulSet used to generate Pods in the sequence [replicas-updatedReplicas,replicas)
-
observedGeneration (int64)
observedGeneration is the most recent generation observed for this StatefulSet. It corresponds to the StatefulSet's generation, which is updated on mutation by the API Server.
StatefulSetList
StatefulSetList is a collection of StatefulSets.
-
apiVersion: apps/v1
-
kind: StatefulSetList
-
metadata (ListMeta)
Standard list's metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
items ([]StatefulSet), required
Items is the list of stateful sets.
Operations
get
read the specified StatefulSet
HTTP Request
GET /apis/apps/v1/namespaces/{namespace}/statefulsets/{name}
Parameters
-
name (in path): string, required
name of the StatefulSet
-
namespace (in path): string, required
-
pretty (in query): string
Response
200 (StatefulSet): OK
401: Unauthorized
get
read status of the specified StatefulSet
HTTP Request
GET /apis/apps/v1/namespaces/{namespace}/statefulsets/{name}/status
Parameters
-
name (in path): string, required
name of the StatefulSet
-
namespace (in path): string, required
-
pretty (in query): string
Response
200 (StatefulSet): OK
401: Unauthorized
list
list or watch objects of kind StatefulSet
HTTP Request
GET /apis/apps/v1/namespaces/{namespace}/statefulsets
Parameters
-
namespace (in path): string, required
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (StatefulSetList): OK
401: Unauthorized
list
list or watch objects of kind StatefulSet
HTTP Request
GET /apis/apps/v1/statefulsets
Parameters
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (StatefulSetList): OK
401: Unauthorized
create
create a StatefulSet
HTTP Request
POST /apis/apps/v1/namespaces/{namespace}/statefulsets
Parameters
-
namespace (in path): string, required
-
body: StatefulSet, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (StatefulSet): OK
201 (StatefulSet): Created
202 (StatefulSet): Accepted
401: Unauthorized
update
replace the specified StatefulSet
HTTP Request
PUT /apis/apps/v1/namespaces/{namespace}/statefulsets/{name}
Parameters
-
name (in path): string, required
name of the StatefulSet
-
namespace (in path): string, required
-
body: StatefulSet, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (StatefulSet): OK
201 (StatefulSet): Created
401: Unauthorized
update
replace status of the specified StatefulSet
HTTP Request
PUT /apis/apps/v1/namespaces/{namespace}/statefulsets/{name}/status
Parameters
-
name (in path): string, required
name of the StatefulSet
-
namespace (in path): string, required
-
body: StatefulSet, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (StatefulSet): OK
201 (StatefulSet): Created
401: Unauthorized
patch
partially update the specified StatefulSet
HTTP Request
PATCH /apis/apps/v1/namespaces/{namespace}/statefulsets/{name}
Parameters
-
name (in path): string, required
name of the StatefulSet
-
namespace (in path): string, required
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (StatefulSet): OK
201 (StatefulSet): Created
401: Unauthorized
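Patching spec.updateStrategy.rollingUpdate.partition is the usual way to stage a partitioned rolling update: only pods with an ordinal greater than or equal to the partition value are updated. A hedged client-go sketch (namespace and name are placeholders):
package sketch

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// setPartition patches the rolling-update partition, a common way to
// canary a new revision across a StatefulSet one ordinal at a time.
func setPartition(ctx context.Context, cs kubernetes.Interface, ns, name string, ordinal int) error {
	patch := []byte(fmt.Sprintf(`{"spec":{"updateStrategy":{"rollingUpdate":{"partition":%d}}}}`, ordinal))
	_, err := cs.AppsV1().StatefulSets(ns).Patch(ctx, name, types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	return err
}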
patch
partially update status of the specified StatefulSet
HTTP Request
PATCH /apis/apps/v1/namespaces/{namespace}/statefulsets/{name}/status
Parameters
-
name (in path): string, required
name of the StatefulSet
-
namespace (in path): string, required
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (StatefulSet): OK
201 (StatefulSet): Created
401: Unauthorized
delete
delete a StatefulSet
HTTP Request
DELETE /apis/apps/v1/namespaces/{namespace}/statefulsets/{name}
Parameters
-
name (in path): string, required
name of the StatefulSet
-
namespace (in path): string, required
-
body: DeleteOptions
-
dryRun (in query): string
-
gracePeriodSeconds (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
Response
200 (Status): OK
202 (Status): Accepted
401: Unauthorized
deletecollection
delete collection of StatefulSet
HTTP Request
DELETE /apis/apps/v1/namespaces/{namespace}/statefulsets
Parameters
-
namespace (in path): string, required
-
body: DeleteOptions
-
continue (in query): string
-
dryRun (in query): string
-
fieldSelector (in query): string
-
gracePeriodSeconds (in query): integer
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
Response
200 (Status): OK
401: Unauthorized
6.5.1.7 - ControllerRevision
apiVersion: apps/v1
import "k8s.io/api/apps/v1"
ControllerRevision
ControllerRevision implements an immutable snapshot of state data. Clients are responsible for serializing and deserializing the objects that contain their internal state. Once a ControllerRevision has been successfully created, it can not be updated. The API Server will fail validation of all requests that attempt to mutate the Data field. ControllerRevisions may, however, be deleted. Note that, due to its use by both the DaemonSet and StatefulSet controllers for update and rollback, this object is beta. However, it may be subject to name and representation changes in future releases, and clients should not depend on its stability. It is primarily for internal use by controllers.
-
apiVersion: apps/v1
-
kind: ControllerRevision
-
metadata (ObjectMeta)
Standard object's metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
revision (int64), required
Revision indicates the revision of the state represented by Data.
-
data (RawExtension)
Data is the serialized representation of the state.
RawExtension is used to hold extensions in external versions.
To use this, make a field which has RawExtension as its type in your external, versioned struct, and Object in your internal struct. You also need to register your various plugin types.

// Internal package:
type MyAPIObject struct {
	runtime.TypeMeta `json:",inline"`
	MyPlugin runtime.Object `json:"myPlugin"`
}
type PluginA struct {
	AOption string `json:"aOption"`
}

// External package:
type MyAPIObject struct {
	runtime.TypeMeta `json:",inline"`
	MyPlugin runtime.RawExtension `json:"myPlugin"`
}
type PluginA struct {
	AOption string `json:"aOption"`
}

// On the wire, the JSON will look something like this:
{"kind": "MyAPIObject", "apiVersion": "v1", "myPlugin": {"kind": "PluginA", "aOption": "foo"}}

So what happens? Decode first uses json or yaml to unmarshal the serialized data into your external MyAPIObject. That causes the raw JSON to be stored, but not unpacked. The next step is to copy (using pkg/conversion) into the internal struct. The runtime package's DefaultScheme has conversion functions installed which will unpack the JSON stored in RawExtension, turning it into the correct object type, and storing it in the Object. (TODO: In the case where the object is of an unknown type, a runtime.Unknown object will be created and stored.)
ControllerRevisionList
ControllerRevisionList is a resource containing a list of ControllerRevision objects.
-
apiVersion: apps/v1
-
kind: ControllerRevisionList
-
metadata (ListMeta)
More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
items ([]ControllerRevision), required
Items is the list of ControllerRevisions
Operations
get
read the specified ControllerRevision
HTTP Request
GET /apis/apps/v1/namespaces/{namespace}/controllerrevisions/{name}
Parameters
-
name (in path): string, required
name of the ControllerRevision
-
namespace (in path): string, required
-
pretty (in query): string
Response
200 (ControllerRevision): OK
401: Unauthorized
list
list or watch objects of kind ControllerRevision
HTTP Request
GET /apis/apps/v1/namespaces/{namespace}/controllerrevisions
Parameters
-
namespace (in path): string, required
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (ControllerRevisionList): OK
401: Unauthorized
list
list or watch objects of kind ControllerRevision
HTTP Request
GET /apis/apps/v1/controllerrevisions
Parameters
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (ControllerRevisionList): OK
401: Unauthorized
create
create a ControllerRevision
HTTP Request
POST /apis/apps/v1/namespaces/{namespace}/controllerrevisions
Parameters
-
namespace (in path): string, required
-
body: ControllerRevision, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (ControllerRevision): OK
201 (ControllerRevision): Created
202 (ControllerRevision): Accepted
401: Unauthorized
update
replace the specified ControllerRevision
HTTP Request
PUT /apis/apps/v1/namespaces/{namespace}/controllerrevisions/{name}
Parameters
-
name (in path): string, required
name of the ControllerRevision
-
namespace (in path): string, required
-
body: ControllerRevision, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (ControllerRevision): OK
201 (ControllerRevision): Created
401: Unauthorized
patch
partially update the specified ControllerRevision
HTTP Request
PATCH /apis/apps/v1/namespaces/{namespace}/controllerrevisions/{name}
Parameters
-
name (in path): string, required
name of the ControllerRevision
-
namespace (in path): string, required
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (ControllerRevision): OK
201 (ControllerRevision): Created
401: Unauthorized
delete
delete a ControllerRevision
HTTP Request
DELETE /apis/apps/v1/namespaces/{namespace}/controllerrevisions/{name}
Parameters
-
name (in path): string, required
name of the ControllerRevision
-
namespace (in path): string, required
-
body: DeleteOptions
-
dryRun (in query): string
-
gracePeriodSeconds (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
Response
200 (Status): OK
202 (Status): Accepted
401: Unauthorized
deletecollection
delete collection of ControllerRevision
HTTP Request
DELETE /apis/apps/v1/namespaces/{namespace}/controllerrevisions
Parameters
-
namespace (in path): string, required
-
body: DeleteOptions
-
continue (in query): string
-
dryRun (in query): string
-
fieldSelector (in query): string
-
gracePeriodSeconds (in query): integer
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
Response
200 (Status): OK
401: Unauthorized
6.5.1.8 - DaemonSet
apiVersion: apps/v1
import "k8s.io/api/apps/v1"
DaemonSet
DaemonSet represents the configuration of a daemon set.
-
apiVersion: apps/v1
-
kind: DaemonSet
-
metadata (ObjectMeta)
Standard object's metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
spec (DaemonSetSpec)
The desired behavior of this daemon set. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#spec-and-status
-
status (DaemonSetStatus)
The current status of this daemon set. This data may be out of date by some window of time. Populated by the system. Read-only. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#spec-and-status
DaemonSetSpec
DaemonSetSpec is the specification of a daemon set.
-
selector (LabelSelector), required
A label query over pods that are managed by the daemon set. Must match in order to be controlled. It must match the pod template's labels. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#label-selectors
-
template (PodTemplateSpec), required
An object that describes the pod that will be created. The DaemonSet will create exactly one copy of this pod on every node that matches the template's node selector (or on every node if no node selector is specified). More info: https://kubernetes.io/docs/concepts/workloads/controllers/replicationcontroller#pod-template
-
minReadySeconds (int32)
The minimum number of seconds for which a newly created DaemonSet pod should be ready without any of its containers crashing, for it to be considered available. Defaults to 0 (pod will be considered available as soon as it is ready).
-
updateStrategy (DaemonSetUpdateStrategy)
An update strategy to replace existing DaemonSet pods with new pods.
DaemonSetUpdateStrategy is a struct used to control the update strategy for a DaemonSet.
-
updateStrategy.type (string)
Type of daemon set update. Can be "RollingUpdate" or "OnDelete". Default is RollingUpdate.
Possible enum values:
"OnDelete"
Replace the old daemons only when it's killed"RollingUpdate"
Replace the old daemons by new ones using rolling update i.e replace them on each node one after the other.
-
updateStrategy.rollingUpdate (RollingUpdateDaemonSet)
Rolling update config params. Present only if type = "RollingUpdate".
Spec to control the desired behavior of daemon set rolling update.
-
updateStrategy.rollingUpdate.maxSurge (IntOrString)
The maximum number of nodes with an existing available DaemonSet pod that can have an updated DaemonSet pod during an update. Value can be an absolute number (ex: 5) or a percentage of desired pods (ex: 10%). This cannot be 0 if MaxUnavailable is 0. Absolute number is calculated from percentage by rounding up to a minimum of 1. Default value is 0. Example: when this is set to 30%, at most 30% of the total number of nodes that should be running the daemon pod (i.e. status.desiredNumberScheduled) can have a new pod created before the old pod is marked as deleted. The update starts by launching new pods on 30% of nodes. Once an updated pod is available (Ready for at least minReadySeconds) the old DaemonSet pod on that node is marked deleted. If the old pod becomes unavailable for any reason (Ready transitions to false, is evicted, or is drained) an updated pod is immediately created on that node without considering surge limits. Allowing surge implies the possibility that the resources consumed by the daemonset on any given node can double if the readiness check fails, and so resource intensive daemonsets should take into account that they may cause evictions during disruption. This is a beta field and enabled/disabled by the DaemonSetUpdateSurge feature gate.
IntOrString is a type that can hold an int32 or a string. When used in JSON or YAML marshalling and unmarshalling, it produces or consumes the inner type. This allows you to have, for example, a JSON field that can accept a name or number.
-
updateStrategy.rollingUpdate.maxUnavailable (IntOrString)
The maximum number of DaemonSet pods that can be unavailable during the update. Value can be an absolute number (ex: 5) or a percentage of total number of DaemonSet pods at the start of the update (ex: 10%). Absolute number is calculated from percentage by rounding up. This cannot be 0 if MaxSurge is 0. Default value is 1. Example: when this is set to 30%, at most 30% of the total number of nodes that should be running the daemon pod (i.e. status.desiredNumberScheduled) can have their pods stopped for an update at any given time. The update starts by stopping at most 30% of those DaemonSet pods and then brings up new DaemonSet pods in their place. Once the new pods are available, it then proceeds onto other DaemonSet pods, thus ensuring that at least 70% of original number of DaemonSet pods are available at all times during the update.
IntOrString is a type that can hold an int32 or a string. When used in JSON or YAML marshalling and unmarshalling, it produces or consumes the inner type. This allows you to have, for example, a JSON field that can accept a name or number.
-
-
-
revisionHistoryLimit (int32)
The number of old history entries (ControllerRevisions) to retain to allow rollback. This is a pointer to distinguish between explicit zero and not specified. Defaults to 10.
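Since maxSurge and maxUnavailable are IntOrString values, Go callers build them with the intstr helpers. A sketch of the rolling update strategy described above, expressed as a percentage (the 10% figure is arbitrary):
package sketch

import (
	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// rollingStrategy builds a DaemonSet RollingUpdate strategy with
// maxUnavailable given as a percentage of desired pods.
func rollingStrategy() appsv1.DaemonSetUpdateStrategy {
	maxUnavailable := intstr.FromString("10%") // or intstr.FromInt(1) for an absolute count
	return appsv1.DaemonSetUpdateStrategy{
		Type: appsv1.RollingUpdateDaemonSetStrategyType,
		RollingUpdate: &appsv1.RollingUpdateDaemonSet{
			MaxUnavailable: &maxUnavailable,
		},
	}
}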
DaemonSetStatus
DaemonSetStatus represents the current status of a daemon set.
-
numberReady (int32), required
numberReady is the number of nodes that should be running the daemon pod and have one or more of the daemon pod running with a Ready Condition.
-
numberAvailable (int32)
The number of nodes that should be running the daemon pod and have one or more of the daemon pod running and available (ready for at least spec.minReadySeconds)
-
numberUnavailable (int32)
The number of nodes that should be running the daemon pod and have none of the daemon pod running and available (ready for at least spec.minReadySeconds)
-
numberMisscheduled (int32), required
The number of nodes that are running the daemon pod, but are not supposed to run the daemon pod. More info: https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/
-
desiredNumberScheduled (int32), required
The total number of nodes that should be running the daemon pod (including nodes correctly running the daemon pod). More info: https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/
-
currentNumberScheduled (int32), required
The number of nodes that are running at least 1 daemon pod and are supposed to run the daemon pod. More info: https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/
-
updatedNumberScheduled (int32)
The total number of nodes that are running the updated daemon pod.
-
collisionCount (int32)
Count of hash collisions for the DaemonSet. The DaemonSet controller uses this field as a collision avoidance mechanism when it needs to create the name for the newest ControllerRevision.
-
conditions ([]DaemonSetCondition)
Patch strategy: merge on key type
Represents the latest available observations of a DaemonSet's current state.
DaemonSetCondition describes the state of a DaemonSet at a certain point.
-
conditions.status (string), required
Status of the condition, one of True, False, Unknown.
-
conditions.type (string), required
Type of DaemonSet condition.
-
conditions.lastTransitionTime (Time)
Last time the condition transitioned from one status to another.
Time is a wrapper around time.Time which supports correct marshaling to YAML and JSON. Wrappers are provided for many of the factory methods that the time package offers.
-
conditions.message (string)
A human readable message indicating details about the transition.
-
conditions.reason (string)
The reason for the condition's last transition.
-
-
observedGeneration (int64)
The most recent generation observed by the daemon set controller.
DaemonSetList
DaemonSetList is a collection of daemon sets.
-
apiVersion: apps/v1
-
kind: DaemonSetList
-
metadata (ListMeta)
Standard list metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
items ([]DaemonSet), required
A list of daemon sets.
Operations
get
read the specified DaemonSet
HTTP Request
GET /apis/apps/v1/namespaces/{namespace}/daemonsets/{name}
Parameters
-
name (in path): string, required
name of the DaemonSet
-
namespace (in path): string, required
-
pretty (in query): string
Response
200 (DaemonSet): OK
401: Unauthorized
get
read status of the specified DaemonSet
HTTP Request
GET /apis/apps/v1/namespaces/{namespace}/daemonsets/{name}/status
Parameters
-
name (in path): string, required
name of the DaemonSet
-
namespace (in path): string, required
-
pretty (in query): string
Response
200 (DaemonSet): OK
401: Unauthorized
list
list or watch objects of kind DaemonSet
HTTP Request
GET /apis/apps/v1/namespaces/{namespace}/daemonsets
Parameters
-
namespace (in path): string, required
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (DaemonSetList): OK
401: Unauthorized
list
list or watch objects of kind DaemonSet
HTTP Request
GET /apis/apps/v1/daemonsets
Parameters
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (DaemonSetList): OK
401: Unauthorized
create
create a DaemonSet
HTTP Request
POST /apis/apps/v1/namespaces/{namespace}/daemonsets
Parameters
-
namespace (in path): string, required
-
body: DaemonSet, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (DaemonSet): OK
201 (DaemonSet): Created
202 (DaemonSet): Accepted
401: Unauthorized
update
replace the specified DaemonSet
HTTP Request
PUT /apis/apps/v1/namespaces/{namespace}/daemonsets/{name}
Parameters
-
name (in path): string, required
name of the DaemonSet
-
namespace (in path): string, required
-
body: DaemonSet, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (DaemonSet): OK
201 (DaemonSet): Created
401: Unauthorized
update
replace status of the specified DaemonSet
HTTP Request
PUT /apis/apps/v1/namespaces/{namespace}/daemonsets/{name}/status
Parameters
-
name (in path): string, required
name of the DaemonSet
-
namespace (in path): string, required
-
body: DaemonSet, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (DaemonSet): OK
201 (DaemonSet): Created
401: Unauthorized
patch
partially update the specified DaemonSet
HTTP Request
PATCH /apis/apps/v1/namespaces/{namespace}/daemonsets/{name}
Parameters
-
name (in path): string, required
name of the DaemonSet
-
namespace (in path): string, required
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (DaemonSet): OK
201 (DaemonSet): Created
401: Unauthorized
patch
partially update status of the specified DaemonSet
HTTP Request
PATCH /apis/apps/v1/namespaces/{namespace}/daemonsets/{name}/status
Parameters
-
name (in path): string, required
name of the DaemonSet
-
namespace (in path): string, required
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (DaemonSet): OK
201 (DaemonSet): Created
401: Unauthorized
delete
delete a DaemonSet
HTTP Request
DELETE /apis/apps/v1/namespaces/{namespace}/daemonsets/{name}
Parameters
-
name (in path): string, required
name of the DaemonSet
-
namespace (in path): string, required
-
body: DeleteOptions
-
dryRun (in query): string
-
gracePeriodSeconds (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
Response
200 (Status): OK
202 (Status): Accepted
401: Unauthorized
deletecollection
delete collection of DaemonSet
HTTP Request
DELETE /apis/apps/v1/namespaces/{namespace}/daemonsets
Parameters
-
namespace (in path): string, required
-
body: DeleteOptions
-
continue (in query): string
-
dryRun (in query): string
-
fieldSelector (in query): string
-
gracePeriodSeconds (in query): integer
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
Response
200 (Status): OK
401: Unauthorized
6.5.1.9 - Job
apiVersion: batch/v1
import "k8s.io/api/batch/v1"
Job
Job represents the configuration of a single job.
-
apiVersion: batch/v1
-
kind: Job
-
metadata (ObjectMeta)
Standard object's metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
spec (JobSpec)
Specification of the desired behavior of a job. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#spec-and-status
-
status (JobStatus)
Current status of a job. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#spec-and-status
JobSpec
JobSpec describes how the job execution will look.
Replicas
-
template (PodTemplateSpec), required
Describes the pod that will be created when executing a job. More info: https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/
-
parallelism (int32)
Specifies the maximum desired number of pods the job should run at any given time. The actual number of pods running in steady state will be less than this number when ((.spec.completions - .status.successful) < .spec.parallelism), i.e. when the work left to do is less than max parallelism. More info: https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/
Lifecycle
-
completions (int32)
Specifies the desired number of successfully finished pods the job should be run with. Setting to nil means that the success of any pod signals the success of all pods, and allows parallelism to have any positive value. Setting to 1 means that parallelism is limited to 1 and the success of that pod signals the success of the job. More info: https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/
-
completionMode (string)
CompletionMode specifies how Pod completions are tracked. It can be NonIndexed (default) or Indexed.
NonIndexed means that the Job is considered complete when there have been .spec.completions successfully completed Pods. Each Pod completion is homologous to each other.
Indexed means that the Pods of a Job get an associated completion index from 0 to (.spec.completions - 1), available in the annotation batch.kubernetes.io/job-completion-index. The Job is considered complete when there is one successfully completed Pod for each index. When the value is Indexed, .spec.completions must be specified and .spec.parallelism must be less than or equal to 10^5. In addition, the Pod name takes the form $(job-name)-$(index)-$(random-string), and the Pod hostname takes the form $(job-name)-$(index). A minimal Indexed Job sketch follows this field list.
This field is beta-level. More completion modes can be added in the future. If the Job controller observes a mode that it doesn't recognize, the controller skips updates for the Job.
-
backoffLimit (int32)
Specifies the number of retries before marking this job failed. Defaults to 6
-
activeDeadlineSeconds (int64)
Specifies the duration in seconds relative to the startTime that the job may be continuously active before the system tries to terminate it; the value must be a positive integer. If a Job is suspended (at creation or through an update), this timer is effectively stopped and reset when the Job is resumed again.
-
ttlSecondsAfterFinished (int32)
ttlSecondsAfterFinished limits the lifetime of a Job that has finished execution (either Complete or Failed). If this field is set, the Job becomes eligible to be automatically deleted ttlSecondsAfterFinished seconds after it finishes. When the Job is being deleted, its lifecycle guarantees (e.g. finalizers) will be honored. If this field is unset, the Job won't be automatically deleted. If this field is set to zero, the Job becomes eligible to be deleted immediately after it finishes.
-
suspend (boolean)
Suspend specifies whether the Job controller should create Pods or not. If a Job is created with suspend set to true, no Pods are created by the Job controller. If a Job is suspended after creation (i.e. the flag goes from false to true), the Job controller will delete all active Pods associated with this Job. Users must design their workload to gracefully handle this. Suspending a Job will reset the StartTime field of the Job, effectively resetting the ActiveDeadlineSeconds timer too. Defaults to false.
This field is beta-level, gated by SuspendJob feature flag (enabled by default).
Selector
-
selector (LabelSelector)
A label query over pods that should match the pod count. Normally, the system sets this field for you. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#label-selectors
-
manualSelector (boolean)
manualSelector controls generation of pod labels and pod selectors. Leave manualSelector unset unless you are certain what you are doing. When false or unset, the system picks labels unique to this job and appends those labels to the pod template. When true, the user is responsible for picking unique labels and specifying the selector. Failure to pick a unique label may cause this and other jobs to not function correctly. However, you may see manualSelector=true in jobs that were created with the old extensions/v1beta1 API. More info: https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#specifying-your-own-pod-selector
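The Indexed Job sketch referenced under completionMode: it wires completions, parallelism, and completionMode together. All names, counts, and the image are placeholders:
package sketch

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// indexedJob sketches Indexed completion mode: five completion indexes
// (0..4) processed at most three at a time.
func indexedJob() *batchv1.Job {
	completions := int32(5)
	parallelism := int32(3)
	mode := batchv1.IndexedCompletion
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "indexed-demo"},
		Spec: batchv1.JobSpec{
			Completions:    &completions,
			Parallelism:    &parallelism,
			CompletionMode: &mode,
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever, // Job pods require Never or OnFailure
					Containers:    []corev1.Container{{Name: "worker", Image: "busybox:1.35"}},
				},
			},
		},
	}
}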
JobStatus
JobStatus represents the current state of a Job.
-
startTime (Time)
Represents time when the job controller started processing a job. When a Job is created in the suspended state, this field is not set until the first time it is resumed. This field is reset every time a Job is resumed from suspension. It is represented in RFC3339 form and is in UTC.
Time is a wrapper around time.Time which supports correct marshaling to YAML and JSON. Wrappers are provided for many of the factory methods that the time package offers.
-
completionTime (Time)
Represents time when the job was completed. It is not guaranteed to be set in happens-before order across separate operations. It is represented in RFC3339 form and is in UTC. The completion time is only set when the job finishes successfully.
Time is a wrapper around time.Time which supports correct marshaling to YAML and JSON. Wrappers are provided for many of the factory methods that the time package offers.
-
active (int32)
The number of pending and running pods.
-
failed (int32)
The number of pods which reached phase Failed.
-
succeeded (int32)
The number of pods which reached phase Succeeded.
-
completedIndexes (string)
CompletedIndexes holds the completed indexes when .spec.completionMode = "Indexed" in a text format. The indexes are represented as decimal integers separated by commas. The numbers are listed in increasing order. Three or more consecutive numbers are compressed and represented by the first and last element of the series, separated by a hyphen. For example, if the completed indexes are 1, 3, 4, 5 and 7, they are represented as "1,3-5,7".
-
conditions ([]JobCondition)
Patch strategy: merge on key type
Atomic: will be replaced during a merge
The latest available observations of an object's current state. When a Job fails, one of the conditions will have type "Failed" and status true. When a Job is suspended, one of the conditions will have type "Suspended" and status true; when the Job is resumed, the status of this condition will become false. When a Job is completed, one of the conditions will have type "Complete" and status true. More info: https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/
JobCondition describes current state of a job.
-
conditions.status (string), required
Status of the condition, one of True, False, Unknown.
-
conditions.type (string), required
Type of job condition, Complete or Failed.
Possible enum values:
"Complete"
means the job has completed its execution."Failed"
means the job has failed its execution."Suspended"
means the job has been suspended.
-
conditions.lastProbeTime (Time)
Last time the condition was checked.
Time is a wrapper around time.Time which supports correct marshaling to YAML and JSON. Wrappers are provided for many of the factory methods that the time package offers.
-
conditions.lastTransitionTime (Time)
Last time the condition transitioned from one status to another.
Time is a wrapper around time.Time which supports correct marshaling to YAML and JSON. Wrappers are provided for many of the factory methods that the time package offers.
-
conditions.message (string)
Human readable message indicating details about last transition.
-
conditions.reason (string)
(brief) reason for the condition's last transition.
-
-
uncountedTerminatedPods (UncountedTerminatedPods)
UncountedTerminatedPods holds the UIDs of Pods that have terminated but the job controller hasn't yet accounted for in the status counters.
The job controller creates pods with a finalizer. When a pod terminates (succeeded or failed), the controller does three steps to account for it in the job status: (1) Add the pod UID to the arrays in this field. (2) Remove the pod finalizer. (3) Remove the pod UID from the arrays while increasing the corresponding counter.
This field is beta-level. The job controller only makes use of this field when the feature gate JobTrackingWithFinalizers is enabled (enabled by default). Old jobs might not be tracked using this field, in which case the field remains null.
UncountedTerminatedPods holds UIDs of Pods that have terminated but haven't been accounted in Job status counters.
-
uncountedTerminatedPods.failed ([]string)
Set: unique values will be kept during a merge
Failed holds UIDs of failed Pods.
-
uncountedTerminatedPods.succeeded ([]string)
Set: unique values will be kept during a merge
Succeeded holds UIDs of succeeded Pods.
-
Alpha level
-
ready (int32)
The number of pods which have a Ready condition.
This field is alpha-level. The job controller populates the field when the feature gate JobReadyPods is enabled (disabled by default).
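The completedIndexes encoding described above ("1,3-5,7") is straightforward to expand client-side. A small self-contained sketch:
package sketch

import (
	"strconv"
	"strings"
)

// parseCompletedIndexes expands the compressed completedIndexes format,
// e.g. "1,3-5,7" -> [1 3 4 5 7].
func parseCompletedIndexes(s string) ([]int, error) {
	var out []int
	if s == "" {
		return out, nil
	}
	for _, part := range strings.Split(s, ",") {
		if first, last, found := strings.Cut(part, "-"); found {
			a, err := strconv.Atoi(first)
			if err != nil {
				return nil, err
			}
			b, err := strconv.Atoi(last)
			if err != nil {
				return nil, err
			}
			for i := a; i <= b; i++ { // a range like "3-5" expands to 3, 4, 5
				out = append(out, i)
			}
		} else {
			n, err := strconv.Atoi(part)
			if err != nil {
				return nil, err
			}
			out = append(out, n)
		}
	}
	return out, nil
}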
JobList
JobList is a collection of jobs.
-
apiVersion: batch/v1
-
kind: JobList
-
metadata (ListMeta)
Standard list metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
items ([]Job), required
items is the list of Jobs.
Operations
get
read the specified Job
HTTP Request
GET /apis/batch/v1/namespaces/{namespace}/jobs/{name}
Parameters
-
name (in path): string, required
name of the Job
-
namespace (in path): string, required
-
pretty (in query): string
Response
200 (Job): OK
401: Unauthorized
get
read status of the specified Job
HTTP Request
GET /apis/batch/v1/namespaces/{namespace}/jobs/{name}/status
Parameters
-
name (in path): string, required
name of the Job
-
namespace (in path): string, required
-
pretty (in query): string
Response
200 (Job): OK
401: Unauthorized
list
list or watch objects of kind Job
HTTP Request
GET /apis/batch/v1/namespaces/{namespace}/jobs
Parameters
-
namespace (in path): string, required
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (JobList): OK
401: Unauthorized
list
list or watch objects of kind Job
HTTP Request
GET /apis/batch/v1/jobs
Parameters
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (JobList): OK
401: Unauthorized
create
create a Job
HTTP Request
POST /apis/batch/v1/namespaces/{namespace}/jobs
Parameters
-
namespace (in path): string, required
-
body: Job, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (Job): OK
201 (Job): Created
202 (Job): Accepted
401: Unauthorized
update
replace the specified Job
HTTP Request
PUT /apis/batch/v1/namespaces/{namespace}/jobs/{name}
Parameters
-
name (in path): string, required
name of the Job
-
namespace (in path): string, required
-
body: Job, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (Job): OK
201 (Job): Created
401: Unauthorized
update
replace status of the specified Job
HTTP Request
PUT /apis/batch/v1/namespaces/{namespace}/jobs/{name}/status
Parameters
-
name (in path): string, required
name of the Job
-
namespace (in path): string, required
-
body: Job, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (Job): OK
201 (Job): Created
401: Unauthorized
patch
partially update the specified Job
HTTP Request
PATCH /apis/batch/v1/namespaces/{namespace}/jobs/{name}
Parameters
-
name (in path): string, required
name of the Job
-
namespace (in path): string, required
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (Job): OK
201 (Job): Created
401: Unauthorized
patch
partially update status of the specified Job
HTTP Request
PATCH /apis/batch/v1/namespaces/{namespace}/jobs/{name}/status
Parameters
-
name (in path): string, required
name of the Job
-
namespace (in path): string, required
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (Job): OK
201 (Job): Created
401: Unauthorized
delete
delete a Job
HTTP Request
DELETE /apis/batch/v1/namespaces/{namespace}/jobs/{name}
Parameters
-
name (in path): string, required
name of the Job
-
namespace (in path): string, required
-
body: DeleteOptions
-
dryRun (in query): string
-
gracePeriodSeconds (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
Response
200 (Status): OK
202 (Status): Accepted
401: Unauthorized
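When deleting a Job, the propagationPolicy parameter decides what happens to its Pods; Foreground waits for dependents to be removed before the Job object itself disappears. A hedged client-go sketch (namespace and name are placeholders):
package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deleteJobAndPods deletes a Job with Foreground propagation so its Pods
// are cleaned up rather than orphaned.
func deleteJobAndPods(ctx context.Context, cs kubernetes.Interface, ns, name string) error {
	policy := metav1.DeletePropagationForeground
	return cs.BatchV1().Jobs(ns).Delete(ctx, name, metav1.DeleteOptions{
		PropagationPolicy: &policy,
	})
}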
deletecollection
delete collection of Job
HTTP Request
DELETE /apis/batch/v1/namespaces/{namespace}/jobs
Parameters
-
namespace (in path): string, required
-
body: DeleteOptions
-
continue (in query): string
-
dryRun (in query): string
-
fieldSelector (in query): string
-
gracePeriodSeconds (in query): integer
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
Response
200 (Status): OK
401: Unauthorized
6.5.1.10 - CronJob
apiVersion: batch/v1
import "k8s.io/api/batch/v1"
CronJob
CronJob represents the configuration of a single cron job.
-
apiVersion: batch/v1
-
kind: CronJob
-
metadata (ObjectMeta)
Standard object's metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
spec (CronJobSpec)
Specification of the desired behavior of a cron job, including the schedule. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#spec-and-status
-
status (CronJobStatus)
Current status of a cron job. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#spec-and-status
CronJobSpec
CronJobSpec describes how the job execution will look and when it will actually run.
-
jobTemplate (JobTemplateSpec), required
Specifies the job that will be created when executing a CronJob.
JobTemplateSpec describes the data a Job should have when created from a template
-
jobTemplate.metadata (ObjectMeta)
Standard object's metadata of the jobs created from this template. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
jobTemplate.spec (JobSpec)
Specification of the desired behavior of the job. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#spec-and-status
-
-
schedule (string), required
The schedule in Cron format, see https://en.wikipedia.org/wiki/Cron.
-
concurrencyPolicy (string)
Specifies how to treat concurrent executions of a Job. Valid values are: - "Allow" (default): allows CronJobs to run concurrently; - "Forbid": forbids concurrent runs, skipping next run if previous run hasn't finished yet; - "Replace": cancels currently running job and replaces it with a new one
Possible enum values:
"Allow"
allows CronJobs to run concurrently."Forbid"
forbids concurrent runs, skipping next run if previous hasn't finished yet."Replace"
cancels currently running job and replaces it with a new one.
-
startingDeadlineSeconds (int64)
Optional deadline in seconds for starting the job if it misses its scheduled time for any reason. Missed job executions will be counted as failed ones.
-
suspend (boolean)
This flag tells the controller to suspend subsequent executions; it does not apply to already started executions. Defaults to false.
-
successfulJobsHistoryLimit (int32)
The number of successful finished jobs to retain. Value must be a non-negative integer. Defaults to 3.
-
failedJobsHistoryLimit (int32)
The number of failed finished jobs to retain. Value must be a non-negative integer. Defaults to 1.
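Pulling the CronJobSpec fields together, here is a sketch of a CronJob that runs every five minutes and forbids overlapping runs; all names and the image are placeholders:
package sketch

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// everyFiveMinutes wires the required schedule and jobTemplate fields
// together with a Forbid concurrency policy.
func everyFiveMinutes() *batchv1.CronJob {
	return &batchv1.CronJob{
		ObjectMeta: metav1.ObjectMeta{Name: "report"},
		Spec: batchv1.CronJobSpec{
			Schedule:          "*/5 * * * *",            // standard cron format
			ConcurrencyPolicy: batchv1.ForbidConcurrent, // skip a run if the previous is still active
			JobTemplate: batchv1.JobTemplateSpec{
				Spec: batchv1.JobSpec{
					Template: corev1.PodTemplateSpec{
						Spec: corev1.PodSpec{
							RestartPolicy: corev1.RestartPolicyOnFailure,
							Containers:    []corev1.Container{{Name: "report", Image: "busybox:1.35"}},
						},
					},
				},
			},
		},
	}
}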
CronJobStatus
CronJobStatus represents the current state of a cron job.
-
active ([]ObjectReference)
Atomic: will be replaced during a merge
A list of pointers to currently running jobs.
-
lastScheduleTime (Time)
Information about when the job was last successfully scheduled.
Time is a wrapper around time.Time which supports correct marshaling to YAML and JSON. Wrappers are provided for many of the factory methods that the time package offers.
-
lastSuccessfulTime (Time)
Information about when the job last successfully completed.
Time is a wrapper around time.Time which supports correct marshaling to YAML and JSON. Wrappers are provided for many of the factory methods that the time package offers.
CronJobList
CronJobList is a collection of cron jobs.
-
apiVersion: batch/v1
-
kind: CronJobList
-
metadata (ListMeta)
Standard list metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
items ([]CronJob), required
items is the list of CronJobs.
Operations
get
read the specified CronJob
HTTP Request
GET /apis/batch/v1/namespaces/{namespace}/cronjobs/{name}
Parameters
-
name (in path): string, required
name of the CronJob
-
namespace (in path): string, required
-
pretty (in query): string
Response
200 (CronJob): OK
401: Unauthorized
get
read status of the specified CronJob
HTTP Request
GET /apis/batch/v1/namespaces/{namespace}/cronjobs/{name}/status
Parameters
-
name (in path): string, required
name of the CronJob
-
namespace (in path): string, required
-
pretty (in query): string
Response
200 (CronJob): OK
401: Unauthorized
list
list or watch objects of kind CronJob
HTTP Request
GET /apis/batch/v1/namespaces/{namespace}/cronjobs
Parameters
-
namespace (in path): string, required
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (CronJobList): OK
401: Unauthorized
list
list or watch objects of kind CronJob
HTTP Request
GET /apis/batch/v1/cronjobs
Parameters
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (CronJobList): OK
401: Unauthorized
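The query parameters above map directly onto common kubectl flags; a hedged sketch of a paginated, label-filtered list (the label is hypothetical):

# List CronJobs in all namespaces matching a label, 50 objects per page
# (kubectl drives the limit/continue parameters under the hood):
kubectl get cronjobs --all-namespaces -l team=reporting --chunk-size=50

# The same filter expressed directly against the cluster-scoped endpoint:
kubectl get --raw '/apis/batch/v1/cronjobs?labelSelector=team%3Dreporting&limit=50'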
create
create a CronJob
HTTP Request
POST /apis/batch/v1/namespaces/{namespace}/cronjobs
Parameters
-
namespace (in path): string, required
-
body: CronJob, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (CronJob): OK
201 (CronJob): Created
202 (CronJob): Accepted
401: Unauthorized
update
replace the specified CronJob
HTTP Request
PUT /apis/batch/v1/namespaces/{namespace}/cronjobs/{name}
Parameters
-
name (in path): string, required
name of the CronJob
-
namespace (in path): string, required
-
body: CronJob, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (CronJob): OK
201 (CronJob): Created
401: Unauthorized
update
replace status of the specified CronJob
HTTP Request
PUT /apis/batch/v1/namespaces/{namespace}/cronjobs/{name}/status
Parameters
-
name (in path): string, required
name of the CronJob
-
namespace (in path): string, required
-
body: CronJob, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (CronJob): OK
201 (CronJob): Created
401: Unauthorized
patch
partially update the specified CronJob
HTTP Request
PATCH /apis/batch/v1/namespaces/{namespace}/cronjobs/{name}
Parameters
-
name (in path): string, required
name of the CronJob
-
namespace (in path): string, required
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (CronJob): OK
201 (CronJob): Created
401: Unauthorized
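A common use of this endpoint is toggling spec.suspend; a minimal sketch using a merge patch (object and namespace names are hypothetical):

# Pause future runs of a CronJob without deleting it:
kubectl patch cronjob nightly-report -n reporting --type=merge -p '{"spec":{"suspend":true}}'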
patch
partially update status of the specified CronJob
HTTP Request
PATCH /apis/batch/v1/namespaces/{namespace}/cronjobs/{name}/status
Parameters
-
name (in path): string, required
name of the CronJob
-
namespace (in path): string, required
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (CronJob): OK
201 (CronJob): Created
401: Unauthorized
delete
delete a CronJob
HTTP Request
DELETE /apis/batch/v1/namespaces/{namespace}/cronjobs/{name}
Parameters
-
name (in path): string, required
name of the CronJob
-
namespace (in path): string, required
-
body: DeleteOptions
-
dryRun (in query): string
-
gracePeriodSeconds (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
Response
200 (Status): OK
202 (Status): Accepted
401: Unauthorized
deletecollection
delete collection of CronJob
HTTP Request
DELETE /apis/batch/v1/namespaces/{namespace}/cronjobs
Parameters
-
namespace (in path): string, required
-
body: DeleteOptions
-
continue (in query): string
-
dryRun (in query): string
-
fieldSelector (in query): string
-
gracePeriodSeconds (in query): integer
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
Response
200 (Status): OK
401: Unauthorized
6.5.1.11 - HorizontalPodAutoscaler
apiVersion: autoscaling/v1
import "k8s.io/api/autoscaling/v1"
HorizontalPodAutoscaler
configuration of a horizontal pod autoscaler.
-
apiVersion: autoscaling/v1
-
kind: HorizontalPodAutoscaler
-
metadata (ObjectMeta)
Standard object metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
spec (HorizontalPodAutoscalerSpec)
behaviour of the autoscaler. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#spec-and-status.
-
status (HorizontalPodAutoscalerStatus)
current information about the autoscaler.
HorizontalPodAutoscalerSpec
specification of a horizontal pod autoscaler.
-
maxReplicas (int32), required
upper limit for the number of pods that can be set by the autoscaler; cannot be smaller than MinReplicas.
-
scaleTargetRef (CrossVersionObjectReference), required
reference to scaled resource; horizontal pod autoscaler will learn the current resource consumption and will set the desired number of pods by using its Scale subresource.
CrossVersionObjectReference contains enough information to let you identify the referred resource.
-
scaleTargetRef.kind (string), required
Kind of the referent; More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds
-
scaleTargetRef.name (string), required
Name of the referent; More info: http://kubernetes.io/docs/user-guide/identifiers#names
-
scaleTargetRef.apiVersion (string)
API version of the referent
-
-
minReplicas (int32)
minReplicas is the lower limit for the number of replicas to which the autoscaler can scale down. It defaults to 1 pod. minReplicas is allowed to be 0 if the alpha feature gate HPAScaleToZero is enabled and at least one Object or External metric is configured. Scaling is active as long as at least one metric value is available.
-
targetCPUUtilizationPercentage (int32)
target average CPU utilization (represented as a percentage of requested CPU) over all the pods; if not specified, the default autoscaling policy will be used.
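These spec fields are sufficient for a complete autoscaling/v1 object; a minimal sketch, assuming a Deployment named web exists (the names are placeholders):

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa                       # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                         # hypothetical workload to scale
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70  # keep average CPU near 70% of requests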
HorizontalPodAutoscalerStatus
current status of a horizontal pod autoscaler
-
currentReplicas (int32), required
current number of replicas of pods managed by this autoscaler.
-
desiredReplicas (int32), required
desired number of replicas of pods managed by this autoscaler.
-
currentCPUUtilizationPercentage (int32)
current average CPU utilization over all pods, represented as a percentage of requested CPU; e.g. 70 means that an average pod is currently using 70% of its requested CPU.
-
lastScaleTime (Time)
last time the HorizontalPodAutoscaler scaled the number of pods; used by the autoscaler to control how often the number of pods is changed.
Time is a wrapper around time.Time which supports correct marshaling to YAML and JSON. Wrappers are provided for many of the factory methods that the time package offers.
-
observedGeneration (int64)
most recent generation observed by this autoscaler.
HorizontalPodAutoscalerList
list of horizontal pod autoscaler objects.
-
apiVersion: autoscaling/v1
-
kind: HorizontalPodAutoscalerList
-
metadata (ListMeta)
Standard list metadata.
-
items ([]HorizontalPodAutoscaler), required
list of horizontal pod autoscaler objects.
Operations
get
read the specified HorizontalPodAutoscaler
HTTP Request
GET /apis/autoscaling/v1/namespaces/{namespace}/horizontalpodautoscalers/{name}
Parameters
-
name (in path): string, required
name of the HorizontalPodAutoscaler
-
namespace (in path): string, required
-
pretty (in query): string
Response
200 (HorizontalPodAutoscaler): OK
401: Unauthorized
get
read status of the specified HorizontalPodAutoscaler
HTTP Request
GET /apis/autoscaling/v1/namespaces/{namespace}/horizontalpodautoscalers/{name}/status
Parameters
-
name (in path): string, required
name of the HorizontalPodAutoscaler
-
namespace (in path): string, required
-
pretty (in query): string
Response
200 (HorizontalPodAutoscaler): OK
401: Unauthorized
list
list or watch objects of kind HorizontalPodAutoscaler
HTTP Request
GET /apis/autoscaling/v1/namespaces/{namespace}/horizontalpodautoscalers
Parameters
-
namespace (in path): string, required
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (HorizontalPodAutoscalerList): OK
401: Unauthorized
list
list or watch objects of kind HorizontalPodAutoscaler
HTTP Request
GET /apis/autoscaling/v1/horizontalpodautoscalers
Parameters
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (HorizontalPodAutoscalerList): OK
401: Unauthorized
create
create a HorizontalPodAutoscaler
HTTP Request
POST /apis/autoscaling/v1/namespaces/{namespace}/horizontalpodautoscalers
Parameters
-
namespace (in path): string, required
-
body: HorizontalPodAutoscaler, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (HorizontalPodAutoscaler): OK
201 (HorizontalPodAutoscaler): Created
202 (HorizontalPodAutoscaler): Accepted
401: Unauthorized
update
replace the specified HorizontalPodAutoscaler
HTTP Request
PUT /apis/autoscaling/v1/namespaces/{namespace}/horizontalpodautoscalers/{name}
Parameters
-
name (in path): string, required
name of the HorizontalPodAutoscaler
-
namespace (in path): string, required
-
body: HorizontalPodAutoscaler, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (HorizontalPodAutoscaler): OK
201 (HorizontalPodAutoscaler): Created
401: Unauthorized
update
replace status of the specified HorizontalPodAutoscaler
HTTP Request
PUT /apis/autoscaling/v1/namespaces/{namespace}/horizontalpodautoscalers/{name}/status
Parameters
-
name (in path): string, required
name of the HorizontalPodAutoscaler
-
namespace (in path): string, required
-
body: HorizontalPodAutoscaler, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (HorizontalPodAutoscaler): OK
201 (HorizontalPodAutoscaler): Created
401: Unauthorized
patch
partially update the specified HorizontalPodAutoscaler
HTTP Request
PATCH /apis/autoscaling/v1/namespaces/{namespace}/horizontalpodautoscalers/{name}
Parameters
-
name (in path): string, required
name of the HorizontalPodAutoscaler
-
namespace (in path): string, required
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (HorizontalPodAutoscaler): OK
201 (HorizontalPodAutoscaler): Created
401: Unauthorized
patch
partially update status of the specified HorizontalPodAutoscaler
HTTP Request
PATCH /apis/autoscaling/v1/namespaces/{namespace}/horizontalpodautoscalers/{name}/status
Parameters
-
name (in path): string, required
name of the HorizontalPodAutoscaler
-
namespace (in path): string, required
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (HorizontalPodAutoscaler): OK
201 (HorizontalPodAutoscaler): Created
401: Unauthorized
delete
delete a HorizontalPodAutoscaler
HTTP Request
DELETE /apis/autoscaling/v1/namespaces/{namespace}/horizontalpodautoscalers/{name}
Parameters
-
name (in path): string, required
name of the HorizontalPodAutoscaler
-
namespace (in path): string, required
-
body: DeleteOptions
-
dryRun (in query): string
-
gracePeriodSeconds (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
Response
200 (Status): OK
202 (Status): Accepted
401: Unauthorized
deletecollection
delete collection of HorizontalPodAutoscaler
HTTP Request
DELETE /apis/autoscaling/v1/namespaces/{namespace}/horizontalpodautoscalers
Parameters
-
namespace (in path): string, required
-
body: DeleteOptions
-
continue (in query): string
-
dryRun (in query): string
-
fieldSelector (in query): string
-
gracePeriodSeconds (in query): integer
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
Response
200 (Status): OK
401: Unauthorized
6.5.1.12 - HorizontalPodAutoscaler
apiVersion: autoscaling/v2
import "k8s.io/api/autoscaling/v2"
HorizontalPodAutoscaler
HorizontalPodAutoscaler is the configuration for a horizontal pod autoscaler, which automatically manages the replica count of any resource implementing the scale subresource based on the metrics specified.
-
apiVersion: autoscaling/v2
-
kind: HorizontalPodAutoscaler
-
metadata (ObjectMeta)
metadata is the standard object metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
spec (HorizontalPodAutoscalerSpec)
spec is the specification for the behaviour of the autoscaler. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#spec-and-status.
-
status (HorizontalPodAutoscalerStatus)
status is the current information about the autoscaler.
HorizontalPodAutoscalerSpec
HorizontalPodAutoscalerSpec describes the desired functionality of the HorizontalPodAutoscaler.
-
maxReplicas (int32), required
maxReplicas is the upper limit for the number of replicas to which the autoscaler can scale up. It cannot be less than minReplicas.
-
scaleTargetRef (CrossVersionObjectReference), required
scaleTargetRef points to the target resource to scale, and is used to identify the pods for which metrics should be collected, as well as to actually change the replica count.
CrossVersionObjectReference contains enough information to let you identify the referred resource.
-
scaleTargetRef.kind (string), required
Kind of the referent; More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds
-
scaleTargetRef.name (string), required
Name of the referent; More info: http://kubernetes.io/docs/user-guide/identifiers#names
-
scaleTargetRef.apiVersion (string)
API version of the referent
-
-
minReplicas (int32)
minReplicas is the lower limit for the number of replicas to which the autoscaler can scale down. It defaults to 1 pod. minReplicas is allowed to be 0 if the alpha feature gate HPAScaleToZero is enabled and at least one Object or External metric is configured. Scaling is active as long as at least one metric value is available.
-
behavior (HorizontalPodAutoscalerBehavior)
behavior configures the scaling behavior of the target in both Up and Down directions (scaleUp and scaleDown fields respectively). If not set, the default HPAScalingRules for scale up and scale down are used.
HorizontalPodAutoscalerBehavior configures the scaling behavior of the target in both Up and Down directions (scaleUp and scaleDown fields respectively).
-
behavior.scaleDown (HPAScalingRules)
scaleDown is the scaling policy for scaling down. If not set, the default value allows scaling down to minReplicas pods, with a 300 second stabilization window (i.e., the highest recommendation over the last 300 seconds is used).
HPAScalingRules configures the scaling behavior for one direction. These rules are applied after calculating DesiredReplicas from metrics for the HPA. They can limit the scaling velocity by specifying scaling policies. They can prevent flapping by specifying the stabilization window, so that the number of replicas is not set instantly; instead, the safest value from the stabilization window is chosen.
-
behavior.scaleDown.policies ([]HPAScalingPolicy)
Atomic: will be replaced during a merge
policies is a list of potential scaling policies which can be used during scaling. At least one policy must be specified, otherwise the HPAScalingRules will be discarded as invalid.
HPAScalingPolicy is a single policy which must hold true for a specified past interval.
-
behavior.scaleDown.policies.type (string), required
Type is used to specify the scaling policy.
-
behavior.scaleDown.policies.value (int32), required
Value contains the amount of change which is permitted by the policy. It must be greater than zero
-
behavior.scaleDown.policies.periodSeconds (int32), required
PeriodSeconds specifies the window of time for which the policy should hold true. PeriodSeconds must be greater than zero and less than or equal to 1800 (30 min).
-
-
behavior.scaleDown.selectPolicy (string)
selectPolicy is used to specify which policy should be used. If not set, the default value Max is used.
-
behavior.scaleDown.stabilizationWindowSeconds (int32)
StabilizationWindowSeconds is the number of seconds for which past recommendations should be considered while scaling up or scaling down. StabilizationWindowSeconds must be greater than or equal to zero and less than or equal to 3600 (one hour). If not set, the default values are used: for scale up, 0 (i.e. no stabilization is done); for scale down, 300 (i.e. the stabilization window is 300 seconds long).
-
-
behavior.scaleUp (HPAScalingRules)
scaleUp is the scaling policy for scaling up. If not set, the default value is the higher of:
- increase no more than 4 pods per 60 seconds
- double the number of pods per 60 seconds
No stabilization is used.
HPAScalingRules configures the scaling behavior for one direction. These rules are applied after calculating DesiredReplicas from metrics for the HPA. They can limit the scaling velocity by specifying scaling policies. They can prevent flapping by specifying the stabilization window, so that the number of replicas is not set instantly; instead, the safest value from the stabilization window is chosen.
-
behavior.scaleUp.policies ([]HPAScalingPolicy)
Atomic: will be replaced during a merge
policies is a list of potential scaling policies which can be used during scaling. At least one policy must be specified, otherwise the HPAScalingRules will be discarded as invalid.
HPAScalingPolicy is a single policy which must hold true for a specified past interval.
-
behavior.scaleUp.policies.type (string), required
Type is used to specify the scaling policy.
-
behavior.scaleUp.policies.value (int32), required
Value contains the amount of change which is permitted by the policy. It must be greater than zero
-
behavior.scaleUp.policies.periodSeconds (int32), required
PeriodSeconds specifies the window of time for which the policy should hold true. PeriodSeconds must be greater than zero and less than or equal to 1800 (30 min).
-
-
behavior.scaleUp.selectPolicy (string)
selectPolicy is used to specify which policy should be used. If not set, the default value Max is used.
-
behavior.scaleUp.stabilizationWindowSeconds (int32)
StabilizationWindowSeconds is the number of seconds for which past recommendations should be considered while scaling up or scaling down. StabilizationWindowSeconds must be greater than or equal to zero and less than or equal to 3600 (one hour). If not set, the default values are used: for scale up, 0 (i.e. no stabilization is done); for scale down, 300 (i.e. the stabilization window is 300 seconds long).
-
-
metrics ([]MetricSpec)
Atomic: will be replaced during a merge
metrics contains the specifications used to calculate the desired replica count (the maximum replica count across all metrics will be used). The desired replica count is calculated by multiplying the ratio between the target value and the current value by the current number of pods. Ergo, metrics used must decrease as the pod count is increased, and vice versa. See the individual metric source types for more information about how each type of metric must respond. If not set, the default metric will be set to 80% average CPU utilization. A combined manifest example appears after this field list.
MetricSpec specifies how to scale based on a single metric (only type and one other matching field should be set at once).
-
metrics.type (string), required
type is the type of metric source. It should be one of "ContainerResource", "External", "Object", "Pods" or "Resource", each mapping to a matching field in the object. Note: "ContainerResource" type is available only when the HPAContainerMetrics feature gate is enabled.
-
metrics.containerResource (ContainerResourceMetricSource)
containerResource refers to a resource metric (such as those specified in requests and limits) known to Kubernetes describing a single container in each pod of the current scale target (e.g. CPU or memory). Such metrics are built in to Kubernetes, and have special scaling options on top of those available to normal per-pod metrics using the "pods" source. This is an alpha feature and can be enabled by the HPAContainerMetrics feature flag.
ContainerResourceMetricSource indicates how to scale on a resource metric known to Kubernetes, as specified in requests and limits, describing each pod in the current scale target (e.g. CPU or memory). The values will be averaged together before being compared to the target. Such metrics are built in to Kubernetes, and have special scaling options on top of those available to normal per-pod metrics using the "pods" source. Only one "target" type should be set.
-
metrics.containerResource.container (string), required
container is the name of the container in the pods of the scaling target
-
metrics.containerResource.name (string), required
name is the name of the resource in question.
-
metrics.containerResource.target (MetricTarget), required
target specifies the target value for the given metric
MetricTarget defines the target value, average value, or average utilization of a specific metric
-
metrics.containerResource.target.type (string), required
type represents whether the metric type is Utilization, Value, or AverageValue
-
metrics.containerResource.target.averageUtilization (int32)
averageUtilization is the target value of the average of the resource metric across all relevant pods, represented as a percentage of the requested value of the resource for the pods. Currently only valid for Resource metric source type
-
metrics.containerResource.target.averageValue (Quantity)
averageValue is the target value of the average of the metric across all relevant pods (as a quantity)
-
metrics.containerResource.target.value (Quantity)
value is the target value of the metric (as a quantity).
-
-
-
metrics.external (ExternalMetricSource)
external refers to a global metric that is not associated with any Kubernetes object. It allows autoscaling based on information coming from components running outside of cluster (for example length of queue in cloud messaging service, or QPS from loadbalancer running outside of cluster).
ExternalMetricSource indicates how to scale on a metric not associated with any Kubernetes object (for example length of queue in cloud messaging service, or QPS from loadbalancer running outside of cluster).
-
metrics.external.metric (MetricIdentifier), required
metric identifies the target metric by name and selector
MetricIdentifier defines the name and optionally selector for a metric
-
metrics.external.metric.name (string), required
name is the name of the given metric
-
metrics.external.metric.selector (LabelSelector)
selector is the string-encoded form of a standard kubernetes label selector for the given metric. When set, it is passed as an additional parameter to the metrics server for more specific metrics scoping. When unset, just the metricName will be used to gather metrics.
-
-
metrics.external.target (MetricTarget), required
target specifies the target value for the given metric
MetricTarget defines the target value, average value, or average utilization of a specific metric
-
metrics.external.target.type (string), required
type represents whether the metric type is Utilization, Value, or AverageValue
-
metrics.external.target.averageUtilization (int32)
averageUtilization is the target value of the average of the resource metric across all relevant pods, represented as a percentage of the requested value of the resource for the pods. Currently only valid for Resource metric source type
-
metrics.external.target.averageValue (Quantity)
averageValue is the target value of the average of the metric across all relevant pods (as a quantity)
-
metrics.external.target.value (Quantity)
value is the target value of the metric (as a quantity).
-
-
-
metrics.object (ObjectMetricSource)
object refers to a metric describing a single kubernetes object (for example, hits-per-second on an Ingress object).
ObjectMetricSource indicates how to scale on a metric describing a kubernetes object (for example, hits-per-second on an Ingress object).
-
metrics.object.describedObject (CrossVersionObjectReference), required
describedObject specifies the description of the referenced object, such as kind, name, and apiVersion.
CrossVersionObjectReference contains enough information to let you identify the referred resource.
-
metrics.object.describedObject.kind (string), required
Kind of the referent; More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds
-
metrics.object.describedObject.name (string), required
Name of the referent; More info: http://kubernetes.io/docs/user-guide/identifiers#names
-
metrics.object.describedObject.apiVersion (string)
API version of the referent
-
-
metrics.object.metric (MetricIdentifier), required
metric identifies the target metric by name and selector
MetricIdentifier defines the name and optionally selector for a metric
-
metrics.object.metric.name (string), required
name is the name of the given metric
-
metrics.object.metric.selector (LabelSelector)
selector is the string-encoded form of a standard kubernetes label selector for the given metric. When set, it is passed as an additional parameter to the metrics server for more specific metrics scoping. When unset, just the metricName will be used to gather metrics.
-
-
metrics.object.target (MetricTarget), required
target specifies the target value for the given metric
MetricTarget defines the target value, average value, or average utilization of a specific metric
-
metrics.object.target.type (string), required
type represents whether the metric type is Utilization, Value, or AverageValue
-
metrics.object.target.averageUtilization (int32)
averageUtilization is the target value of the average of the resource metric across all relevant pods, represented as a percentage of the requested value of the resource for the pods. Currently only valid for Resource metric source type
-
metrics.object.target.averageValue (Quantity)
averageValue is the target value of the average of the metric across all relevant pods (as a quantity)
-
metrics.object.target.value (Quantity)
value is the target value of the metric (as a quantity).
-
-
-
metrics.pods (PodsMetricSource)
pods refers to a metric describing each pod in the current scale target (for example, transactions-processed-per-second). The values will be averaged together before being compared to the target value.
PodsMetricSource indicates how to scale on a metric describing each pod in the current scale target (for example, transactions-processed-per-second). The values will be averaged together before being compared to the target value.
-
metrics.pods.metric (MetricIdentifier), required
metric identifies the target metric by name and selector
MetricIdentifier defines the name and optionally selector for a metric
-
metrics.pods.metric.name (string), required
name is the name of the given metric
-
metrics.pods.metric.selector (LabelSelector)
selector is the string-encoded form of a standard kubernetes label selector for the given metric. When set, it is passed as an additional parameter to the metrics server for more specific metrics scoping. When unset, just the metricName will be used to gather metrics.
-
-
metrics.pods.target (MetricTarget), required
target specifies the target value for the given metric
MetricTarget defines the target value, average value, or average utilization of a specific metric
-
metrics.pods.target.type (string), required
type represents whether the metric type is Utilization, Value, or AverageValue
-
metrics.pods.target.averageUtilization (int32)
averageUtilization is the target value of the average of the resource metric across all relevant pods, represented as a percentage of the requested value of the resource for the pods. Currently only valid for Resource metric source type
-
metrics.pods.target.averageValue (Quantity)
averageValue is the target value of the average of the metric across all relevant pods (as a quantity)
-
metrics.pods.target.value (Quantity)
value is the target value of the metric (as a quantity).
-
-
-
metrics.resource (ResourceMetricSource)
resource refers to a resource metric (such as those specified in requests and limits) known to Kubernetes describing each pod in the current scale target (e.g. CPU or memory). Such metrics are built in to Kubernetes, and have special scaling options on top of those available to normal per-pod metrics using the "pods" source.
ResourceMetricSource indicates how to scale on a resource metric known to Kubernetes, as specified in requests and limits, describing each pod in the current scale target (e.g. CPU or memory). The values will be averaged together before being compared to the target. Such metrics are built in to Kubernetes, and have special scaling options on top of those available to normal per-pod metrics using the "pods" source. Only one "target" type should be set.
-
metrics.resource.name (string), required
name is the name of the resource in question.
-
metrics.resource.target (MetricTarget), required
target specifies the target value for the given metric
MetricTarget defines the target value, average value, or average utilization of a specific metric
-
metrics.resource.target.type (string), required
type represents whether the metric type is Utilization, Value, or AverageValue
-
metrics.resource.target.averageUtilization (int32)
averageUtilization is the target value of the average of the resource metric across all relevant pods, represented as a percentage of the requested value of the resource for the pods. Currently only valid for Resource metric source type
-
metrics.resource.target.averageValue (Quantity)
averageValue is the target value of the average of the metric across all relevant pods (as a quantity)
-
metrics.resource.target.value (Quantity)
value is the target value of the metric (as a quantity).
-
-
-
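Taken together, the spec fields above compose into a manifest like this minimal autoscaling/v2 sketch, combining a built-in Resource metric, a custom Pods metric, and explicit scaling behavior; all names and numbers are illustrative placeholders.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa                   # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                     # hypothetical workload to scale
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70    # percentage of requested CPU, averaged across pods
  - type: Pods
    pods:
      metric:
        name: requests_per_second # hypothetical custom per-pod metric
      target:
        type: AverageValue
        averageValue: "100"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100                # at most double the replica count ...
        periodSeconds: 60         # ... per 60-second window
    scaleDown:
      stabilizationWindowSeconds: 300
      selectPolicy: Min           # apply the most conservative matching policy
      policies:
      - type: Pods
        value: 2                  # remove at most 2 pods ...
        periodSeconds: 60         # ... per 60-second window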
HorizontalPodAutoscalerStatus
HorizontalPodAutoscalerStatus describes the current status of a horizontal pod autoscaler.
-
desiredReplicas (int32), required
desiredReplicas is the desired number of replicas of pods managed by this autoscaler, as last calculated by the autoscaler.
-
conditions ([]HorizontalPodAutoscalerCondition)
Patch strategy: merge on key type
Map: unique values on key type will be kept during a merge
conditions is the set of conditions required for this autoscaler to scale its target, and indicates whether or not those conditions are met.
HorizontalPodAutoscalerCondition describes the state of a HorizontalPodAutoscaler at a certain point.
-
conditions.status (string), required
status is the status of the condition (True, False, Unknown)
-
conditions.type (string), required
type describes the current condition
-
conditions.lastTransitionTime (Time)
lastTransitionTime is the last time the condition transitioned from one status to another
Time is a wrapper around time.Time which supports correct marshaling to YAML and JSON. Wrappers are provided for many of the factory methods that the time package offers.
-
conditions.message (string)
message is a human-readable explanation containing details about the transition
-
conditions.reason (string)
reason is the reason for the condition's last transition.
-
-
currentMetrics ([]MetricStatus)
Atomic: will be replaced during a merge
currentMetrics is the last read state of the metrics used by this autoscaler.
MetricStatus describes the last-read state of a single metric.
-
currentMetrics.type (string), required
type is the type of metric source. It will be one of "ContainerResource", "External", "Object", "Pods" or "Resource", each corresponding to a matching field in the object. Note: "ContainerResource" type is available only when the HPAContainerMetrics feature gate is enabled.
-
currentMetrics.containerResource (ContainerResourceMetricStatus)
container resource refers to a resource metric (such as those specified in requests and limits) known to Kubernetes describing a single container in each pod in the current scale target (e.g. CPU or memory). Such metrics are built in to Kubernetes, and have special scaling options on top of those available to normal per-pod metrics using the "pods" source.
ContainerResourceMetricStatus indicates the current value of a resource metric known to Kubernetes, as specified in requests and limits, describing a single container in each pod in the current scale target (e.g. CPU or memory). Such metrics are built in to Kubernetes, and have special scaling options on top of those available to normal per-pod metrics using the "pods" source.
-
currentMetrics.containerResource.container (string), required
Container is the name of the container in the pods of the scaling target
-
currentMetrics.containerResource.current (MetricValueStatus), required
current contains the current value for the given metric
MetricValueStatus holds the current value for a metric
-
currentMetrics.containerResource.current.averageUtilization (int32)
currentAverageUtilization is the current value of the average of the resource metric across all relevant pods, represented as a percentage of the requested value of the resource for the pods.
-
currentMetrics.containerResource.current.averageValue (Quantity)
averageValue is the current value of the average of the metric across all relevant pods (as a quantity)
-
currentMetrics.containerResource.current.value (Quantity)
value is the current value of the metric (as a quantity).
-
-
currentMetrics.containerResource.name (string), required
Name is the name of the resource in question.
-
-
currentMetrics.external (ExternalMetricStatus)
external refers to a global metric that is not associated with any Kubernetes object. It allows autoscaling based on information coming from components running outside of cluster (for example length of queue in cloud messaging service, or QPS from loadbalancer running outside of cluster).
ExternalMetricStatus indicates the current value of a global metric not associated with any Kubernetes object.
-
currentMetrics.external.current (MetricValueStatus), required
current contains the current value for the given metric
MetricValueStatus holds the current value for a metric
-
currentMetrics.external.current.averageUtilization (int32)
currentAverageUtilization is the current value of the average of the resource metric across all relevant pods, represented as a percentage of the requested value of the resource for the pods.
-
currentMetrics.external.current.averageValue (Quantity)
averageValue is the current value of the average of the metric across all relevant pods (as a quantity)
-
currentMetrics.external.current.value (Quantity)
value is the current value of the metric (as a quantity).
-
-
currentMetrics.external.metric (MetricIdentifier), required
metric identifies the target metric by name and selector
MetricIdentifier defines the name and optionally selector for a metric
-
currentMetrics.external.metric.name (string), required
name is the name of the given metric
-
currentMetrics.external.metric.selector (LabelSelector)
selector is the string-encoded form of a standard kubernetes label selector for the given metric. When set, it is passed as an additional parameter to the metrics server for more specific metrics scoping. When unset, just the metricName will be used to gather metrics.
-
-
-
currentMetrics.object (ObjectMetricStatus)
object refers to a metric describing a single kubernetes object (for example, hits-per-second on an Ingress object).
ObjectMetricStatus indicates the current value of a metric describing a kubernetes object (for example, hits-per-second on an Ingress object).
-
currentMetrics.object.current (MetricValueStatus), required
current contains the current value for the given metric
MetricValueStatus holds the current value for a metric
-
currentMetrics.object.current.averageUtilization (int32)
currentAverageUtilization is the current value of the average of the resource metric across all relevant pods, represented as a percentage of the requested value of the resource for the pods.
-
currentMetrics.object.current.averageValue (Quantity)
averageValue is the current value of the average of the metric across all relevant pods (as a quantity)
-
currentMetrics.object.current.value (Quantity)
value is the current value of the metric (as a quantity).
-
-
currentMetrics.object.describedObject (CrossVersionObjectReference), required
DescribedObject specifies the description of the referenced object, such as kind, name, and apiVersion.
CrossVersionObjectReference contains enough information to let you identify the referred resource.
-
currentMetrics.object.describedObject.kind (string), required
Kind of the referent; More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds
-
currentMetrics.object.describedObject.name (string), required
Name of the referent; More info: http://kubernetes.io/docs/user-guide/identifiers#names
-
currentMetrics.object.describedObject.apiVersion (string)
API version of the referent
-
-
currentMetrics.object.metric (MetricIdentifier), required
metric identifies the target metric by name and selector
MetricIdentifier defines the name and optionally selector for a metric
-
currentMetrics.object.metric.name (string), required
name is the name of the given metric
-
currentMetrics.object.metric.selector (LabelSelector)
selector is the string-encoded form of a standard kubernetes label selector for the given metric. When set, it is passed as an additional parameter to the metrics server for more specific metrics scoping. When unset, just the metricName will be used to gather metrics.
-
-
-
currentMetrics.pods (PodsMetricStatus)
pods refers to a metric describing each pod in the current scale target (for example, transactions-processed-per-second). The values will be averaged together before being compared to the target value.
PodsMetricStatus indicates the current value of a metric describing each pod in the current scale target (for example, transactions-processed-per-second).
-
currentMetrics.pods.current (MetricValueStatus), required
current contains the current value for the given metric
MetricValueStatus holds the current value for a metric
-
currentMetrics.pods.current.averageUtilization (int32)
currentAverageUtilization is the current value of the average of the resource metric across all relevant pods, represented as a percentage of the requested value of the resource for the pods.
-
currentMetrics.pods.current.averageValue (Quantity)
averageValue is the current value of the average of the metric across all relevant pods (as a quantity)
-
currentMetrics.pods.current.value (Quantity)
value is the current value of the metric (as a quantity).
-
-
currentMetrics.pods.metric (MetricIdentifier), required
metric identifies the target metric by name and selector
MetricIdentifier defines the name and optionally selector for a metric
-
currentMetrics.pods.metric.name (string), required
name is the name of the given metric
-
currentMetrics.pods.metric.selector (LabelSelector)
selector is the string-encoded form of a standard kubernetes label selector for the given metric. When set, it is passed as an additional parameter to the metrics server for more specific metrics scoping. When unset, just the metricName will be used to gather metrics.
-
-
-
currentMetrics.resource (ResourceMetricStatus)
resource refers to a resource metric (such as those specified in requests and limits) known to Kubernetes describing each pod in the current scale target (e.g. CPU or memory). Such metrics are built in to Kubernetes, and have special scaling options on top of those available to normal per-pod metrics using the "pods" source.
ResourceMetricStatus indicates the current value of a resource metric known to Kubernetes, as specified in requests and limits, describing each pod in the current scale target (e.g. CPU or memory). Such metrics are built in to Kubernetes, and have special scaling options on top of those available to normal per-pod metrics using the "pods" source.
-
currentMetrics.resource.current (MetricValueStatus), required
current contains the current value for the given metric
MetricValueStatus holds the current value for a metric
-
currentMetrics.resource.current.averageUtilization (int32)
currentAverageUtilization is the current value of the average of the resource metric across all relevant pods, represented as a percentage of the requested value of the resource for the pods.
-
currentMetrics.resource.current.averageValue (Quantity)
averageValue is the current value of the average of the metric across all relevant pods (as a quantity)
-
currentMetrics.resource.current.value (Quantity)
value is the current value of the metric (as a quantity).
-
-
currentMetrics.resource.name (string), required
Name is the name of the resource in question.
-
-
-
currentReplicas (int32)
currentReplicas is current number of replicas of pods managed by this autoscaler, as last seen by the autoscaler.
-
lastScaleTime (Time)
lastScaleTime is the last time the HorizontalPodAutoscaler scaled the number of pods, used by the autoscaler to control how often the number of pods is changed.
Time is a wrapper around time.Time which supports correct marshaling to YAML and JSON. Wrappers are provided for many of the factory methods that the time package offers.
-
observedGeneration (int64)
observedGeneration is the most recent generation observed by this autoscaler.
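The status fields above can be inspected without parsing the full object; a small sketch (the HPA name is hypothetical):

# Compare observed and desired replica counts from the status:
kubectl get hpa web-hpa -o jsonpath='{.status.currentReplicas}{" -> "}{.status.desiredReplicas}{"\n"}'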
HorizontalPodAutoscalerList
HorizontalPodAutoscalerList is a list of horizontal pod autoscaler objects.
-
apiVersion: autoscaling/v2
-
kind: HorizontalPodAutoscalerList
-
metadata (ListMeta)
metadata is the standard list metadata.
-
items ([]HorizontalPodAutoscaler), required
items is the list of horizontal pod autoscaler objects.
Operations
get
read the specified HorizontalPodAutoscaler
HTTP Request
GET /apis/autoscaling/v2/namespaces/{namespace}/horizontalpodautoscalers/{name}
Parameters
-
name (in path): string, required
name of the HorizontalPodAutoscaler
-
namespace (in path): string, required
-
pretty (in query): string
Response
200 (HorizontalPodAutoscaler): OK
401: Unauthorized
get
read status of the specified HorizontalPodAutoscaler
HTTP Request
GET /apis/autoscaling/v2/namespaces/{namespace}/horizontalpodautoscalers/{name}/status
Parameters
-
name (in path): string, required
name of the HorizontalPodAutoscaler
-
namespace (in path): string, required
-
pretty (in query): string
Response
200 (HorizontalPodAutoscaler): OK
401: Unauthorized
list
list or watch objects of kind HorizontalPodAutoscaler
HTTP Request
GET /apis/autoscaling/v2/namespaces/{namespace}/horizontalpodautoscalers
Parameters
-
namespace (in path): string, required
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (HorizontalPodAutoscalerList): OK
401: Unauthorized
list
list or watch objects of kind HorizontalPodAutoscaler
HTTP Request
GET /apis/autoscaling/v2/horizontalpodautoscalers
Parameters
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (HorizontalPodAutoscalerList): OK
401: Unauthorized
create
create a HorizontalPodAutoscaler
HTTP Request
POST /apis/autoscaling/v2/namespaces/{namespace}/horizontalpodautoscalers
Parameters
-
namespace (in path): string, required
-
body: HorizontalPodAutoscaler, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (HorizontalPodAutoscaler): OK
201 (HorizontalPodAutoscaler): Created
202 (HorizontalPodAutoscaler): Accepted
401: Unauthorized
update
replace the specified HorizontalPodAutoscaler
HTTP Request
PUT /apis/autoscaling/v2/namespaces/{namespace}/horizontalpodautoscalers/{name}
Parameters
-
name (in path): string, required
name of the HorizontalPodAutoscaler
-
namespace (in path): string, required
-
body: HorizontalPodAutoscaler, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (HorizontalPodAutoscaler): OK
201 (HorizontalPodAutoscaler): Created
401: Unauthorized
update
replace status of the specified HorizontalPodAutoscaler
HTTP Request
PUT /apis/autoscaling/v2/namespaces/{namespace}/horizontalpodautoscalers/{name}/status
Parameters
-
name (in path): string, required
name of the HorizontalPodAutoscaler
-
namespace (in path): string, required
-
body: HorizontalPodAutoscaler, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (HorizontalPodAutoscaler): OK
201 (HorizontalPodAutoscaler): Created
401: Unauthorized
patch
partially update the specified HorizontalPodAutoscaler
HTTP Request
PATCH /apis/autoscaling/v2/namespaces/{namespace}/horizontalpodautoscalers/{name}
Parameters
-
name (in path): string, required
name of the HorizontalPodAutoscaler
-
namespace (in path): string, required
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (HorizontalPodAutoscaler): OK
201 (HorizontalPodAutoscaler): Created
401: Unauthorized
patch
partially update status of the specified HorizontalPodAutoscaler
HTTP Request
PATCH /apis/autoscaling/v2/namespaces/{namespace}/horizontalpodautoscalers/{name}/status
Parameters
-
name (in path): string, required
name of the HorizontalPodAutoscaler
-
namespace (in path): string, required
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (HorizontalPodAutoscaler): OK
201 (HorizontalPodAutoscaler): Created
401: Unauthorized
delete
delete a HorizontalPodAutoscaler
HTTP Request
DELETE /apis/autoscaling/v2/namespaces/{namespace}/horizontalpodautoscalers/{name}
Parameters
-
name (in path): string, required
name of the HorizontalPodAutoscaler
-
namespace (in path): string, required
-
body: DeleteOptions
-
dryRun (in query): string
-
gracePeriodSeconds (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
Response
200 (Status): OK
202 (Status): Accepted
401: Unauthorized
deletecollection
delete collection of HorizontalPodAutoscaler
HTTP Request
DELETE /apis/autoscaling/v2/namespaces/{namespace}/horizontalpodautoscalers
Parameters
-
namespace (in path): string, required
-
body: DeleteOptions
-
continue (in query): string
-
dryRun (in query): string
-
fieldSelector (in query): string
-
gracePeriodSeconds (in query): integer
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
Response
200 (Status): OK
401: Unauthorized
6.5.1.13 - HorizontalPodAutoscaler v2beta2
apiVersion: autoscaling/v2beta2
import "k8s.io/api/autoscaling/v2beta2"
HorizontalPodAutoscaler
HorizontalPodAutoscaler is the configuration for a horizontal pod autoscaler, which automatically manages the replica count of any resource implementing the scale subresource based on the metrics specified.
-
apiVersion: autoscaling/v2beta2
-
kind: HorizontalPodAutoscaler
-
metadata (ObjectMeta)
metadata is the standard object metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
spec (HorizontalPodAutoscalerSpec)
spec is the specification for the behaviour of the autoscaler. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#spec-and-status.
-
status (HorizontalPodAutoscalerStatus)
status is the current information about the autoscaler.
HorizontalPodAutoscalerSpec
HorizontalPodAutoscalerSpec describes the desired functionality of the HorizontalPodAutoscaler.
-
maxReplicas (int32), required
maxReplicas is the upper limit for the number of replicas to which the autoscaler can scale up. It cannot be less than minReplicas.
-
scaleTargetRef (CrossVersionObjectReference), required
scaleTargetRef points to the target resource to scale, and is used to identify the pods for which metrics should be collected, as well as to actually change the replica count.
CrossVersionObjectReference contains enough information to let you identify the referred resource.
-
scaleTargetRef.kind (string), required
Kind of the referent; More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds
-
scaleTargetRef.name (string), required
Name of the referent; More info: http://kubernetes.io/docs/user-guide/identifiers#names
-
scaleTargetRef.apiVersion (string)
API version of the referent
-
-
minReplicas (int32)
minReplicas is the lower limit for the number of replicas to which the autoscaler can scale down. It defaults to 1 pod. minReplicas is allowed to be 0 if the alpha feature gate HPAScaleToZero is enabled and at least one Object or External metric is configured. Scaling is active as long as at least one metric value is available.
-
behavior (HorizontalPodAutoscalerBehavior)
behavior configures the scaling behavior of the target in both Up and Down directions (scaleUp and scaleDown fields respectively). If not set, the default HPAScalingRules for scale up and scale down are used.
HorizontalPodAutoscalerBehavior configures the scaling behavior of the target in both Up and Down directions (scaleUp and scaleDown fields respectively).
-
behavior.scaleDown (HPAScalingRules)
scaleDown is the scaling policy for scaling down. If not set, the default value allows scaling down to minReplicas pods, with a 300 second stabilization window (i.e., the highest recommendation over the last 300 seconds is used).
HPAScalingRules configures the scaling behavior for one direction. These rules are applied after calculating DesiredReplicas from metrics for the HPA. They can limit the scaling velocity by specifying scaling policies. They can prevent flapping by specifying the stabilization window, so that the number of replicas is not set instantly; instead, the safest value from the stabilization window is chosen.
-
behavior.scaleDown.policies ([]HPAScalingPolicy)
policies is a list of potential scaling policies which can be used during scaling. At least one policy must be specified, otherwise the HPAScalingRules will be discarded as invalid.
HPAScalingPolicy is a single policy which must hold true for a specified past interval.
-
behavior.scaleDown.policies.type (string), required
Type is used to specify the scaling policy.
-
behavior.scaleDown.policies.value (int32), required
Value contains the amount of change which is permitted by the policy. It must be greater than zero
-
behavior.scaleDown.policies.periodSeconds (int32), required
PeriodSeconds specifies the window of time for which the policy should hold true. PeriodSeconds must be greater than zero and less than or equal to 1800 (30 min).
-
-
behavior.scaleDown.selectPolicy (string)
selectPolicy is used to specify which policy should be used. If not set, the default value MaxPolicySelect is used.
-
behavior.scaleDown.stabilizationWindowSeconds (int32)
StabilizationWindowSeconds is the number of seconds for which past recommendations should be considered while scaling up or scaling down. StabilizationWindowSeconds must be greater than or equal to zero and less than or equal to 3600 (one hour). If not set, the default values are used: for scale up, 0 (i.e. no stabilization is done); for scale down, 300 (i.e. the stabilization window is 300 seconds long).
-
-
behavior.scaleUp (HPAScalingRules)
scaleUp is the scaling policy for scaling up. If not set, the default value is the higher of:
- increase no more than 4 pods per 60 seconds
- double the number of pods per 60 seconds
No stabilization is used.
HPAScalingRules configures the scaling behavior for one direction. These rules are applied after calculating DesiredReplicas from metrics for the HPA. They can limit the scaling velocity by specifying scaling policies. They can prevent flapping by specifying the stabilization window, so that the number of replicas is not set instantly; instead, the safest value from the stabilization window is chosen.
-
behavior.scaleUp.policies ([]HPAScalingPolicy)
policies is a list of potential scaling policies which can be used during scaling. At least one policy must be specified, otherwise the HPAScalingRules will be discarded as invalid.
HPAScalingPolicy is a single policy which must hold true for a specified past interval.
-
behavior.scaleUp.policies.type (string), required
Type is used to specify the scaling policy.
-
behavior.scaleUp.policies.value (int32), required
Value contains the amount of change which is permitted by the policy. It must be greater than zero
-
behavior.scaleUp.policies.periodSeconds (int32), required
PeriodSeconds specifies the window of time for which the policy should hold true. PeriodSeconds must be greater than zero and less than or equal to 1800 (30 min).
-
-
behavior.scaleUp.selectPolicy (string)
selectPolicy is used to specify which policy should be used. If not set, the default value MaxPolicySelect is used.
-
behavior.scaleUp.stabilizationWindowSeconds (int32)
StabilizationWindowSeconds is the number of seconds for which past recommendations should be considered while scaling up or scaling down. StabilizationWindowSeconds must be greater than or equal to zero and less than or equal to 3600 (one hour). If not set, the default values are used: for scale up, 0 (i.e. no stabilization is done); for scale down, 300 (i.e. the stabilization window is 300 seconds long).
-
-
metrics ([]MetricSpec)
metrics contains the specifications used to calculate the desired replica count (the maximum replica count across all metrics will be used). The desired replica count is calculated by multiplying the ratio between the target value and the current value by the current number of pods. Ergo, metrics used must decrease as the pod count is increased, and vice versa. See the individual metric source types for more information about how each type of metric must respond. If not set, the default metric will be set to 80% average CPU utilization.
MetricSpec specifies how to scale based on a single metric (only type and one other matching field should be set at once).
-
metrics.type (string), required
type is the type of metric source. It should be one of "ContainerResource", "External", "Object", "Pods" or "Resource", each mapping to a matching field in the object. Note: "ContainerResource" type is available only when the HPAContainerMetrics feature gate is enabled.
-
metrics.containerResource (ContainerResourceMetricSource)
container resource refers to a resource metric (such as those specified in requests and limits) known to Kubernetes describing a single container in each pod of the current scale target (e.g. CPU or memory). Such metrics are built in to Kubernetes, and have special scaling options on top of those available to normal per-pod metrics using the "pods" source. This is an alpha feature and can be enabled by the HPAContainerMetrics feature flag.
ContainerResourceMetricSource indicates how to scale on a resource metric known to Kubernetes, as specified in requests and limits, describing each pod in the current scale target (e.g. CPU or memory). The values will be averaged together before being compared to the target. Such metrics are built in to Kubernetes, and have special scaling options on top of those available to normal per-pod metrics using the "pods" source. Only one "target" type should be set.
-
metrics.containerResource.container (string), required
container is the name of the container in the pods of the scaling target
-
metrics.containerResource.name (string), required
name is the name of the resource in question.
-
metrics.containerResource.target (MetricTarget), required
target specifies the target value for the given metric
MetricTarget defines the target value, average value, or average utilization of a specific metric
-
metrics.containerResource.target.type (string), required
type represents whether the metric type is Utilization, Value, or AverageValue
-
metrics.containerResource.target.averageUtilization (int32)
averageUtilization is the target value of the average of the resource metric across all relevant pods, represented as a percentage of the requested value of the resource for the pods. Currently only valid for Resource metric source type
-
metrics.containerResource.target.averageValue (Quantity)
averageValue is the target value of the average of the metric across all relevant pods (as a quantity)
-
metrics.containerResource.target.value (Quantity)
value is the target value of the metric (as a quantity).
-
-
-
metrics.external (ExternalMetricSource)
external refers to a global metric that is not associated with any Kubernetes object. It allows autoscaling based on information coming from components running outside of cluster (for example length of queue in cloud messaging service, or QPS from loadbalancer running outside of cluster).
ExternalMetricSource indicates how to scale on a metric not associated with any Kubernetes object (for example length of queue in cloud messaging service, or QPS from loadbalancer running outside of cluster).
-
metrics.external.metric (MetricIdentifier), required
metric identifies the target metric by name and selector
MetricIdentifier defines the name and optionally selector for a metric
-
metrics.external.metric.name (string), required
name is the name of the given metric
-
metrics.external.metric.selector (LabelSelector)
selector is the string-encoded form of a standard kubernetes label selector for the given metric. When set, it is passed as an additional parameter to the metrics server for more specific metrics scoping. When unset, just the metricName will be used to gather metrics.
-
-
metrics.external.target (MetricTarget), required
target specifies the target value for the given metric
MetricTarget defines the target value, average value, or average utilization of a specific metric
-
metrics.external.target.type (string), required
type represents whether the metric type is Utilization, Value, or AverageValue
-
metrics.external.target.averageUtilization (int32)
averageUtilization is the target value of the average of the resource metric across all relevant pods, represented as a percentage of the requested value of the resource for the pods. Currently only valid for Resource metric source type
-
metrics.external.target.averageValue (Quantity)
averageValue is the target value of the average of the metric across all relevant pods (as a quantity)
-
metrics.external.target.value (Quantity)
value is the target value of the metric (as a quantity).
-
-
-
metrics.object (ObjectMetricSource)
object refers to a metric describing a single kubernetes object (for example, hits-per-second on an Ingress object).
ObjectMetricSource indicates how to scale on a metric describing a kubernetes object (for example, hits-per-second on an Ingress object).
-
metrics.object.describedObject (CrossVersionObjectReference), required
CrossVersionObjectReference contains enough information to let you identify the referred resource.
-
metrics.object.describedObject.kind (string), required
Kind of the referent; More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds
-
metrics.object.describedObject.name (string), required
Name of the referent; More info: http://kubernetes.io/docs/user-guide/identifiers#names
-
metrics.object.describedObject.apiVersion (string)
API version of the referent
-
-
metrics.object.metric (MetricIdentifier), required
metric identifies the target metric by name and selector
MetricIdentifier defines the name and optionally selector for a metric
-
metrics.object.metric.name (string), required
name is the name of the given metric
-
metrics.object.metric.selector (LabelSelector)
selector is the string-encoded form of a standard kubernetes label selector for the given metric. When set, it is passed as an additional parameter to the metrics server for more specific metrics scoping. When unset, just the metricName will be used to gather metrics.
-
-
metrics.object.target (MetricTarget), required
target specifies the target value for the given metric
MetricTarget defines the target value, average value, or average utilization of a specific metric
-
metrics.object.target.type (string), required
type represents whether the metric type is Utilization, Value, or AverageValue
-
metrics.object.target.averageUtilization (int32)
averageUtilization is the target value of the average of the resource metric across all relevant pods, represented as a percentage of the requested value of the resource for the pods. Currently only valid for Resource metric source type
-
metrics.object.target.averageValue (Quantity)
averageValue is the target value of the average of the metric across all relevant pods (as a quantity)
-
metrics.object.target.value (Quantity)
value is the target value of the metric (as a quantity).
-
-
-
metrics.pods (PodsMetricSource)
pods refers to a metric describing each pod in the current scale target (for example, transactions-processed-per-second). The values will be averaged together before being compared to the target value.
PodsMetricSource indicates how to scale on a metric describing each pod in the current scale target (for example, transactions-processed-per-second). The values will be averaged together before being compared to the target value.
-
metrics.pods.metric (MetricIdentifier), required
metric identifies the target metric by name and selector
MetricIdentifier defines the name and optionally selector for a metric
-
metrics.pods.metric.name (string), required
name is the name of the given metric
-
metrics.pods.metric.selector (LabelSelector)
selector is the string-encoded form of a standard kubernetes label selector for the given metric. When set, it is passed as an additional parameter to the metrics server for more specific metrics scoping. When unset, just the metricName will be used to gather metrics.
-
-
metrics.pods.target (MetricTarget), required
target specifies the target value for the given metric
MetricTarget defines the target value, average value, or average utilization of a specific metric
-
metrics.pods.target.type (string), required
type represents whether the metric type is Utilization, Value, or AverageValue
-
metrics.pods.target.averageUtilization (int32)
averageUtilization is the target value of the average of the resource metric across all relevant pods, represented as a percentage of the requested value of the resource for the pods. Currently only valid for Resource metric source type
-
metrics.pods.target.averageValue (Quantity)
averageValue is the target value of the average of the metric across all relevant pods (as a quantity)
-
metrics.pods.target.value (Quantity)
value is the target value of the metric (as a quantity).
-
-
-
metrics.resource (ResourceMetricSource)
resource refers to a resource metric (such as those specified in requests and limits) known to Kubernetes describing each pod in the current scale target (e.g. CPU or memory). Such metrics are built in to Kubernetes, and have special scaling options on top of those available to normal per-pod metrics using the "pods" source.
ResourceMetricSource indicates how to scale on a resource metric known to Kubernetes, as specified in requests and limits, describing each pod in the current scale target (e.g. CPU or memory). The values will be averaged together before being compared to the target. Such metrics are built in to Kubernetes, and have special scaling options on top of those available to normal per-pod metrics using the "pods" source. Only one "target" type should be set.
-
metrics.resource.name (string), required
name is the name of the resource in question.
-
metrics.resource.target (MetricTarget), required
target specifies the target value for the given metric
MetricTarget defines the target value, average value, or average utilization of a specific metric
-
metrics.resource.target.type (string), required
type represents whether the metric type is Utilization, Value, or AverageValue
-
metrics.resource.target.averageUtilization (int32)
averageUtilization is the target value of the average of the resource metric across all relevant pods, represented as a percentage of the requested value of the resource for the pods. Currently only valid for Resource metric source type
-
metrics.resource.target.averageValue (Quantity)
averageValue is the target value of the average of the metric across all relevant pods (as a quantity)
-
metrics.resource.target.value (Quantity)
value is the target value of the metric (as a quantity).
-
-
-
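The following sketch shows a metrics list exercising all five source types side by side; every name, label, quantity, and target below is an illustrative placeholder, and the ContainerResource entry additionally assumes the HPAContainerMetrics feature gate is enabled:

spec:
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization            # MetricTarget type: Utilization, Value, or AverageValue
        averageUtilization: 80
  - type: ContainerResource
    containerResource:
      name: memory
      container: app                 # placeholder container name in the target pods
      target:
        type: AverageValue
        averageValue: 500Mi
  - type: Pods
    pods:
      metric:
        name: transactions-processed-per-second   # placeholder custom metric
      target:
        type: AverageValue
        averageValue: "100"
  - type: Object
    object:
      describedObject:
        apiVersion: networking.k8s.io/v1
        kind: Ingress
        name: main-ingress           # placeholder object reference
      metric:
        name: hits-per-second        # placeholder metric
      target:
        type: Value
        value: "10k"
  - type: External
    external:
      metric:
        name: queue_messages_ready   # placeholder external metric
        selector:
          matchLabels:
            queue: worker_tasks      # placeholder metric selector
      target:
        type: AverageValue
        averageValue: "30"

Since the maximum replica count across all metrics wins, listing several sources makes the autoscaler react to whichever metric is under the most pressure.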
HorizontalPodAutoscalerStatus
HorizontalPodAutoscalerStatus describes the current status of a horizontal pod autoscaler.
-
currentReplicas (int32), required
currentReplicas is current number of replicas of pods managed by this autoscaler, as last seen by the autoscaler.
-
desiredReplicas (int32), required
desiredReplicas is the desired number of replicas of pods managed by this autoscaler, as last calculated by the autoscaler.
-
conditions ([]HorizontalPodAutoscalerCondition)
conditions is the set of conditions required for this autoscaler to scale its target, and indicates whether or not those conditions are met.
HorizontalPodAutoscalerCondition describes the state of a HorizontalPodAutoscaler at a certain point.
-
conditions.status (string), required
status is the status of the condition (True, False, Unknown)
-
conditions.type (string), required
type describes the current condition
-
conditions.lastTransitionTime (Time)
lastTransitionTime is the last time the condition transitioned from one status to another
Time is a wrapper around time.Time which supports correct marshaling to YAML and JSON. Wrappers are provided for many of the factory methods that the time package offers.
-
conditions.message (string)
message is a human-readable explanation containing details about the transition
-
conditions.reason (string)
reason is the reason for the condition's last transition.
-
-
currentMetrics ([]MetricStatus)
currentMetrics is the last read state of the metrics used by this autoscaler.
MetricStatus describes the last-read state of a single metric.
-
currentMetrics.type (string), required
type is the type of metric source. It will be one of "ContainerResource", "External", "Object", "Pods" or "Resource", each corresponding to a matching field in the object. Note: the "ContainerResource" type is available only when the HPAContainerMetrics feature gate is enabled.
-
currentMetrics.containerResource (ContainerResourceMetricStatus)
container resource refers to a resource metric (such as those specified in requests and limits) known to Kubernetes describing a single container in each pod in the current scale target (e.g. CPU or memory). Such metrics are built in to Kubernetes, and have special scaling options on top of those available to normal per-pod metrics using the "pods" source.
ContainerResourceMetricStatus indicates the current value of a resource metric known to Kubernetes, as specified in requests and limits, describing a single container in each pod in the current scale target (e.g. CPU or memory). Such metrics are built in to Kubernetes, and have special scaling options on top of those available to normal per-pod metrics using the "pods" source.
-
currentMetrics.containerResource.container (string), required
Container is the name of the container in the pods of the scaling target
-
currentMetrics.containerResource.current (MetricValueStatus), required
current contains the current value for the given metric
MetricValueStatus holds the current value for a metric
-
currentMetrics.containerResource.current.averageUtilization (int32)
averageUtilization is the current value of the average of the resource metric across all relevant pods, represented as a percentage of the requested value of the resource for the pods.
-
currentMetrics.containerResource.current.averageValue (Quantity)
averageValue is the current value of the average of the metric across all relevant pods (as a quantity)
-
currentMetrics.containerResource.current.value (Quantity)
value is the current value of the metric (as a quantity).
-
-
currentMetrics.containerResource.name (string), required
Name is the name of the resource in question.
-
-
currentMetrics.external (ExternalMetricStatus)
external refers to a global metric that is not associated with any Kubernetes object. It allows autoscaling based on information coming from components running outside of cluster (for example length of queue in cloud messaging service, or QPS from loadbalancer running outside of cluster).
ExternalMetricStatus indicates the current value of a global metric not associated with any Kubernetes object.
-
currentMetrics.external.current (MetricValueStatus), required
current contains the current value for the given metric
MetricValueStatus holds the current value for a metric
-
currentMetrics.external.current.averageUtilization (int32)
averageUtilization is the current value of the average of the resource metric across all relevant pods, represented as a percentage of the requested value of the resource for the pods.
-
currentMetrics.external.current.averageValue (Quantity)
averageValue is the current value of the average of the metric across all relevant pods (as a quantity)
-
currentMetrics.external.current.value (Quantity)
value is the current value of the metric (as a quantity).
-
-
currentMetrics.external.metric (MetricIdentifier), required
metric identifies the target metric by name and selector
MetricIdentifier defines the name and optionally selector for a metric
-
currentMetrics.external.metric.name (string), required
name is the name of the given metric
-
currentMetrics.external.metric.selector (LabelSelector)
selector is the string-encoded form of a standard kubernetes label selector for the given metric. When set, it is passed as an additional parameter to the metrics server for more specific metrics scoping. When unset, just the metricName will be used to gather metrics.
-
-
-
currentMetrics.object (ObjectMetricStatus)
object refers to a metric describing a single kubernetes object (for example, hits-per-second on an Ingress object).
ObjectMetricStatus indicates the current value of a metric describing a kubernetes object (for example, hits-per-second on an Ingress object).
-
currentMetrics.object.current (MetricValueStatus), required
current contains the current value for the given metric
MetricValueStatus holds the current value for a metric
-
currentMetrics.object.current.averageUtilization (int32)
averageUtilization is the current value of the average of the resource metric across all relevant pods, represented as a percentage of the requested value of the resource for the pods.
-
currentMetrics.object.current.averageValue (Quantity)
averageValue is the current value of the average of the metric across all relevant pods (as a quantity)
-
currentMetrics.object.current.value (Quantity)
value is the current value of the metric (as a quantity).
-
-
currentMetrics.object.describedObject (CrossVersionObjectReference), required
CrossVersionObjectReference contains enough information to let you identify the referred resource.
-
currentMetrics.object.describedObject.kind (string), required
Kind of the referent; More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds
-
currentMetrics.object.describedObject.name (string), required
Name of the referent; More info: http://kubernetes.io/docs/user-guide/identifiers#names
-
currentMetrics.object.describedObject.apiVersion (string)
API version of the referent
-
-
currentMetrics.object.metric (MetricIdentifier), required
metric identifies the target metric by name and selector
MetricIdentifier defines the name and optionally selector for a metric
-
currentMetrics.object.metric.name (string), required
name is the name of the given metric
-
currentMetrics.object.metric.selector (LabelSelector)
selector is the string-encoded form of a standard kubernetes label selector for the given metric. When set, it is passed as an additional parameter to the metrics server for more specific metrics scoping. When unset, just the metricName will be used to gather metrics.
-
-
-
currentMetrics.pods (PodsMetricStatus)
pods refers to a metric describing each pod in the current scale target (for example, transactions-processed-per-second). The values will be averaged together before being compared to the target value.
PodsMetricStatus indicates the current value of a metric describing each pod in the current scale target (for example, transactions-processed-per-second).
-
currentMetrics.pods.current (MetricValueStatus), required
current contains the current value for the given metric
MetricValueStatus holds the current value for a metric
-
currentMetrics.pods.current.averageUtilization (int32)
averageUtilization is the current value of the average of the resource metric across all relevant pods, represented as a percentage of the requested value of the resource for the pods.
-
currentMetrics.pods.current.averageValue (Quantity)
averageValue is the current value of the average of the metric across all relevant pods (as a quantity)
-
currentMetrics.pods.current.value (Quantity)
value is the current value of the metric (as a quantity).
-
-
currentMetrics.pods.metric (MetricIdentifier), required
metric identifies the target metric by name and selector
MetricIdentifier defines the name and optionally selector for a metric
-
currentMetrics.pods.metric.name (string), required
name is the name of the given metric
-
currentMetrics.pods.metric.selector (LabelSelector)
selector is the string-encoded form of a standard kubernetes label selector for the given metric. When set, it is passed as an additional parameter to the metrics server for more specific metrics scoping. When unset, just the metricName will be used to gather metrics.
-
-
-
currentMetrics.resource (ResourceMetricStatus)
resource refers to a resource metric (such as those specified in requests and limits) known to Kubernetes describing each pod in the current scale target (e.g. CPU or memory). Such metrics are built in to Kubernetes, and have special scaling options on top of those available to normal per-pod metrics using the "pods" source.
ResourceMetricStatus indicates the current value of a resource metric known to Kubernetes, as specified in requests and limits, describing each pod in the current scale target (e.g. CPU or memory). Such metrics are built in to Kubernetes, and have special scaling options on top of those available to normal per-pod metrics using the "pods" source.
-
currentMetrics.resource.current (MetricValueStatus), required
current contains the current value for the given metric
MetricValueStatus holds the current value for a metric
-
currentMetrics.resource.current.averageUtilization (int32)
averageUtilization is the current value of the average of the resource metric across all relevant pods, represented as a percentage of the requested value of the resource for the pods.
-
currentMetrics.resource.current.averageValue (Quantity)
averageValue is the current value of the average of the metric across all relevant pods (as a quantity)
-
currentMetrics.resource.current.value (Quantity)
value is the current value of the metric (as a quantity).
-
-
currentMetrics.resource.name (string), required
Name is the name of the resource in question.
-
-
-
lastScaleTime (Time)
lastScaleTime is the last time the HorizontalPodAutoscaler scaled the number of pods, used by the autoscaler to control how often the number of pods is changed.
Time is a wrapper around time.Time which supports correct marshaling to YAML and JSON. Wrappers are provided for many of the factory methods that the time package offers.
-
observedGeneration (int64)
observedGeneration is the most recent generation observed by this autoscaler.
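The status fields above are easiest to read with kubectl. Assuming a HorizontalPodAutoscaler named web-hpa (a placeholder), either of the following shows currentReplicas, desiredReplicas, currentMetrics, and the conditions list:

kubectl get hpa web-hpa -o yaml    # full object, including .status
kubectl describe hpa web-hpa       # human-readable view with Conditions and Events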
HorizontalPodAutoscalerList
HorizontalPodAutoscalerList is a list of horizontal pod autoscaler objects.
-
apiVersion: autoscaling/v2beta2
-
kind: HorizontalPodAutoscalerList
-
metadata (ListMeta)
metadata is the standard list metadata.
-
items ([]HorizontalPodAutoscaler), required
items is the list of horizontal pod autoscaler objects.
Operations
get
read the specified HorizontalPodAutoscaler
HTTP Request
GET /apis/autoscaling/v2beta2/namespaces/{namespace}/horizontalpodautoscalers/{name}
Parameters
-
name (in path): string, required
name of the HorizontalPodAutoscaler
-
namespace (in path): string, required
-
pretty (in query): string
Response
200 (HorizontalPodAutoscaler): OK
401: Unauthorized
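As a minimal sketch of calling this endpoint directly (the namespace default and the name web-hpa are placeholders), one convenient approach is to go through kubectl proxy, which handles authentication against the API server:

kubectl proxy --port=8001 &
curl http://localhost:8001/apis/autoscaling/v2beta2/namespaces/default/horizontalpodautoscalers/web-hpa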
get
read status of the specified HorizontalPodAutoscaler
HTTP Request
GET /apis/autoscaling/v2beta2/namespaces/{namespace}/horizontalpodautoscalers/{name}/status
Parameters
-
name (in path): string, required
name of the HorizontalPodAutoscaler
-
namespace (in path): string, required
-
pretty (in query): string
Response
200 (HorizontalPodAutoscaler): OK
401: Unauthorized
list
list or watch objects of kind HorizontalPodAutoscaler
HTTP Request
GET /apis/autoscaling/v2beta2/namespaces/{namespace}/horizontalpodautoscalers
Parameters
-
namespace (in path): string, required
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (HorizontalPodAutoscalerList): OK
401: Unauthorized
list
list or watch objects of kind HorizontalPodAutoscaler
HTTP Request
GET /apis/autoscaling/v2beta2/horizontalpodautoscalers
Parameters
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (HorizontalPodAutoscalerList): OK
401: Unauthorized
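The query parameters above map directly onto the request URL. For example, again going through kubectl proxy, a label-filtered and size-limited list across all namespaces might look like this (the label app=web is a placeholder; %3D is the URL-encoded "="):

curl "http://localhost:8001/apis/autoscaling/v2beta2/horizontalpodautoscalers?labelSelector=app%3Dweb&limit=10"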
create
create a HorizontalPodAutoscaler
HTTP Request
POST /apis/autoscaling/v2beta2/namespaces/{namespace}/horizontalpodautoscalers
Parameters
-
namespace (in path): string, required
-
body: HorizontalPodAutoscaler, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (HorizontalPodAutoscaler): OK
201 (HorizontalPodAutoscaler): Created
202 (HorizontalPodAutoscaler): Accepted
401: Unauthorized
update
replace the specified HorizontalPodAutoscaler
HTTP Request
PUT /apis/autoscaling/v2beta2/namespaces/{namespace}/horizontalpodautoscalers/{name}
Parameters
-
name (in path): string, required
name of the HorizontalPodAutoscaler
-
namespace (in path): string, required
-
body: HorizontalPodAutoscaler, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (HorizontalPodAutoscaler): OK
201 (HorizontalPodAutoscaler): Created
401: Unauthorized
update
replace status of the specified HorizontalPodAutoscaler
HTTP Request
PUT /apis/autoscaling/v2beta2/namespaces/{namespace}/horizontalpodautoscalers/{name}/status
Parameters
-
name (in path): string, required
name of the HorizontalPodAutoscaler
-
namespace (in path): string, required
-
body: HorizontalPodAutoscaler, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (HorizontalPodAutoscaler): OK
201 (HorizontalPodAutoscaler): Created
401: Unauthorized
patch
partially update the specified HorizontalPodAutoscaler
HTTP Request
PATCH /apis/autoscaling/v2beta2/namespaces/{namespace}/horizontalpodautoscalers/{name}
Parameters
-
name (in path): string, required
name of the HorizontalPodAutoscaler
-
namespace (in path): string, required
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (HorizontalPodAutoscaler): OK
201 (HorizontalPodAutoscaler): Created
401: Unauthorized
patch
partially update status of the specified HorizontalPodAutoscaler
HTTP Request
PATCH /apis/autoscaling/v2beta2/namespaces/{namespace}/horizontalpodautoscalers/{name}/status
Parameters
-
name (in path): string, required
name of the HorizontalPodAutoscaler
-
namespace (in path): string, required
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (HorizontalPodAutoscaler): OK
201 (HorizontalPodAutoscaler): Created
401: Unauthorized
delete
delete a HorizontalPodAutoscaler
HTTP Request
DELETE /apis/autoscaling/v2beta2/namespaces/{namespace}/horizontalpodautoscalers/{name}
Parameters
-
name (in path): string, required
name of the HorizontalPodAutoscaler
-
namespace (in path): string, required
-
body: DeleteOptions
-
dryRun (in query): string
-
gracePeriodSeconds (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
Response
200 (Status): OK
202 (Status): Accepted
401: Unauthorized
deletecollection
delete collection of HorizontalPodAutoscaler
HTTP Request
DELETE /apis/autoscaling/v2beta2/namespaces/{namespace}/horizontalpodautoscalers
Parameters
-
namespace (in path): string, required
-
body: DeleteOptions
-
continue (in query): string
-
dryRun (in query): string
-
fieldSelector (in query): string
-
gracePeriodSeconds (in query): integer
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
Response
200 (Status): OK
401: Unauthorized
6.5.1.14 - PriorityClass
apiVersion: scheduling.k8s.io/v1
import "k8s.io/api/scheduling/v1"
PriorityClass
PriorityClass defines a mapping from a priority class name to the priority integer value. The value can be any valid integer.
-
apiVersion: scheduling.k8s.io/v1
-
kind: PriorityClass
-
metadata (ObjectMeta)
Standard object's metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
value (int32), required
The value of this priority class. This is the actual priority that pods receive when they have the name of this class in their pod spec.
-
description (string)
description is an arbitrary string that usually provides guidelines on when this priority class should be used.
-
globalDefault (boolean)
globalDefault specifies whether this PriorityClass should be considered as the default priority for pods that do not have any priority class. Only one PriorityClass can be marked as globalDefault. However, if more than one PriorityClass exists with its globalDefault field set to true, the smallest value of such global default PriorityClasses will be used as the default priority.
-
preemptionPolicy (string)
PreemptionPolicy is the Policy for preempting pods with lower priority. One of Never, PreemptLowerPriority. Defaults to PreemptLowerPriority if unset. This field is beta-level, gated by the NonPreemptingPriority feature-gate.
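A minimal manifest illustrating these fields (the class name, value, and pod are placeholders), followed by how a Pod opts in via priorityClassName:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority                      # placeholder name
value: 1000000                             # the integer priority pods of this class receive
globalDefault: false
description: "Use for latency-critical service pods only."   # placeholder guidance
preemptionPolicy: PreemptLowerPriority     # the default; "Never" disables preemption
---
apiVersion: v1
kind: Pod
metadata:
  name: important-app                      # placeholder pod
spec:
  priorityClassName: high-priority         # references the PriorityClass above
  containers:
  - name: app
    image: nginx                           # placeholder image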
PriorityClassList
PriorityClassList is a collection of priority classes.
-
apiVersion: scheduling.k8s.io/v1
-
kind: PriorityClassList
-
metadata (ListMeta)
Standard list metadata More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
items ([]PriorityClass), required
items is the list of PriorityClasses
Operations
get
read the specified PriorityClass
HTTP Request
GET /apis/scheduling.k8s.io/v1/priorityclasses/{name}
Parameters
-
name (in path): string, required
name of the PriorityClass
-
pretty (in query): string
Response
200 (PriorityClass): OK
401: Unauthorized
list
list or watch objects of kind PriorityClass
HTTP Request
GET /apis/scheduling.k8s.io/v1/priorityclasses
Parameters
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (PriorityClassList): OK
401: Unauthorized
create
create a PriorityClass
HTTP Request
POST /apis/scheduling.k8s.io/v1/priorityclasses
Parameters
-
body: PriorityClass, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (PriorityClass): OK
201 (PriorityClass): Created
202 (PriorityClass): Accepted
401: Unauthorized
update
replace the specified PriorityClass
HTTP Request
PUT /apis/scheduling.k8s.io/v1/priorityclasses/{name}
Parameters
-
name (in path): string, required
name of the PriorityClass
-
body: PriorityClass, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (PriorityClass): OK
201 (PriorityClass): Created
401: Unauthorized
patch
partially update the specified PriorityClass
HTTP Request
PATCH /apis/scheduling.k8s.io/v1/priorityclasses/{name}
Parameters
-
name (in path): string, required
name of the PriorityClass
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (PriorityClass): OK
201 (PriorityClass): Created
401: Unauthorized
delete
delete a PriorityClass
HTTP Request
DELETE /apis/scheduling.k8s.io/v1/priorityclasses/{name}
Parameters
-
name (in path): string, required
name of the PriorityClass
-
body: DeleteOptions
-
dryRun (in query): string
-
gracePeriodSeconds (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
Response
200 (Status): OK
202 (Status): Accepted
401: Unauthorized
deletecollection
delete collection of PriorityClass
HTTP Request
DELETE /apis/scheduling.k8s.io/v1/priorityclasses
Parameters
-
body: DeleteOptions
-
continue (in query): string
-
dryRun (in query): string
-
fieldSelector (in query): string
-
gracePeriodSeconds (in query): integer
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
Response
200 (Status): OK
401: Unauthorized
6.5.2 - Service Resources
6.5.2.1 - Service
apiVersion: v1
import "k8s.io/api/core/v1"
Service
Service is a named abstraction of a software service (for example, mysql) consisting of a local port (for example 3306) that the proxy listens on, and the selector that determines which pods will answer requests sent through the proxy.
-
apiVersion: v1
-
kind: Service
-
metadata (ObjectMeta)
Standard object's metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
spec (ServiceSpec)
Spec defines the behavior of a service. https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#spec-and-status
-
status (ServiceStatus)
Most recently observed status of the service. Populated by the system. Read-only. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#spec-and-status
ServiceSpec
ServiceSpec describes the attributes that a user creates on a service.
-
selector (map[string]string)
Route service traffic to pods with label keys and values matching this selector. If empty or not present, the service is assumed to have an external process managing its endpoints, which Kubernetes will not modify. Only applies to types ClusterIP, NodePort, and LoadBalancer. Ignored if type is ExternalName. More info: https://kubernetes.io/docs/concepts/services-networking/service/
-
ports ([]ServicePort)
Patch strategy: merge on key port
Map: unique values on keys port, protocol will be kept during a merge
The list of ports that are exposed by this service. More info: https://kubernetes.io/docs/concepts/services-networking/service/#virtual-ips-and-service-proxies
ServicePort contains information on service's port.
-
ports.port (int32), required
The port that will be exposed by this service.
-
ports.targetPort (IntOrString)
Number or name of the port to access on the pods targeted by the service. Number must be in the range 1 to 65535. Name must be an IANA_SVC_NAME. If this is a string, it will be looked up as a named port in the target Pod's container ports. If this is not specified, the value of the 'port' field is used (an identity map). This field is ignored for services with clusterIP=None, and should be omitted or set equal to the 'port' field. More info: https://kubernetes.io/docs/concepts/services-networking/service/#defining-a-service
IntOrString is a type that can hold an int32 or a string. When used in JSON or YAML marshalling and unmarshalling, it produces or consumes the inner type. This allows you to have, for example, a JSON field that can accept a name or number.
-
ports.protocol (string)
The IP protocol for this port. Supports "TCP", "UDP", and "SCTP". Default is TCP.
Possible enum values:
"SCTP"
is the SCTP protocol."TCP"
is the TCP protocol."UDP"
is the UDP protocol.
-
ports.name (string)
The name of this port within the service. This must be a DNS_LABEL. All ports within a ServiceSpec must have unique names. When considering the endpoints for a Service, this must match the 'name' field in the EndpointPort. Optional if only one ServicePort is defined on this service.
-
ports.nodePort (int32)
The port on each node on which this service is exposed when type is NodePort or LoadBalancer. Usually assigned by the system. If a value is specified, in-range, and not in use it will be used, otherwise the operation will fail. If not specified, a port will be allocated if this Service requires one. If this field is specified when creating a Service which does not need it, creation will fail. This field will be wiped when updating a Service to no longer need it (e.g. changing type from NodePort to ClusterIP). More info: https://kubernetes.io/docs/concepts/services-networking/service/#type-nodeport
-
ports.appProtocol (string)
The application protocol for this port. This field follows standard Kubernetes label syntax. Un-prefixed names are reserved for IANA standard service names (as per RFC-6335 and http://www.iana.org/assignments/service-names). Non-standard protocols should use prefixed names such as mycompany.com/my-custom-protocol.
-
-
type (string)
type determines how the Service is exposed. Defaults to ClusterIP. Valid options are ExternalName, ClusterIP, NodePort, and LoadBalancer. "ClusterIP" allocates a cluster-internal IP address for load-balancing to endpoints. Endpoints are determined by the selector or if that is not specified, by manual construction of an Endpoints object or EndpointSlice objects. If clusterIP is "None", no virtual IP is allocated and the endpoints are published as a set of endpoints rather than a virtual IP. "NodePort" builds on ClusterIP and allocates a port on every node which routes to the same endpoints as the clusterIP. "LoadBalancer" builds on NodePort and creates an external load-balancer (if supported in the current cloud) which routes to the same endpoints as the clusterIP. "ExternalName" aliases this service to the specified externalName. Several other fields do not apply to ExternalName services. More info: https://kubernetes.io/docs/concepts/services-networking/service/#publishing-services-service-types
Possible enum values:
"ClusterIP"
means a service will only be accessible inside the cluster, via the cluster IP."ExternalName"
means a service consists of only a reference to an external name that kubedns or equivalent will return as a CNAME record, with no exposing or proxying of any pods involved."LoadBalancer"
means a service will be exposed via an external load balancer (if the cloud provider supports it), in addition to 'NodePort' type."NodePort"
means a service will be exposed on one port of every node, in addition to 'ClusterIP' type.
-
ipFamilies ([]string)
Atomic: will be replaced during a merge
IPFamilies is a list of IP families (e.g. IPv4, IPv6) assigned to this service. This field is usually assigned automatically based on cluster configuration and the ipFamilyPolicy field. If this field is specified manually, the requested family is available in the cluster, and ipFamilyPolicy allows it, it will be used; otherwise creation of the service will fail. This field is conditionally mutable: it allows for adding or removing a secondary IP family, but it does not allow changing the primary IP family of the Service. Valid values are "IPv4" and "IPv6". This field only applies to Services of types ClusterIP, NodePort, and LoadBalancer, and does apply to "headless" services. This field will be wiped when updating a Service to type ExternalName.
This field may hold a maximum of two entries (dual-stack families, in either order). These families must correspond to the values of the clusterIPs field, if specified. Both clusterIPs and ipFamilies are governed by the ipFamilyPolicy field.
-
ipFamilyPolicy (string)
IPFamilyPolicy represents the dual-stack-ness requested or required by this Service. If there is no value provided, then this field will be set to SingleStack. Services can be "SingleStack" (a single IP family), "PreferDualStack" (two IP families on dual-stack configured clusters or a single IP family on single-stack clusters), or "RequireDualStack" (two IP families on dual-stack configured clusters, otherwise fail). The ipFamilies and clusterIPs fields depend on the value of this field. This field will be wiped when updating a service to type ExternalName.
-
clusterIP (string)
clusterIP is the IP address of the service and is usually assigned randomly. If an address is specified manually, is in-range (as per system configuration), and is not in use, it will be allocated to the service; otherwise creation of the service will fail. This field may not be changed through updates unless the type field is also being changed to ExternalName (which requires this field to be blank) or the type field is being changed from ExternalName (in which case this field may optionally be specified, as described above). Valid values are "None", empty string (""), or a valid IP address. Setting this to "None" makes a "headless service" (no virtual IP), which is useful when direct endpoint connections are preferred and proxying is not required. Only applies to types ClusterIP, NodePort, and LoadBalancer. If this field is specified when creating a Service of type ExternalName, creation will fail. This field will be wiped when updating a Service to type ExternalName. More info: https://kubernetes.io/docs/concepts/services-networking/service/#virtual-ips-and-service-proxies
-
clusterIPs ([]string)
Atomic: will be replaced during a merge
ClusterIPs is a list of IP addresses assigned to this service, and are usually assigned randomly. If an address is specified manually, is in-range (as per system configuration), and is not in use, it will be allocated to the service; otherwise creation of the service will fail. This field may not be changed through updates unless the type field is also being changed to ExternalName (which requires this field to be empty) or the type field is being changed from ExternalName (in which case this field may optionally be specified, as described above). Valid values are "None", empty string (""), or a valid IP address. Setting this to "None" makes a "headless service" (no virtual IP), which is useful when direct endpoint connections are preferred and proxying is not required. Only applies to types ClusterIP, NodePort, and LoadBalancer. If this field is specified when creating a Service of type ExternalName, creation will fail. This field will be wiped when updating a Service to type ExternalName. If this field is not specified, it will be initialized from the clusterIP field. If this field is specified, clients must ensure that clusterIPs[0] and clusterIP have the same value.
This field may hold a maximum of two entries (dual-stack IPs, in either order). These IPs must correspond to the values of the ipFamilies field. Both clusterIPs and ipFamilies are governed by the ipFamilyPolicy field. More info: https://kubernetes.io/docs/concepts/services-networking/service/#virtual-ips-and-service-proxies
-
externalIPs ([]string)
externalIPs is a list of IP addresses for which nodes in the cluster will also accept traffic for this service. These IPs are not managed by Kubernetes. The user is responsible for ensuring that traffic arrives at a node with this IP. A common example is external load-balancers that are not part of the Kubernetes system.
-
sessionAffinity (string)
Supports "ClientIP" and "None". Used to maintain session affinity. Enable client IP based session affinity. Must be ClientIP or None. Defaults to None. More info: https://kubernetes.io/docs/concepts/services-networking/service/#virtual-ips-and-service-proxies
Possible enum values:
"ClientIP"
is the Client IP based."None"
- no session affinity.
-
loadBalancerIP (string)
Only applies to Service Type: LoadBalancer. The LoadBalancer will be created with the IP specified in this field. This feature depends on whether the underlying cloud-provider supports specifying the loadBalancerIP when a load balancer is created. This field will be ignored if the cloud-provider does not support the feature.
-
loadBalancerSourceRanges ([]string)
If specified and supported by the platform, traffic through the cloud-provider load-balancer will be restricted to the specified client IPs. This field will be ignored if the cloud-provider does not support the feature. More info: https://kubernetes.io/docs/tasks/access-application-cluster/create-external-load-balancer/
-
loadBalancerClass (string)
loadBalancerClass is the class of the load balancer implementation this Service belongs to. If specified, the value of this field must be a label-style identifier, with an optional prefix, e.g. "internal-vip" or "example.com/internal-vip". Unprefixed names are reserved for end-users. This field can only be set when the Service type is 'LoadBalancer'. If not set, the default load balancer implementation is used, today this is typically done through the cloud provider integration, but should apply for any default implementation. If set, it is assumed that a load balancer implementation is watching for Services with a matching class. Any default load balancer implementation (e.g. cloud providers) should ignore Services that set this field. This field can only be set when creating or updating a Service to type 'LoadBalancer'. Once set, it can not be changed. This field will be wiped when a service is updated to a non 'LoadBalancer' type.
-
externalName (string)
externalName is the external reference that discovery mechanisms will return as an alias for this service (e.g. a DNS CNAME record). No proxying will be involved. Must be a lowercase RFC-1123 hostname (https://tools.ietf.org/html/rfc1123) and requires type to be "ExternalName".
-
externalTrafficPolicy (string)
externalTrafficPolicy denotes if this Service desires to route external traffic to node-local or cluster-wide endpoints. "Local" preserves the client source IP and avoids a second hop for LoadBalancer and NodePort type services, but risks potentially imbalanced traffic spreading. "Cluster" obscures the client source IP and may cause a second hop to another node, but should have good overall load-spreading.
Possible enum values:
"Cluster"
specifies node-global (legacy) behavior."Local"
specifies node-local endpoints behavior.
-
internalTrafficPolicy (string)
InternalTrafficPolicy specifies if the cluster internal traffic should be routed to all endpoints or node-local endpoints only. "Cluster" routes internal traffic to a Service to all endpoints. "Local" routes traffic to node-local endpoints only, traffic is dropped if no node-local endpoints are ready. The default value is "Cluster".
-
healthCheckNodePort (int32)
healthCheckNodePort specifies the healthcheck nodePort for the service. This only applies when type is set to LoadBalancer and externalTrafficPolicy is set to Local. If a value is specified, is in-range, and is not in use, it will be used. If not specified, a value will be automatically allocated. External systems (e.g. load-balancers) can use this port to determine if a given node holds endpoints for this service or not. If this field is specified when creating a Service which does not need it, creation will fail. This field will be wiped when updating a Service to no longer need it (e.g. changing type).
-
publishNotReadyAddresses (boolean)
publishNotReadyAddresses indicates that any agent which deals with endpoints for this Service should disregard any indications of ready/not-ready. The primary use case for setting this field is for a StatefulSet's Headless Service to propagate SRV DNS records for its Pods for the purpose of peer discovery. The Kubernetes controllers that generate Endpoints and EndpointSlice resources for Services interpret this to mean that all endpoints are considered "ready" even if the Pods themselves are not. Agents which consume only Kubernetes generated endpoints through the Endpoints or EndpointSlice resources can safely assume this behavior.
-
sessionAffinityConfig (SessionAffinityConfig)
sessionAffinityConfig contains the configurations of session affinity.
SessionAffinityConfig represents the configurations of session affinity.
-
sessionAffinityConfig.clientIP (ClientIPConfig)
clientIP contains the configurations of Client IP based session affinity.
ClientIPConfig represents the configurations of Client IP based session affinity.
-
sessionAffinityConfig.clientIP.timeoutSeconds (int32)
timeoutSeconds specifies the seconds of ClientIP type session sticky time. The value must be >0 && <=86400 (for 1 day) if ServiceAffinity == "ClientIP". Default value is 10800 (for 3 hours).
-
-
-
allocateLoadBalancerNodePorts (boolean)
allocateLoadBalancerNodePorts defines if NodePorts will be automatically allocated for services with type LoadBalancer. Default is "true". It may be set to "false" if the cluster load-balancer does not rely on NodePorts. If the caller requests specific NodePorts (by specifying a value), those requests will be respected, regardless of this field. This field may only be set for services with type LoadBalancer and will be cleared if the type is changed to any other type. This field is beta-level and is only honored by servers that enable the ServiceLBNodePortControl feature.
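A sketch pulling several of these fields together; the name, labels, and ports are placeholders. Note targetPort referencing a named container port (the IntOrString behavior described above), and that setting clusterIP: None instead would make this a headless Service:

apiVersion: v1
kind: Service
metadata:
  name: my-service             # placeholder name
spec:
  type: ClusterIP              # the default; NodePort, LoadBalancer, ExternalName also valid
  selector:
    app: web                   # placeholder label; routes traffic to matching Pods
  ports:
  - name: http
    port: 80                   # port exposed by the Service
    targetPort: http-web       # looked up as a named port in the target Pod's containers
    protocol: TCP              # the default
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800    # default sticky time: 3 hours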
ServiceStatus
ServiceStatus represents the current status of a service.
-
conditions ([]Condition)
Patch strategy: merge on key type
Map: unique values on key type will be kept during a merge
Current service state
Condition contains details for one aspect of the current state of this API Resource.
-
conditions.lastTransitionTime (Time), required
lastTransitionTime is the last time the condition transitioned from one status to another. This should be when the underlying condition changed. If that is not known, then using the time when the API field changed is acceptable.
Time is a wrapper around time.Time which supports correct marshaling to YAML and JSON. Wrappers are provided for many of the factory methods that the time package offers.
-
conditions.message (string), required
message is a human readable message indicating details about the transition. This may be an empty string.
-
conditions.reason (string), required
reason contains a programmatic identifier indicating the reason for the condition's last transition. Producers of specific condition types may define expected values and meanings for this field, and whether the values are considered a guaranteed API. The value should be a CamelCase string. This field may not be empty.
-
conditions.status (string), required
status of the condition, one of True, False, Unknown.
-
conditions.type (string), required
type of condition in CamelCase or in foo.example.com/CamelCase.
-
conditions.observedGeneration (int64)
observedGeneration represents the .metadata.generation that the condition was set based upon. For instance, if .metadata.generation is currently 12, but the .status.conditions[x].observedGeneration is 9, the condition is out of date with respect to the current state of the instance.
-
-
loadBalancer (LoadBalancerStatus)
LoadBalancer contains the current status of the load-balancer, if one is present.
LoadBalancerStatus represents the status of a load-balancer.
-
loadBalancer.ingress ([]LoadBalancerIngress)
Ingress is a list containing ingress points for the load-balancer. Traffic intended for the service should be sent to these ingress points.
LoadBalancerIngress represents the status of a load-balancer ingress point: traffic intended for the service should be sent to an ingress point.
-
loadBalancer.ingress.hostname (string)
Hostname is set for load-balancer ingress points that are DNS based (typically AWS load-balancers)
-
loadBalancer.ingress.ip (string)
IP is set for load-balancer ingress points that are IP based (typically GCE or OpenStack load-balancers)
-
loadBalancer.ingress.ports ([]PortStatus)
Atomic: will be replaced during a merge
Ports is a list of records of service ports. If used, every port defined in the service should have an entry in it.
-
loadBalancer.ingress.ports.port (int32), required
Port is the port number of the service port of which status is recorded here
-
loadBalancer.ingress.ports.protocol (string), required
Protocol is the protocol of the service port of which status is recorded here. The supported values are: "TCP", "UDP", "SCTP".
Possible enum values:
- "SCTP" is the SCTP protocol.
- "TCP" is the TCP protocol.
- "UDP" is the UDP protocol.
-
loadBalancer.ingress.ports.error (string)
Error is to record the problem with the service port. The format of the error shall comply with the following rules:
- built-in error values shall be specified in this file and those shall use CamelCase names
- cloud provider specific error values must have names that comply with the format foo.example.com/CamelCase.
-
-
-
ServiceList
ServiceList holds a list of services.
-
apiVersion: v1
-
kind: ServiceList
-
metadata (ListMeta)
Standard list metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds
-
items ([]Service), required
List of services
Operations
get
read the specified Service
HTTP Request
GET /api/v1/namespaces/{namespace}/services/{name}
Parameters
-
name (in path): string, required
name of the Service
-
namespace (in path): string, required
-
pretty (in query): string
Response
200 (Service): OK
401: Unauthorized
get
read status of the specified Service
HTTP Request
GET /api/v1/namespaces/{namespace}/services/{name}/status
Parameters
-
name (in path): string, required
name of the Service
-
namespace (in path): string, required
-
pretty (in query): string
Response
200 (Service): OK
401: Unauthorized
list
list or watch objects of kind Service
HTTP Request
GET /api/v1/namespaces/{namespace}/services
Parameters
-
namespace (in path): string, required
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (ServiceList): OK
401: Unauthorized
list
list or watch objects of kind Service
HTTP Request
GET /api/v1/services
Parameters
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (ServiceList): OK
401: Unauthorized
create
create a Service
HTTP Request
POST /api/v1/namespaces/{namespace}/services
Parameters
-
namespace (in path): string, required
-
body: Service, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (Service): OK
201 (Service): Created
202 (Service): Accepted
401: Unauthorized
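A sketch of exercising this endpoint through kubectl proxy; service.yaml is a placeholder file containing a Service manifest such as the one shown earlier, and default is a placeholder namespace:

kubectl proxy --port=8001 &
curl -X POST -H 'Content-Type: application/yaml' \
  --data-binary @service.yaml \
  http://localhost:8001/api/v1/namespaces/default/services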
update
replace the specified Service
HTTP Request
PUT /api/v1/namespaces/{namespace}/services/{name}
Parameters
-
name (in path): string, required
name of the Service
-
namespace (in path): string, required
-
body: Service, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (Service): OK
201 (Service): Created
401: Unauthorized
update
replace status of the specified Service
HTTP Request
PUT /api/v1/namespaces/{namespace}/services/{name}/status
Parameters
-
name (in path): string, required
name of the Service
-
namespace (in path): string, required
-
body: Service, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (Service): OK
201 (Service): Created
401: Unauthorized
patch
partially update the specified Service
HTTP Request
PATCH /api/v1/namespaces/{namespace}/services/{name}
Parameters
-
name (in path): string, required
name of the Service
-
namespace (in path): string, required
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (Service): OK
201 (Service): Created
401: Unauthorized
patch
partially update status of the specified Service
HTTP Request
PATCH /api/v1/namespaces/{namespace}/services/{name}/status
Parameters
-
name (in path): string, required
name of the Service
-
namespace (in path): string, required
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (Service): OK
201 (Service): Created
401: Unauthorized
delete
delete a Service
HTTP Request
DELETE /api/v1/namespaces/{namespace}/services/{name}
Parameters
-
name (in path): string, required
name of the Service
-
namespace (in path): string, required
-
body: DeleteOptions
-
dryRun (in query): string
-
gracePeriodSeconds (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
Response
200 (Service): OK
202 (Service): Accepted
401: Unauthorized
deletecollection
delete collection of Service
HTTP Request
DELETE /api/v1/namespaces/{namespace}/services
Parameters
-
namespace (in path): string, required
-
body: DeleteOptions
-
continue (in query): string
-
dryRun (in query): string
-
fieldSelector (in query): string
-
gracePeriodSeconds (in query): integer
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
Response
200 (Status): OK
401: Unauthorized
6.5.2.2 - Endpoints
apiVersion: v1
import "k8s.io/api/core/v1"
Endpoints
Endpoints is a collection of endpoints that implement the actual service. Example: Name: "mysvc", Subsets: [ { Addresses: [{"ip": "10.10.1.1"}, {"ip": "10.10.2.2"}], Ports: [{"name": "a", "port": 8675}, {"name": "b", "port": 309}] }, { Addresses: [{"ip": "10.10.3.3"}], Ports: [{"name": "a", "port": 93}, {"name": "b", "port": 76}] }, ]
-
apiVersion: v1
-
kind: Endpoints
-
metadata (ObjectMeta)
Standard object's metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
subsets ([]EndpointSubset)
The set of all endpoints is the union of all subsets. Addresses are placed into subsets according to the IPs they share. A single address with multiple ports, some of which are ready and some of which are not (because they come from different containers) will result in the address being displayed in different subsets for the different ports. No address will appear in both Addresses and NotReadyAddresses in the same subset. Sets of addresses and ports that comprise a service.
EndpointSubset is a group of addresses with a common set of ports. The expanded set of endpoints is the Cartesian product of Addresses x Ports. For example, given: { Addresses: [{"ip": "10.10.1.1"}, {"ip": "10.10.2.2"}], Ports: [{"name": "a", "port": 8675}, {"name": "b", "port": 309}] } The resulting set of endpoints can be viewed as: a: [ 10.10.1.1:8675, 10.10.2.2:8675 ], b: [ 10.10.1.1:309, 10.10.2.2:309 ]
-
subsets.addresses ([]EndpointAddress)
IP addresses which offer the related ports that are marked as ready. These endpoints should be considered safe for load balancers and clients to utilize.
EndpointAddress is a tuple that describes a single IP address.
-
subsets.addresses.ip (string), required
The IP of this endpoint. May not be loopback (127.0.0.0/8), link-local (169.254.0.0/16), or link-local multicast (224.0.0.0/24). IPv6 is also accepted but not fully supported on all platforms. Also, certain Kubernetes components, like kube-proxy, are not IPv6 ready.
-
subsets.addresses.hostname (string)
The Hostname of this endpoint
-
subsets.addresses.nodeName (string)
Optional: Node hosting this endpoint. This can be used to determine endpoints local to a node.
-
subsets.addresses.targetRef (ObjectReference)
Reference to object providing the endpoint.
-
-
subsets.notReadyAddresses ([]EndpointAddress)
IP addresses which offer the related ports but are not currently marked as ready because they have not yet finished starting, have recently failed a readiness check, or have recently failed a liveness check.
EndpointAddress is a tuple that describes a single IP address.
-
subsets.notReadyAddresses.ip (string), required
The IP of this endpoint. May not be loopback (127.0.0.0/8), link-local (169.254.0.0/16), or link-local multicast (224.0.0.0/24). IPv6 is also accepted but not fully supported on all platforms. Also, certain Kubernetes components, like kube-proxy, are not IPv6 ready.
-
subsets.notReadyAddresses.hostname (string)
The Hostname of this endpoint
-
subsets.notReadyAddresses.nodeName (string)
Optional: Node hosting this endpoint. This can be used to determine endpoints local to a node.
-
subsets.notReadyAddresses.targetRef (ObjectReference)
Reference to object providing the endpoint.
-
-
subsets.ports ([]EndpointPort)
Port numbers available on the related IP addresses.
EndpointPort is a tuple that describes a single port.
-
subsets.ports.port (int32), required
The port number of the endpoint.
-
subsets.ports.protocol (string)
The IP protocol for this port. Must be UDP, TCP, or SCTP. Default is TCP.
Possible enum values:
"SCTP"
is the SCTP protocol."TCP"
is the TCP protocol."UDP"
is the UDP protocol.
-
subsets.ports.name (string)
The name of this port. This must match the 'name' field in the corresponding ServicePort. Must be a DNS_LABEL. Optional only if one port is defined.
-
subsets.ports.appProtocol (string)
The application protocol for this port. This field follows standard Kubernetes label syntax. Un-prefixed names are reserved for IANA standard service names (as per RFC-6335 and http://www.iana.org/assignments/service-names). Non-standard protocols should use prefixed names such as mycompany.com/my-custom-protocol.
-
-
EndpointsList
EndpointsList is a list of endpoints.
-
apiVersion: v1
-
kind: EndpointsList
-
metadata (ListMeta)
Standard list metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds
-
items ([]Endpoints), required
List of endpoints.
Operations
get
read the specified Endpoints
HTTP Request
GET /api/v1/namespaces/{namespace}/endpoints/{name}
Parameters
-
name (in path): string, required
name of the Endpoints
-
namespace (in path): string, required
-
pretty (in query): string
Response
200 (Endpoints): OK
401: Unauthorized
list
list or watch objects of kind Endpoints
HTTP Request
GET /api/v1/namespaces/{namespace}/endpoints
Parameters
-
namespace (in path): string, required
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (EndpointsList): OK
401: Unauthorized
list
list or watch objects of kind Endpoints
HTTP Request
GET /api/v1/endpoints
Parameters
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (EndpointsList): OK
401: Unauthorized
create
create Endpoints
HTTP Request
POST /api/v1/namespaces/{namespace}/endpoints
Parameters
-
namespace (in path): string, required
-
body: Endpoints, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (Endpoints): OK
201 (Endpoints): Created
202 (Endpoints): Accepted
401: Unauthorized
update
replace the specified Endpoints
HTTP Request
PUT /api/v1/namespaces/{namespace}/endpoints/{name}
Parameters
-
name (in path): string, required
name of the Endpoints
-
namespace (in path): string, required
-
body: Endpoints, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (Endpoints): OK
201 (Endpoints): Created
401: Unauthorized
patch
partially update the specified Endpoints
HTTP Request
PATCH /api/v1/namespaces/{namespace}/endpoints/{name}
Parameters
-
name (in path): string, required
name of the Endpoints
-
namespace (in path): string, required
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (Endpoints): OK
201 (Endpoints): Created
401: Unauthorized
delete
delete Endpoints
HTTP Request
DELETE /api/v1/namespaces/{namespace}/endpoints/{name}
Parameters
-
name (in path): string, required
name of the Endpoints
-
namespace (in path): string, required
-
body: DeleteOptions
-
dryRun (in query): string
-
gracePeriodSeconds (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
Response
200 (Status): OK
202 (Status): Accepted
401: Unauthorized
deletecollection
delete collection of Endpoints
HTTP Request
DELETE /api/v1/namespaces/{namespace}/endpoints
Parameters
-
namespace (in path): string, required
-
body: DeleteOptions
-
continue (in query): string
-
dryRun (in query): string
-
fieldSelector (in query): string
-
gracePeriodSeconds (in query): integer
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
Response
200 (Status): OK
401: Unauthorized
6.5.2.3 - EndpointSlice
apiVersion: discovery.k8s.io/v1
import "k8s.io/api/discovery/v1"
EndpointSlice
EndpointSlice represents a subset of the endpoints that implement a service. For a given service there may be multiple EndpointSlice objects, selected by labels, which must be joined to produce the full set of endpoints.
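A minimal sketch of an EndpointSlice manifest under the schema below; the kubernetes.io/service-name label associates the slice with its Service, and the object name, addresses, node, and zone are illustrative:
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: example-abc                      # illustrative name
  labels:
    kubernetes.io/service-name: example  # owning Service
addressType: IPv4
endpoints:
  - addresses:
      - "10.1.2.3"
    conditions:
      ready: true
    nodeName: node-1                     # illustrative
    zone: us-west2-a                     # illustrative
ports:
  - name: http
    protocol: TCP
    port: 80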
-
apiVersion: discovery.k8s.io/v1
-
kind: EndpointSlice
-
metadata (ObjectMeta)
Standard object's metadata.
-
addressType (string), required
addressType specifies the type of address carried by this EndpointSlice. All addresses in this slice must be the same type. This field is immutable after creation. The following address types are currently supported: * IPv4: Represents an IPv4 Address. * IPv6: Represents an IPv6 Address. * FQDN: Represents a Fully Qualified Domain Name.
Possible enum values:
"FQDN"
represents a FQDN."IPv4"
represents an IPv4 Address."IPv6"
represents an IPv6 Address.
-
endpoints ([]Endpoint), required
Atomic: will be replaced during a merge
endpoints is a list of unique endpoints in this slice. Each slice may include a maximum of 1000 endpoints.
Endpoint represents a single logical "backend" implementing a service.
-
endpoints.addresses ([]string), required
Set: unique values will be kept during a merge
addresses of this endpoint. The contents of this field are interpreted according to the corresponding EndpointSlice addressType field. Consumers must handle different types of addresses in the context of their own capabilities. This must contain at least one address but no more than 100.
-
endpoints.conditions (EndpointConditions)
conditions contains information about the current status of the endpoint.
EndpointConditions represents the current condition of an endpoint.
-
endpoints.conditions.ready (boolean)
ready indicates that this endpoint is prepared to receive traffic, according to whatever system is managing the endpoint. A nil value indicates an unknown state. In most cases consumers should interpret this unknown state as ready. For compatibility reasons, ready should never be "true" for terminating endpoints.
-
endpoints.conditions.serving (boolean)
serving is identical to ready except that it is set regardless of the terminating state of endpoints. This condition should be set to true for a ready endpoint that is terminating. If nil, consumers should defer to the ready condition. This field can be enabled with the EndpointSliceTerminatingCondition feature gate.
-
endpoints.conditions.terminating (boolean)
terminating indicates that this endpoint is terminating. A nil value indicates an unknown state. Consumers should interpret this unknown state to mean that the endpoint is not terminating. This field can be enabled with the EndpointSliceTerminatingCondition feature gate.
-
-
endpoints.deprecatedTopology (map[string]string)
deprecatedTopology contains topology information part of the v1beta1 API. This field is deprecated, and will be removed when the v1beta1 API is removed (no sooner than kubernetes v1.24). While this field can hold values, it is not writable through the v1 API, and any attempts to write to it will be silently ignored. Topology information can be found in the zone and nodeName fields instead.
-
endpoints.hints (EndpointHints)
hints contains information associated with how an endpoint should be consumed.
EndpointHints provides hints describing how an endpoint should be consumed.
-
endpoints.hints.forZones ([]ForZone)
Atomic: will be replaced during a merge
forZones indicates the zone(s) this endpoint should be consumed by to enable topology aware routing.
ForZone provides information about which zones should consume this endpoint.
-
endpoints.hints.forZones.name (string), required
name represents the name of the zone.
-
-
-
endpoints.hostname (string)
hostname of this endpoint. This field may be used by consumers of endpoints to distinguish endpoints from each other (e.g. in DNS names). Multiple endpoints which use the same hostname should be considered fungible (e.g. multiple A values in DNS). Must be lowercase and pass DNS Label (RFC 1123) validation.
-
endpoints.nodeName (string)
nodeName represents the name of the Node hosting this endpoint. This can be used to determine endpoints local to a Node. This field can be enabled with the EndpointSliceNodeName feature gate.
-
endpoints.targetRef (ObjectReference)
targetRef is a reference to a Kubernetes object that represents this endpoint.
-
endpoints.zone (string)
zone is the name of the Zone this endpoint exists in.
-
-
ports ([]EndpointPort)
Atomic: will be replaced during a merge
ports specifies the list of network ports exposed by each endpoint in this slice. Each port must have a unique name. When ports is empty, it indicates that there are no defined ports. When a port is defined with a nil port value, it indicates "all ports". Each slice may include a maximum of 100 ports.
EndpointPort represents a Port used by an EndpointSlice
-
ports.port (int32)
The port number of the endpoint. If this is not specified, ports are not restricted and must be interpreted in the context of the specific consumer.
-
ports.protocol (string)
The IP protocol for this port. Must be UDP, TCP, or SCTP. Default is TCP.
-
ports.name (string)
The name of this port. All ports in an EndpointSlice must have a unique name. If the EndpointSlice is derived from a Kubernetes service, this corresponds to the Service.ports[].name. Name must either be an empty string or pass DNS_LABEL validation: * must be no more than 63 characters long. * must consist of lower case alphanumeric characters or '-'. * must start and end with an alphanumeric character. Default is empty string.
-
ports.appProtocol (string)
The application protocol for this port. This field follows standard Kubernetes label syntax. Un-prefixed names are reserved for IANA standard service names (as per RFC-6335 and http://www.iana.org/assignments/service-names). Non-standard protocols should use prefixed names such as mycompany.com/my-custom-protocol.
-
EndpointSliceList
EndpointSliceList represents a list of endpoint slices
-
apiVersion: discovery.k8s.io/v1
-
kind: EndpointSliceList
-
metadata (ListMeta)
Standard list metadata.
-
items ([]EndpointSlice), required
List of endpoint slices
Operations
get
read the specified EndpointSlice
HTTP Request
GET /apis/discovery.k8s.io/v1/namespaces/{namespace}/endpointslices/{name}
Parameters
-
name (in path): string, required
name of the EndpointSlice
-
namespace (in path): string, required
-
pretty (in query): string
Response
200 (EndpointSlice): OK
401: Unauthorized
list
list or watch objects of kind EndpointSlice
HTTP Request
GET /apis/discovery.k8s.io/v1/namespaces/{namespace}/endpointslices
Parameters
-
namespace (in path): string, required
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (EndpointSliceList): OK
401: Unauthorized
list
list or watch objects of kind EndpointSlice
HTTP Request
GET /apis/discovery.k8s.io/v1/endpointslices
Parameters
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (EndpointSliceList): OK
401: Unauthorized
create
create an EndpointSlice
HTTP Request
POST /apis/discovery.k8s.io/v1/namespaces/{namespace}/endpointslices
Parameters
-
namespace (in path): string, required
-
body: EndpointSlice, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (EndpointSlice): OK
201 (EndpointSlice): Created
202 (EndpointSlice): Accepted
401: Unauthorized
update
replace the specified EndpointSlice
HTTP Request
PUT /apis/discovery.k8s.io/v1/namespaces/{namespace}/endpointslices/{name}
Parameters
-
name (in path): string, required
name of the EndpointSlice
-
namespace (in path): string, required
-
body: EndpointSlice, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (EndpointSlice): OK
201 (EndpointSlice): Created
401: Unauthorized
patch
partially update the specified EndpointSlice
HTTP Request
PATCH /apis/discovery.k8s.io/v1/namespaces/{namespace}/endpointslices/{name}
Parameters
-
name (in path): string, required
name of the EndpointSlice
-
namespace (in path): string, required
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (EndpointSlice): OK
201 (EndpointSlice): Created
401: Unauthorized
delete
delete an EndpointSlice
HTTP Request
DELETE /apis/discovery.k8s.io/v1/namespaces/{namespace}/endpointslices/{name}
Parameters
-
name (in path): string, required
name of the EndpointSlice
-
namespace (in path): string, required
-
body: DeleteOptions
-
dryRun (in query): string
-
gracePeriodSeconds (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
Response
200 (Status): OK
202 (Status): Accepted
401: Unauthorized
deletecollection
delete collection of EndpointSlice
HTTP Request
DELETE /apis/discovery.k8s.io/v1/namespaces/{namespace}/endpointslices
Parameters
-
namespace (in path): string, required
-
body: DeleteOptions
-
continue (in query): string
-
dryRun (in query): string
-
fieldSelector (in query): string
-
gracePeriodSeconds (in query): integer
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
Response
200 (Status): OK
401: Unauthorized
6.5.2.4 - Ingress
apiVersion: networking.k8s.io/v1
import "k8s.io/api/networking/v1"
Ingress
Ingress is a collection of rules that allow inbound connections to reach the endpoints defined by a backend. An Ingress can be configured to give services externally-reachable urls, load balance traffic, terminate SSL, offer name based virtual hosting etc.
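A minimal sketch of an Ingress manifest exercising the spec fields documented below (the host, path, class name, and backend service are illustrative):
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: minimal-ingress          # illustrative name
spec:
  ingressClassName: example-class
  rules:
    - host: foo.bar.com
      http:
        paths:
          - path: /testpath
            pathType: Prefix
            backend:
              service:
                name: test       # Service in the same namespace
                port:
                  number: 80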
-
apiVersion: networking.k8s.io/v1
-
kind: Ingress
-
metadata (ObjectMeta)
Standard object's metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
spec (IngressSpec)
Spec is the desired state of the Ingress. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#spec-and-status
-
status (IngressStatus)
Status is the current state of the Ingress. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#spec-and-status
IngressSpec
IngressSpec describes the Ingress the user wishes to exist.
-
defaultBackend (IngressBackend)
DefaultBackend is the backend that should handle requests that don't match any rule. If Rules are not specified, DefaultBackend must be specified. If DefaultBackend is not set, the handling of requests that do not match any of the rules will be up to the Ingress controller.
-
ingressClassName (string)
IngressClassName is the name of the IngressClass cluster resource. The associated IngressClass defines which controller will implement the resource. This replaces the deprecated kubernetes.io/ingress.class annotation. For backwards compatibility, when that annotation is set, it must be given precedence over this field. The controller may emit a warning if the field and annotation have different values. Implementations of this API should ignore Ingresses without a class specified. An IngressClass resource may be marked as default, which can be used to set a default value for this field. For more information, refer to the IngressClass documentation.
-
rules ([]IngressRule)
Atomic: will be replaced during a merge
A list of host rules used to configure the Ingress. If unspecified, or no rule matches, all traffic is sent to the default backend.
IngressRule represents the rules mapping the paths under a specified host to the related backend services. Incoming requests are first evaluated for a host match, then routed to the backend associated with the matching IngressRuleValue.
-
rules.host (string)
Host is the fully qualified domain name of a network host, as defined by RFC 3986. Note the following deviations from the "host" part of the URI as defined in RFC 3986: 1. IPs are not allowed. Currently an IngressRuleValue can only apply to the IP in the Spec of the parent Ingress. 2. The ':' delimiter is not respected because ports are not allowed. Currently the port of an Ingress is implicitly :80 for http and :443 for https. Both these may change in the future. Incoming requests are matched against the host before the IngressRuleValue. If the host is unspecified, the Ingress routes all traffic based on the specified IngressRuleValue.
Host can be "precise", which is a domain name without the terminating dot of a network host (e.g. "foo.bar.com"), or "wildcard", which is a domain name prefixed with a single wildcard label (e.g. "*.foo.com"). The wildcard character '*' must appear by itself as the first DNS label and matches only a single label. You cannot have a wildcard label by itself (e.g. Host == "*"). Requests will be matched against the Host field in the following way: 1. If Host is precise, the request matches this rule if the http host header is equal to Host. 2. If Host is a wildcard, then the request matches this rule if the http host header is equal to the suffix (removing the first label) of the wildcard rule.
-
rules.http (HTTPIngressRuleValue)
HTTPIngressRuleValue is a list of http selectors pointing to backends. In the example: http://<host>/<path>?<searchpart> -> backend, where parts of the url correspond to RFC 3986, this resource will be used to match against everything after the last '/' and before the first '?' or '#'.
-
rules.http.paths ([]HTTPIngressPath), required
Atomic: will be replaced during a merge
A collection of paths that map requests to backends.
HTTPIngressPath associates a path with a backend. Incoming urls matching the path are forwarded to the backend.
-
rules.http.paths.backend (IngressBackend), required
Backend defines the referenced service endpoint to which the traffic will be forwarded to.
-
rules.http.paths.pathType (string), required
PathType determines the interpretation of the Path matching. PathType can be one of the following values: * Exact: Matches the URL path exactly. * Prefix: Matches based on a URL path prefix split by '/'. Matching is done on a path element by element basis. A path element refers to the list of labels in the path split by the '/' separator. A request is a match for path p if every element of p is an element-wise prefix of the corresponding element of the request path. Note that if the last element of the path is a substring of the last element in request path, it is not a match (e.g. /foo/bar matches /foo/bar/baz, but does not match /foo/barbaz).
- ImplementationSpecific: Interpretation of the Path matching is up to the IngressClass. Implementations can treat this as a separate PathType or treat it identically to Prefix or Exact path types. Implementations are required to support all path types.
-
rules.http.paths.path (string)
Path is matched against the path of an incoming request. Currently it can contain characters disallowed from the conventional "path" part of a URL as defined by RFC 3986. Paths must begin with a '/' and must be present when using PathType with value "Exact" or "Prefix".
-
-
-
-
tls ([]IngressTLS)
Atomic: will be replaced during a merge
TLS configuration. Currently the Ingress only supports a single TLS port, 443. If multiple members of this list specify different hosts, they will be multiplexed on the same port according to the hostname specified through the SNI TLS extension, if the ingress controller fulfilling the ingress supports SNI.
IngressTLS describes the transport layer security associated with an Ingress.
-
tls.hosts ([]string)
Atomic: will be replaced during a merge
Hosts are a list of hosts included in the TLS certificate. The values in this list must match the name/s used in the tlsSecret. Defaults to the wildcard host setting for the loadbalancer controller fulfilling this Ingress, if left unspecified.
-
tls.secretName (string)
SecretName is the name of the secret used to terminate TLS traffic on port 443. Field is left optional to allow TLS routing based on SNI hostname alone. If the SNI host in a listener conflicts with the "Host" header field used by an IngressRule, the SNI host is used for termination and value of the Host header is used for routing.
-
IngressBackend
IngressBackend describes all endpoints for a given service and port.
-
resource (TypedLocalObjectReference)
Resource is an ObjectRef to another Kubernetes resource in the namespace of the Ingress object. If resource is specified, a service.Name and service.Port must not be specified. This is a mutually exclusive setting with "Service".
-
service (IngressServiceBackend)
Service references a Service as a Backend. This is a mutually exclusive setting with "Resource".
IngressServiceBackend references a Kubernetes Service as a Backend.
-
service.name (string), required
Name is the referenced service. The service must exist in the same namespace as the Ingress object.
-
service.port (ServiceBackendPort)
Port of the referenced service. A port name or port number is required for an IngressServiceBackend.
ServiceBackendPort is the service port being referenced.
-
service.port.name (string)
Name is the name of the port on the Service. This is a mutually exclusive setting with "Number".
-
service.port.number (int32)
Number is the numerical port number (e.g. 80) on the Service. This is a mutually exclusive setting with "Name".
-
-
IngressStatus
IngressStatus describes the current state of the Ingress.
-
loadBalancer (LoadBalancerStatus)
LoadBalancer contains the current status of the load-balancer.
LoadBalancerStatus represents the status of a load-balancer.
-
loadBalancer.ingress ([]LoadBalancerIngress)
Ingress is a list containing ingress points for the load-balancer. Traffic intended for the service should be sent to these ingress points.
LoadBalancerIngress represents the status of a load-balancer ingress point: traffic intended for the service should be sent to an ingress point.
-
loadBalancer.ingress.hostname (string)
Hostname is set for load-balancer ingress points that are DNS based (typically AWS load-balancers)
-
loadBalancer.ingress.ip (string)
IP is set for load-balancer ingress points that are IP based (typically GCE or OpenStack load-balancers)
-
loadBalancer.ingress.ports ([]PortStatus)
Atomic: will be replaced during a merge
Ports is a list of records of service ports. If used, every port defined in the service should have an entry in it.
-
loadBalancer.ingress.ports.port (int32), required
Port is the port number of the service port of which status is recorded here.
-
loadBalancer.ingress.ports.protocol (string), required
Protocol is the protocol of the service port of which status is recorded here. The supported values are: "TCP", "UDP", "SCTP".
Possible enum values:
"SCTP"
is the SCTP protocol."TCP"
is the TCP protocol."UDP"
is the UDP protocol.
-
loadBalancer.ingress.ports.error (string)
Error is to record the problem with the service port. The format of the error shall comply with the following rules:
- built-in error values shall be specified in this file and those shall use CamelCase names
- cloud provider specific error values must have names that comply with the format foo.example.com/CamelCase.
-
-
-
IngressList
IngressList is a collection of Ingress.
-
items ([]Ingress), required
Items is the list of Ingress.
-
apiVersion (string)
APIVersion defines the versioned schema of this representation of an object. Servers should convert recognized schemas to the latest internal value, and may reject unrecognized values. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources
-
kind (string)
Kind is a string value representing the REST resource this object represents. Servers may infer this from the endpoint the client submits requests to. Cannot be updated. In CamelCase. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds
-
metadata (ListMeta)
Standard object's metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
Operations
get
read the specified Ingress
HTTP Request
GET /apis/networking.k8s.io/v1/namespaces/{namespace}/ingresses/{name}
Parameters
-
name (in path): string, required
name of the Ingress
-
namespace (in path): string, required
-
pretty (in query): string
Response
200 (Ingress): OK
401: Unauthorized
get
read status of the specified Ingress
HTTP Request
GET /apis/networking.k8s.io/v1/namespaces/{namespace}/ingresses/{name}/status
Parameters
-
name (in path): string, required
name of the Ingress
-
namespace (in path): string, required
-
pretty (in query): string
Response
200 (Ingress): OK
401: Unauthorized
list
list or watch objects of kind Ingress
HTTP Request
GET /apis/networking.k8s.io/v1/namespaces/{namespace}/ingresses
Parameters
-
namespace (in path): string, required
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (IngressList): OK
401: Unauthorized
list
list or watch objects of kind Ingress
HTTP Request
GET /apis/networking.k8s.io/v1/ingresses
Parameters
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (IngressList): OK
401: Unauthorized
create
create an Ingress
HTTP Request
POST /apis/networking.k8s.io/v1/namespaces/{namespace}/ingresses
Parameters
-
namespace (in path): string, required
-
body: Ingress, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (Ingress): OK
201 (Ingress): Created
202 (Ingress): Accepted
401: Unauthorized
update
replace the specified Ingress
HTTP Request
PUT /apis/networking.k8s.io/v1/namespaces/{namespace}/ingresses/{name}
Parameters
-
name (in path): string, required
name of the Ingress
-
namespace (in path): string, required
-
body: Ingress, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (Ingress): OK
201 (Ingress): Created
401: Unauthorized
update
replace status of the specified Ingress
HTTP Request
PUT /apis/networking.k8s.io/v1/namespaces/{namespace}/ingresses/{name}/status
Parameters
-
name (in path): string, required
name of the Ingress
-
namespace (in path): string, required
-
body: Ingress, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (Ingress): OK
201 (Ingress): Created
401: Unauthorized
patch
partially update the specified Ingress
HTTP Request
PATCH /apis/networking.k8s.io/v1/namespaces/{namespace}/ingresses/{name}
Parameters
-
name (in path): string, required
name of the Ingress
-
namespace (in path): string, required
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (Ingress): OK
201 (Ingress): Created
401: Unauthorized
patch
partially update status of the specified Ingress
HTTP Request
PATCH /apis/networking.k8s.io/v1/namespaces/{namespace}/ingresses/{name}/status
Parameters
-
name (in path): string, required
name of the Ingress
-
namespace (in path): string, required
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (Ingress): OK
201 (Ingress): Created
401: Unauthorized
delete
delete an Ingress
HTTP Request
DELETE /apis/networking.k8s.io/v1/namespaces/{namespace}/ingresses/{name}
Parameters
-
name (in path): string, required
name of the Ingress
-
namespace (in path): string, required
-
body: DeleteOptions
-
dryRun (in query): string
-
gracePeriodSeconds (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
Response
200 (Status): OK
202 (Status): Accepted
401: Unauthorized
deletecollection
delete collection of Ingress
HTTP Request
DELETE /apis/networking.k8s.io/v1/namespaces/{namespace}/ingresses
Parameters
-
namespace (in path): string, required
-
body: DeleteOptions
-
continue (in query): string
-
dryRun (in query): string
-
fieldSelector (in query): string
-
gracePeriodSeconds (in query): integer
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
Response
200 (Status): OK
401: Unauthorized
6.5.2.5 - IngressClass
apiVersion: networking.k8s.io/v1
import "k8s.io/api/networking/v1"
IngressClass
IngressClass represents the class of the Ingress, referenced by the Ingress Spec. The ingressclass.kubernetes.io/is-default-class annotation can be used to indicate that an IngressClass should be considered default. When a single IngressClass resource has this annotation set to true, new Ingress resources without a class specified will be assigned this default class.
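A sketch of an IngressClass manifest marked as the cluster default via that annotation (the name and controller value are illustrative):
apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
  name: example-class
  annotations:
    ingressclass.kubernetes.io/is-default-class: "true"
spec:
  controller: example.com/ingress-controller  # domain-prefixed path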
-
apiVersion: networking.k8s.io/v1
-
kind: IngressClass
-
metadata (ObjectMeta)
Standard object's metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
spec (IngressClassSpec)
Spec is the desired state of the IngressClass. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#spec-and-status
IngressClassSpec
IngressClassSpec provides information about the class of an Ingress.
-
controller (string)
Controller refers to the name of the controller that should handle this class. This allows for different "flavors" that are controlled by the same controller. For example, you may have different Parameters for the same implementing controller. This should be specified as a domain-prefixed path no more than 250 characters in length, e.g. "acme.io/ingress-controller". This field is immutable.
-
parameters (IngressClassParametersReference)
Parameters is a link to a custom resource containing additional configuration for the controller. This is optional if the controller does not require extra parameters.
IngressClassParametersReference identifies an API object. This can be used to specify a cluster or namespace-scoped resource.
-
parameters.kind (string), required
Kind is the type of resource being referenced.
-
parameters.name (string), required
Name is the name of resource being referenced.
-
parameters.apiGroup (string)
APIGroup is the group for the resource being referenced. If APIGroup is not specified, the specified Kind must be in the core API group. For any other third-party types, APIGroup is required.
-
parameters.namespace (string)
Namespace is the namespace of the resource being referenced. This field is required when scope is set to "Namespace" and must be unset when scope is set to "Cluster".
-
parameters.scope (string)
Scope represents if this refers to a cluster or namespace scoped resource. This may be set to "Cluster" (default) or "Namespace".
-
IngressClassList
IngressClassList is a collection of IngressClasses.
-
apiVersion: networking.k8s.io/v1
-
kind: IngressClassList
-
metadata (ListMeta)
Standard list metadata.
-
items ([]IngressClass), required
Items is the list of IngressClasses.
Operations
get
read the specified IngressClass
HTTP Request
GET /apis/networking.k8s.io/v1/ingressclasses/{name}
Parameters
-
name (in path): string, required
name of the IngressClass
-
pretty (in query): string
Response
200 (IngressClass): OK
401: Unauthorized
list
list or watch objects of kind IngressClass
HTTP Request
GET /apis/networking.k8s.io/v1/ingressclasses
Parameters
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (IngressClassList): OK
401: Unauthorized
create
create an IngressClass
HTTP Request
POST /apis/networking.k8s.io/v1/ingressclasses
Parameters
-
body: IngressClass, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (IngressClass): OK
201 (IngressClass): Created
202 (IngressClass): Accepted
401: Unauthorized
update
replace the specified IngressClass
HTTP Request
PUT /apis/networking.k8s.io/v1/ingressclasses/{name}
Parameters
-
name (in path): string, required
name of the IngressClass
-
body: IngressClass, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (IngressClass): OK
201 (IngressClass): Created
401: Unauthorized
patch
partially update the specified IngressClass
HTTP Request
PATCH /apis/networking.k8s.io/v1/ingressclasses/{name}
Parameters
-
name (in path): string, required
name of the IngressClass
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (IngressClass): OK
201 (IngressClass): Created
401: Unauthorized
delete
delete an IngressClass
HTTP Request
DELETE /apis/networking.k8s.io/v1/ingressclasses/{name}
Parameters
-
name (in path): string, required
name of the IngressClass
-
body: DeleteOptions
-
dryRun (in query): string
-
gracePeriodSeconds (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
Response
200 (Status): OK
202 (Status): Accepted
401: Unauthorized
deletecollection
delete collection of IngressClass
HTTP Request
DELETE /apis/networking.k8s.io/v1/ingressclasses
Parameters
-
body: DeleteOptions
-
continue (in query): string
-
dryRun (in query): string
-
fieldSelector (in query): string
-
gracePeriodSeconds (in query): integer
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
Response
200 (Status): OK
401: Unauthorized
6.5.3 - Config and Storage Resources
6.5.3.1 - ConfigMap
apiVersion: v1
import "k8s.io/api/core/v1"
ConfigMap
ConfigMap holds configuration data for pods to consume.
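A short sketch of a ConfigMap manifest showing the data, binaryData, and immutable fields documented below (keys and values are illustrative; binaryData values are base64-encoded bytes):
apiVersion: v1
kind: ConfigMap
metadata:
  name: game-config            # illustrative name
data:
  player-initial-lives: "3"
  ui.properties: |
    color.good=purple
    color.bad=yellow
binaryData:
  logo.png: iVBORw0KGgo=       # truncated base64, illustrative
immutable: true                # data can no longer be updated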
-
apiVersion: v1
-
kind: ConfigMap
-
metadata (ObjectMeta)
Standard object's metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
binaryData (map[string][]byte)
BinaryData contains the binary data. Each key must consist of alphanumeric characters, '-', '_' or '.'. BinaryData can contain byte sequences that are not in the UTF-8 range. The keys stored in BinaryData must not overlap with the ones in the Data field; this is enforced during the validation process. Using this field will require 1.10+ apiserver and kubelet.
-
data (map[string]string)
Data contains the configuration data. Each key must consist of alphanumeric characters, '-', '_' or '.'. Values with non-UTF-8 byte sequences must use the BinaryData field. The keys stored in Data must not overlap with the keys in the BinaryData field; this is enforced during the validation process.
-
immutable (boolean)
Immutable, if set to true, ensures that data stored in the ConfigMap cannot be updated (only object metadata can be modified). If not set to true, the field can be modified at any time. Defaulted to nil.
ConfigMapList
ConfigMapList is a resource containing a list of ConfigMap objects.
-
apiVersion: v1
-
kind: ConfigMapList
-
metadata (ListMeta)
More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
items ([]ConfigMap), required
Items is the list of ConfigMaps.
Operations
get
read the specified ConfigMap
HTTP Request
GET /api/v1/namespaces/{namespace}/configmaps/{name}
Parameters
-
name (in path): string, required
name of the ConfigMap
-
namespace (in path): string, required
-
pretty (in query): string
Response
200 (ConfigMap): OK
401: Unauthorized
list
list or watch objects of kind ConfigMap
HTTP Request
GET /api/v1/namespaces/{namespace}/configmaps
Parameters
-
namespace (in path): string, required
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (ConfigMapList): OK
401: Unauthorized
list
list or watch objects of kind ConfigMap
HTTP Request
GET /api/v1/configmaps
Parameters
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (ConfigMapList): OK
401: Unauthorized
create
create a ConfigMap
HTTP Request
POST /api/v1/namespaces/{namespace}/configmaps
Parameters
-
namespace (in path): string, required
-
body: ConfigMap, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (ConfigMap): OK
201 (ConfigMap): Created
202 (ConfigMap): Accepted
401: Unauthorized
update
replace the specified ConfigMap
HTTP Request
PUT /api/v1/namespaces/{namespace}/configmaps/{name}
Parameters
-
name (in path): string, required
name of the ConfigMap
-
namespace (in path): string, required
-
body: ConfigMap, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (ConfigMap): OK
201 (ConfigMap): Created
401: Unauthorized
patch
partially update the specified ConfigMap
HTTP Request
PATCH /api/v1/namespaces/{namespace}/configmaps/{name}
Parameters
-
name (in path): string, required
name of the ConfigMap
-
namespace (in path): string, required
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (ConfigMap): OK
201 (ConfigMap): Created
401: Unauthorized
delete
delete a ConfigMap
HTTP Request
DELETE /api/v1/namespaces/{namespace}/configmaps/{name}
Parameters
-
name (in path): string, required
name of the ConfigMap
-
namespace (in path): string, required
-
body: DeleteOptions
-
dryRun (in query): string
-
gracePeriodSeconds (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
Response
200 (Status): OK
202 (Status): Accepted
401: Unauthorized
deletecollection
delete collection of ConfigMap
HTTP Request
DELETE /api/v1/namespaces/{namespace}/configmaps
Parameters
-
namespace (in path): string, required
-
body: DeleteOptions
-
continue (in query): string
-
dryRun (in query): string
-
fieldSelector (in query): string
-
gracePeriodSeconds (in query): integer
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
Response
200 (Status): OK
401: Unauthorized
6.5.3.2 - Secret
apiVersion: v1
import "k8s.io/api/core/v1"
Secret
Secret holds secret data of a certain type. The total bytes of the values in the Data field must be less than MaxSecretSize bytes.
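A sketch of a Secret manifest using the write-only stringData convenience field documented below (the name and values are illustrative; on write the server merges stringData into data, base64-encoded):
apiVersion: v1
kind: Secret
metadata:
  name: db-credentials   # illustrative name
type: Opaque
stringData:
  username: admin
  password: t0p-Secret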
-
apiVersion: v1
-
kind: Secret
-
metadata (ObjectMeta)
Standard object's metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
data (map[string][]byte)
Data contains the secret data. Each key must consist of alphanumeric characters, '-', '_' or '.'. The serialized form of the secret data is a base64 encoded string, representing the arbitrary (possibly non-string) data value here. Described in https://tools.ietf.org/html/rfc4648#section-4
-
immutable (boolean)
Immutable, if set to true, ensures that data stored in the Secret cannot be updated (only object metadata can be modified). If not set to true, the field can be modified at any time. Defaulted to nil.
-
stringData (map[string]string)
stringData allows specifying non-binary secret data in string form. It is provided as a write-only input field for convenience. All keys and values are merged into the data field on write, overwriting any existing values. The stringData field is never output when reading from the API.
-
type (string)
Used to facilitate programmatic handling of secret data. More info: https://kubernetes.io/docs/concepts/configuration/secret/#secret-types
SecretList
SecretList is a list of Secret.
-
apiVersion: v1
-
kind: SecretList
-
metadata (ListMeta)
Standard list metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds
-
items ([]Secret), required
Items is a list of secret objects. More info: https://kubernetes.io/docs/concepts/configuration/secret
Operations
get
read the specified Secret
HTTP Request
GET /api/v1/namespaces/{namespace}/secrets/{name}
Parameters
-
name (in path): string, required
name of the Secret
-
namespace (in path): string, required
-
pretty (in query): string
Response
200 (Secret): OK
401: Unauthorized
list
list or watch objects of kind Secret
HTTP Request
GET /api/v1/namespaces/{namespace}/secrets
Parameters
-
namespace (in path): string, required
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (SecretList): OK
401: Unauthorized
list
list or watch objects of kind Secret
HTTP Request
GET /api/v1/secrets
Parameters
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (SecretList): OK
401: Unauthorized
create
create a Secret
HTTP Request
POST /api/v1/namespaces/{namespace}/secrets
Parameters
-
namespace (in path): string, required
-
body: Secret, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (Secret): OK
201 (Secret): Created
202 (Secret): Accepted
401: Unauthorized
update
replace the specified Secret
HTTP Request
PUT /api/v1/namespaces/{namespace}/secrets/{name}
Parameters
-
name (in path): string, required
name of the Secret
-
namespace (in path): string, required
-
body: Secret, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (Secret): OK
201 (Secret): Created
401: Unauthorized
patch
partially update the specified Secret
HTTP Request
PATCH /api/v1/namespaces/{namespace}/secrets/{name}
Parameters
-
name (in path): string, required
name of the Secret
-
namespace (in path): string, required
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (Secret): OK
201 (Secret): Created
401: Unauthorized
delete
delete a Secret
HTTP Request
DELETE /api/v1/namespaces/{namespace}/secrets/{name}
Parameters
-
name (in path): string, required
name of the Secret
-
namespace (in path): string, required
-
body: DeleteOptions
-
dryRun (in query): string
-
gracePeriodSeconds (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
Response
200 (Status): OK
202 (Status): Accepted
401: Unauthorized
deletecollection
delete collection of Secret
HTTP Request
DELETE /api/v1/namespaces/{namespace}/secrets
Parameters
-
namespace (in path): string, required
-
body: DeleteOptions
-
continue (in query): string
-
dryRun (in query): string
-
fieldSelector (in query): string
-
gracePeriodSeconds (in query): integer
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
Response
200 (Status): OK
401: Unauthorized
6.5.3.3 - Volume
import "k8s.io/api/core/v1"
Volume
Volume represents a named volume in a pod that may be accessed by any container in the pod.
-
name (string), required
Volume's name. Must be a DNS_LABEL and unique within the pod. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names
Exposed Persistent volumes
-
persistentVolumeClaim (PersistentVolumeClaimVolumeSource)
PersistentVolumeClaimVolumeSource represents a reference to a PersistentVolumeClaim in the same namespace. More info: https://kubernetes.io/docs/concepts/storage/persistent-volumes#persistentvolumeclaims
PersistentVolumeClaimVolumeSource references the user's PVC in the same namespace. This volume finds the bound PV and mounts that volume for the pod. A PersistentVolumeClaimVolumeSource is, essentially, a wrapper around another type of volume that is owned by someone else (the system).
-
persistentVolumeClaim.claimName (string), required
ClaimName is the name of a PersistentVolumeClaim in the same namespace as the pod using this volume. More info: https://kubernetes.io/docs/concepts/storage/persistent-volumes#persistentvolumeclaims
-
persistentVolumeClaim.readOnly (boolean)
Will force the ReadOnly setting in VolumeMounts. Default false.
-
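A sketch of a pod that consumes a PersistentVolumeClaim through this volume source (the pod name, image, mount path, and claim name are illustrative):
apiVersion: v1
kind: Pod
metadata:
  name: pvc-pod
spec:
  containers:
    - name: app
      image: nginx              # illustrative image
      volumeMounts:
        - name: data
          mountPath: /var/www
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: my-claim     # existing PVC in the same namespace
        readOnly: false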
Projections
-
configMap (ConfigMapVolumeSource)
ConfigMap represents a configMap that should populate this volume
Adapts a ConfigMap into a volume.
The contents of the target ConfigMap's Data field will be presented in a volume as files using the keys in the Data field as the file names, unless the items element is populated with specific mappings of keys to paths. ConfigMap volumes support ownership management and SELinux relabeling.
-
configMap.name (string)
Name of the referent. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names
-
configMap.optional (boolean)
Specify whether the ConfigMap or its keys must be defined
-
configMap.defaultMode (int32)
Optional: mode bits used to set permissions on created files by default. Must be an octal value between 0000 and 0777 or a decimal value between 0 and 511. YAML accepts both octal and decimal values, JSON requires decimal values for mode bits. Defaults to 0644. Directories within the path are not affected by this setting. This might be in conflict with other options that affect the file mode, like fsGroup, and the result can be other mode bits set.
-
configMap.items ([]KeyToPath)
If unspecified, each key-value pair in the Data field of the referenced ConfigMap will be projected into the volume as a file whose name is the key and content is the value. If specified, the listed keys will be projected into the specified paths, and unlisted keys will not be present. If a key is specified which is not present in the ConfigMap, the volume setup will error unless it is marked optional. Paths must be relative and may not contain the '..' path or start with '..'.
-
-
secret (SecretVolumeSource)
Secret represents a secret that should populate this volume. More info: https://kubernetes.io/docs/concepts/storage/volumes#secret
Adapts a Secret into a volume.
The contents of the target Secret's Data field will be presented in a volume as files using the keys in the Data field as the file names. Secret volumes support ownership management and SELinux relabeling.
-
secret.secretName (string)
Name of the secret in the pod's namespace to use. More info: https://kubernetes.io/docs/concepts/storage/volumes#secret
-
secret.optional (boolean)
Specify whether the Secret or its keys must be defined
-
secret.defaultMode (int32)
Optional: mode bits used to set permissions on created files by default. Must be an octal value between 0000 and 0777 or a decimal value between 0 and 511. YAML accepts both octal and decimal values, JSON requires decimal values for mode bits. Defaults to 0644. Directories within the path are not affected by this setting. This might be in conflict with other options that affect the file mode, like fsGroup, and the result can be other mode bits set.
-
secret.items ([]KeyToPath)
If unspecified, each key-value pair in the Data field of the referenced Secret will be projected into the volume as a file whose name is the key and content is the value. If specified, the listed keys will be projected into the specified paths, and unlisted keys will not be present. If a key is specified which is not present in the Secret, the volume setup will error unless it is marked optional. Paths must be relative and may not contain the '..' path or start with '..'.
-
-
downwardAPI (DownwardAPIVolumeSource)
DownwardAPI represents downward API about the pod that should populate this volume
DownwardAPIVolumeSource represents a volume containing downward API info. Downward API volumes support ownership management and SELinux relabeling.
-
downwardAPI.defaultMode (int32)
Optional: mode bits used to set permissions on created files by default. Must be an octal value between 0000 and 0777 or a decimal value between 0 and 511. YAML accepts both octal and decimal values, JSON requires decimal values for mode bits. Defaults to 0644. Directories within the path are not affected by this setting. This might be in conflict with other options that affect the file mode, like fsGroup, and the result can be other mode bits set.
-
downwardAPI.items ([]DownwardAPIVolumeFile)
Items is a list of downward API volume file
-
-
projected (ProjectedVolumeSource)
Projected items for all-in-one resources: secrets, configmaps, and downward API.
Represents a projected volume source
-
projected.defaultMode (int32)
Mode bits used to set permissions on created files by default. Must be an octal value between 0000 and 0777 or a decimal value between 0 and 511. YAML accepts both octal and decimal values, JSON requires decimal values for mode bits. Directories within the path are not affected by this setting. This might be in conflict with other options that affect the file mode, like fsGroup, and the result can be other mode bits set.
-
projected.sources ([]VolumeProjection)
list of volume projections
Projection that may be projected along with other supported volume types
-
projected.sources.configMap (ConfigMapProjection)
information about the configMap data to project
Adapts a ConfigMap into a projected volume.
The contents of the target ConfigMap's Data field will be presented in a projected volume as files using the keys in the Data field as the file names, unless the items element is populated with specific mappings of keys to paths. Note that this is identical to a configmap volume source without the default mode.
-
projected.sources.configMap.name (string)
Name of the referent. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names
-
projected.sources.configMap.optional (boolean)
Specify whether the ConfigMap or its keys must be defined
-
projected.sources.configMap.items ([]KeyToPath)
If unspecified, each key-value pair in the Data field of the referenced ConfigMap will be projected into the volume as a file whose name is the key and content is the value. If specified, the listed keys will be projected into the specified paths, and unlisted keys will not be present. If a key is specified which is not present in the ConfigMap, the volume setup will error unless it is marked optional. Paths must be relative and may not contain the '..' path or start with '..'.
-
-
projected.sources.downwardAPI (DownwardAPIProjection)
information about the downwardAPI data to project
Represents downward API info for projecting into a projected volume. Note that this is identical to a downwardAPI volume source without the default mode.
-
projected.sources.downwardAPI.items ([]DownwardAPIVolumeFile)
Items is a list of DownwardAPIVolume files.
-
-
projected.sources.secret (SecretProjection)
information about the secret data to project
*Adapts a secret into a projected volume.
The contents of the target Secret's Data field will be presented in a projected volume as files using the keys in the Data field as the file names. Note that this is identical to a secret volume source without the default mode.*
-
projected.sources.secret.name (string)
Name of the referent. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names
-
projected.sources.secret.optional (boolean)
Specify whether the Secret or its key must be defined
-
projected.sources.secret.items ([]KeyToPath)
If unspecified, each key-value pair in the Data field of the referenced Secret will be projected into the volume as a file whose name is the key and content is the value. If specified, the listed keys will be projected into the specified paths, and unlisted keys will not be present. If a key is specified which is not present in the Secret, the volume setup will error unless it is marked optional. Paths must be relative and may not contain the '..' path or start with '..'.
-
-
projected.sources.serviceAccountToken (ServiceAccountTokenProjection)
information about the serviceAccountToken data to project
ServiceAccountTokenProjection represents a projected service account token volume. This projection can be used to insert a service account token into the pods runtime filesystem for use against APIs (Kubernetes API Server or otherwise).
-
projected.sources.serviceAccountToken.path (string), required
Path is the path relative to the mount point of the file to project the token into.
-
projected.sources.serviceAccountToken.audience (string)
Audience is the intended audience of the token. A recipient of a token must identify itself with an identifier specified in the audience of the token, and otherwise should reject the token. The audience defaults to the identifier of the apiserver.
-
projected.sources.serviceAccountToken.expirationSeconds (int64)
ExpirationSeconds is the requested duration of validity of the service account token. As the token approaches expiration, the kubelet volume plugin will proactively rotate the service account token. The kubelet will start trying to rotate the token if the token is older than 80 percent of its time to live or if the token is older than 24 hours. Defaults to 1 hour and must be at least 10 minutes.
-
-
-
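A sketch combining all four projection sources in one projected volume, shown as a spec.volumes fragment; mysecret, myconfig, and the vault audience are placeholders, and expirationSeconds respects the 10-minute minimum noted above.

```yaml
volumes:
  - name: all-in-one
    projected:
      defaultMode: 0440
      sources:
        - secret:
            name: mysecret       # illustrative Secret
        - configMap:
            name: myconfig       # illustrative ConfigMap
            items:
              - key: config.yaml
                path: cfg/config.yaml
        - downwardAPI:
            items:
              - path: namespace
                fieldRef:
                  fieldPath: metadata.namespace
        - serviceAccountToken:
            path: token
            audience: vault           # illustrative audience
            expirationSeconds: 3600   # >= 600; the kubelet rotates proactively
```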
Local / Temporary Directory
-
emptyDir (EmptyDirVolumeSource)
EmptyDir represents a temporary directory that shares a pod's lifetime. More info: https://kubernetes.io/docs/concepts/storage/volumes#emptydir
Represents an empty directory for a pod. Empty directory volumes support ownership management and SELinux relabeling.
-
emptyDir.medium (string)
What type of storage medium should back this directory. The default is "" which means to use the node's default medium. Must be an empty string (default) or Memory. More info: https://kubernetes.io/docs/concepts/storage/volumes#emptydir
-
emptyDir.sizeLimit (Quantity)
Total amount of local storage required for this EmptyDir volume. The size limit is also applicable for memory medium. The maximum usage on memory medium EmptyDir would be the minimum value between the SizeLimit specified here and the sum of memory limits of all containers in a pod. The default is nil which means that the limit is undefined. More info: https://kubernetes.io/docs/concepts/storage/volumes#emptydir
-
-
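For instance, a memory-backed emptyDir capped at 256Mi, shown as a spec.volumes fragment:

```yaml
volumes:
  - name: scratch
    emptyDir:
      medium: Memory      # tmpfs; "" (the default) uses the node's default medium
      sizeLimit: 256Mi    # the limit also applies to the Memory medium
```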
hostPath (HostPathVolumeSource)
HostPath represents a pre-existing file or directory on the host machine that is directly exposed to the container. This is generally used for system agents or other privileged things that are allowed to see the host machine. Most containers will NOT need this. More info: https://kubernetes.io/docs/concepts/storage/volumes#hostpath
Represents a host path mapped into a pod. Host path volumes do not support ownership management or SELinux relabeling.
-
hostPath.path (string), required
Path of the directory on the host. If the path is a symlink, it will follow the link to the real path. More info: https://kubernetes.io/docs/concepts/storage/volumes#hostpath
-
hostPath.type (string)
Type for HostPath volume. Defaults to "". More info: https://kubernetes.io/docs/concepts/storage/volumes#hostpath
-
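A spec.volumes fragment illustrating hostPath with an explicit type check; /var/log is only an example path:

```yaml
volumes:
  - name: host-logs
    hostPath:
      path: /var/log      # must already exist on the node for type Directory
      type: Directory     # "" (the default) performs no check before mounting
```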
Persistent volumes
-
awsElasticBlockStore (AWSElasticBlockStoreVolumeSource)
AWSElasticBlockStore represents an AWS Disk resource that is attached to a kubelet's host machine and then exposed to the pod. More info: https://kubernetes.io/docs/concepts/storage/volumes#awselasticblockstore
*Represents a Persistent Disk resource in AWS.
An AWS EBS disk must exist before mounting to a container. The disk must also be in the same AWS zone as the kubelet. An AWS EBS disk can only be mounted as read/write once. AWS EBS volumes support ownership management and SELinux relabeling.*
-
awsElasticBlockStore.volumeID (string), required
Unique ID of the persistent disk resource in AWS (Amazon EBS volume). More info: https://kubernetes.io/docs/concepts/storage/volumes#awselasticblockstore
-
awsElasticBlockStore.fsType (string)
Filesystem type of the volume that you want to mount. Tip: Ensure that the filesystem type is supported by the host operating system. Examples: "ext4", "xfs", "ntfs". Implicitly inferred to be "ext4" if unspecified. More info: https://kubernetes.io/docs/concepts/storage/volumes#awselasticblockstore
-
awsElasticBlockStore.partition (int32)
The partition in the volume that you want to mount. If omitted, the default is to mount by volume name. Examples: For volume /dev/sda1, you specify the partition as "1". Similarly, the volume partition for /dev/sda is "0" (or you can leave the property empty).
-
awsElasticBlockStore.readOnly (boolean)
Specify "true" to force and set the ReadOnly property in VolumeMounts to "true". If omitted, the default is "false". More info: https://kubernetes.io/docs/concepts/storage/volumes#awselasticblockstore
-
-
azureDisk (AzureDiskVolumeSource)
AzureDisk represents an Azure Data Disk mount on the host and bind mount to the pod.
AzureDisk represents an Azure Data Disk mount on the host and bind mount to the pod.
-
azureDisk.diskName (string), required
The Name of the data disk in the blob storage
-
azureDisk.diskURI (string), required
The URI of the data disk in the blob storage
-
azureDisk.cachingMode (string)
Host Caching mode: None, Read Only, Read Write.
-
azureDisk.fsType (string)
Filesystem type to mount. Must be a filesystem type supported by the host operating system. Ex. "ext4", "xfs", "ntfs". Implicitly inferred to be "ext4" if unspecified.
-
azureDisk.kind (string)
Expected values: Shared (multiple blob disks per storage account), Dedicated (single blob disk per storage account), Managed (Azure managed data disk, only in a managed availability set). Defaults to Shared.
-
azureDisk.readOnly (boolean)
Defaults to false (read/write). ReadOnly here will force the ReadOnly setting in VolumeMounts.
-
-
azureFile (AzureFileVolumeSource)
AzureFile represents an Azure File Service mount on the host and bind mount to the pod.
AzureFile represents an Azure File Service mount on the host and bind mount to the pod.
-
azureFile.secretName (string), required
the name of the secret that contains the Azure Storage Account Name and Key
-
azureFile.shareName (string), required
Share Name
-
azureFile.readOnly (boolean)
Defaults to false (read/write). ReadOnly here will force the ReadOnly setting in VolumeMounts.
-
-
cephfs (CephFSVolumeSource)
CephFS represents a Ceph FS mount on the host that shares a pod's lifetime
Represents a Ceph Filesystem mount that lasts the lifetime of a pod. CephFS volumes do not support ownership management or SELinux relabeling.
-
cephfs.monitors ([]string), required
Required: Monitors is a collection of Ceph monitors More info: https://examples.k8s.io/volumes/cephfs/README.md#how-to-use-it
-
cephfs.path (string)
Optional: Used as the mounted root, rather than the full Ceph tree, default is /
-
cephfs.readOnly (boolean)
Optional: Defaults to false (read/write). ReadOnly here will force the ReadOnly setting in VolumeMounts. More info: https://examples.k8s.io/volumes/cephfs/README.md#how-to-use-it
-
cephfs.secretFile (string)
Optional: SecretFile is the path to key ring for User, default is /etc/ceph/user.secret More info: https://examples.k8s.io/volumes/cephfs/README.md#how-to-use-it
-
cephfs.secretRef (LocalObjectReference)
Optional: SecretRef is reference to the authentication secret for User, default is empty. More info: https://examples.k8s.io/volumes/cephfs/README.md#how-to-use-it
-
cephfs.user (string)
Optional: User is the rados user name, default is admin More info: https://examples.k8s.io/volumes/cephfs/README.md#how-to-use-it
-
-
cinder (CinderVolumeSource)
Cinder represents a cinder volume attached and mounted on the kubelet's host machine. More info: https://examples.k8s.io/mysql-cinder-pd/README.md
Represents a cinder volume resource in OpenStack. A Cinder volume must exist before mounting to a container. The volume must also be in the same region as the kubelet. Cinder volumes support ownership management and SELinux relabeling.
-
cinder.volumeID (string), required
volume id used to identify the volume in cinder. More info: https://examples.k8s.io/mysql-cinder-pd/README.md
-
cinder.fsType (string)
Filesystem type to mount. Must be a filesystem type supported by the host operating system. Examples: "ext4", "xfs", "ntfs". Implicitly inferred to be "ext4" if unspecified. More info: https://examples.k8s.io/mysql-cinder-pd/README.md
-
cinder.readOnly (boolean)
Optional: Defaults to false (read/write). ReadOnly here will force the ReadOnly setting in VolumeMounts. More info: https://examples.k8s.io/mysql-cinder-pd/README.md
-
cinder.secretRef (LocalObjectReference)
Optional: points to a secret object containing parameters used to connect to OpenStack.
-
-
csi (CSIVolumeSource)
CSI (Container Storage Interface) represents ephemeral storage that is handled by certain external CSI drivers (Beta feature).
Represents a source location of a volume to mount, managed by an external CSI driver
-
csi.driver (string), required
Driver is the name of the CSI driver that handles this volume. Consult with your admin for the correct name as registered in the cluster.
-
csi.fsType (string)
Filesystem type to mount. Ex. "ext4", "xfs", "ntfs". If not provided, the empty value is passed to the associated CSI driver which will determine the default filesystem to apply.
-
csi.nodePublishSecretRef (LocalObjectReference)
NodePublishSecretRef is a reference to the secret object containing sensitive information to pass to the CSI driver to complete the CSI NodePublishVolume and NodeUnpublishVolume calls. This field is optional, and may be empty if no secret is required. If the secret object contains more than one secret, all secret references are passed.
-
csi.readOnly (boolean)
Specifies a read-only configuration for the volume. Defaults to false (read/write).
-
csi.volumeAttributes (map[string]string)
VolumeAttributes stores driver-specific properties that are passed to the CSI driver. Consult your driver's documentation for supported values.
-
-
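A spec.volumes fragment sketching an inline CSI volume; the driver name, secret, and attribute are placeholders, and the driver must be registered in the cluster and support ephemeral inline use:

```yaml
volumes:
  - name: inline-csi
    csi:
      driver: csi.example.com    # illustrative; ask your admin for the registered name
      fsType: ext4               # empty means the driver picks the default filesystem
      volumeAttributes:          # driver-specific; consult the driver's documentation
        size: 1Gi                # illustrative attribute
      nodePublishSecretRef:
        name: csi-secret         # illustrative; omit if no secret is required
```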
fc (FCVolumeSource)
FC represents a Fibre Channel resource that is attached to a kubelet's host machine and then exposed to the pod.
Represents a Fibre Channel volume. Fibre Channel volumes can only be mounted as read/write once. Fibre Channel volumes support ownership management and SELinux relabeling.
-
fc.fsType (string)
Filesystem type to mount. Must be a filesystem type supported by the host operating system. Ex. "ext4", "xfs", "ntfs". Implicitly inferred to be "ext4" if unspecified.
-
fc.lun (int32)
Optional: FC target lun number
-
fc.readOnly (boolean)
Optional: Defaults to false (read/write). ReadOnly here will force the ReadOnly setting in VolumeMounts.
-
fc.targetWWNs ([]string)
Optional: FC target worldwide names (WWNs)
-
fc.wwids ([]string)
Optional: FC volume world wide identifiers (wwids) Either wwids or combination of targetWWNs and lun must be set, but not both simultaneously.
-
-
flexVolume (FlexVolumeSource)
FlexVolume represents a generic volume resource that is provisioned/attached using an exec based plugin.
FlexVolume represents a generic volume resource that is provisioned/attached using an exec based plugin.
-
flexVolume.driver (string), required
Driver is the name of the driver to use for this volume.
-
flexVolume.fsType (string)
Filesystem type to mount. Must be a filesystem type supported by the host operating system. Ex. "ext4", "xfs", "ntfs". The default filesystem depends on FlexVolume script.
-
flexVolume.options (map[string]string)
Optional: Extra command options if any.
-
flexVolume.readOnly (boolean)
Optional: Defaults to false (read/write). ReadOnly here will force the ReadOnly setting in VolumeMounts.
-
flexVolume.secretRef (LocalObjectReference)
Optional: SecretRef is reference to the secret object containing sensitive information to pass to the plugin scripts. This may be empty if no secret object is specified. If the secret object contains more than one secret, all secrets are passed to the plugin scripts.
-
-
flocker (FlockerVolumeSource)
Flocker represents a Flocker volume attached to a kubelet's host machine. This depends on the Flocker control service being running
Represents a Flocker volume mounted by the Flocker agent. One and only one of datasetName and datasetUUID should be set. Flocker volumes do not support ownership management or SELinux relabeling.
-
flocker.datasetName (string)
Name of the dataset stored as metadata -> name on the dataset for Flocker. Should be considered deprecated.
-
flocker.datasetUUID (string)
UUID of the dataset. This is the unique identifier of a Flocker dataset.
-
-
gcePersistentDisk (GCEPersistentDiskVolumeSource)
GCEPersistentDisk represents a GCE Disk resource that is attached to a kubelet's host machine and then exposed to the pod. More info: https://kubernetes.io/docs/concepts/storage/volumes#gcepersistentdisk
*Represents a Persistent Disk resource in Google Compute Engine.
A GCE PD must exist before mounting to a container. The disk must also be in the same GCE project and zone as the kubelet. A GCE PD can only be mounted as read/write once or read-only many times. GCE PDs support ownership management and SELinux relabeling.*
-
gcePersistentDisk.pdName (string), required
Unique name of the PD resource in GCE. Used to identify the disk in GCE. More info: https://kubernetes.io/docs/concepts/storage/volumes#gcepersistentdisk
-
gcePersistentDisk.fsType (string)
Filesystem type of the volume that you want to mount. Tip: Ensure that the filesystem type is supported by the host operating system. Examples: "ext4", "xfs", "ntfs". Implicitly inferred to be "ext4" if unspecified. More info: https://kubernetes.io/docs/concepts/storage/volumes#gcepersistentdisk
-
gcePersistentDisk.partition (int32)
The partition in the volume that you want to mount. If omitted, the default is to mount by volume name. Examples: For volume /dev/sda1, you specify the partition as "1". Similarly, the volume partition for /dev/sda is "0" (or you can leave the property empty). More info: https://kubernetes.io/docs/concepts/storage/volumes#gcepersistentdisk
-
gcePersistentDisk.readOnly (boolean)
ReadOnly here will force the ReadOnly setting in VolumeMounts. Defaults to false. More info: https://kubernetes.io/docs/concepts/storage/volumes#gcepersistentdisk
-
-
glusterfs (GlusterfsVolumeSource)
Glusterfs represents a Glusterfs mount on the host that shares a pod's lifetime. More info: https://examples.k8s.io/volumes/glusterfs/README.md
Represents a Glusterfs mount that lasts the lifetime of a pod. Glusterfs volumes do not support ownership management or SELinux relabeling.
-
glusterfs.endpoints (string), required
EndpointsName is the endpoint name that details Glusterfs topology. More info: https://examples.k8s.io/volumes/glusterfs/README.md#create-a-pod
-
glusterfs.path (string), required
Path is the Glusterfs volume path. More info: https://examples.k8s.io/volumes/glusterfs/README.md#create-a-pod
-
glusterfs.readOnly (boolean)
ReadOnly here will force the Glusterfs volume to be mounted with read-only permissions. Defaults to false. More info: https://examples.k8s.io/volumes/glusterfs/README.md#create-a-pod
-
-
iscsi (ISCSIVolumeSource)
ISCSI represents an ISCSI Disk resource that is attached to a kubelet's host machine and then exposed to the pod. More info: https://examples.k8s.io/volumes/iscsi/README.md
Represents an ISCSI disk. ISCSI volumes can only be mounted as read/write once. ISCSI volumes support ownership management and SELinux relabeling.
-
iscsi.iqn (string), required
Target iSCSI Qualified Name.
-
iscsi.lun (int32), required
iSCSI Target Lun number.
-
iscsi.targetPortal (string), required
iSCSI Target Portal. The Portal is either an IP or ip_addr:port if the port is other than default (typically TCP ports 860 and 3260).
-
iscsi.chapAuthDiscovery (boolean)
Whether to support iSCSI Discovery CHAP authentication
-
iscsi.chapAuthSession (boolean)
Whether to support iSCSI Session CHAP authentication
-
iscsi.fsType (string)
Filesystem type of the volume that you want to mount. Tip: Ensure that the filesystem type is supported by the host operating system. Examples: "ext4", "xfs", "ntfs". Implicitly inferred to be "ext4" if unspecified. More info: https://kubernetes.io/docs/concepts/storage/volumes#iscsi
-
iscsi.initiatorName (string)
Custom iSCSI Initiator Name. If initiatorName is specified with iscsiInterface simultaneously, new iSCSI interface <target portal>:<volume name> will be created for the connection.
-
iscsi.iscsiInterface (string)
iSCSI Interface Name that uses an iSCSI transport. Defaults to 'default' (tcp).
-
iscsi.portals ([]string)
iSCSI Target Portal List. The portal is either an IP or ip_addr:port if the port is other than default (typically TCP ports 860 and 3260).
-
iscsi.readOnly (boolean)
ReadOnly here will force the ReadOnly setting in VolumeMounts. Defaults to false.
-
iscsi.secretRef (LocalObjectReference)
CHAP Secret for iSCSI target and initiator authentication
-
-
nfs (NFSVolumeSource)
NFS represents an NFS mount on the host that shares a pod's lifetime More info: https://kubernetes.io/docs/concepts/storage/volumes#nfs
Represents an NFS mount that lasts the lifetime of a pod. NFS volumes do not support ownership management or SELinux relabeling.
-
nfs.path (string), required
Path that is exported by the NFS server. More info: https://kubernetes.io/docs/concepts/storage/volumes#nfs
-
nfs.server (string), required
Server is the hostname or IP address of the NFS server. More info: https://kubernetes.io/docs/concepts/storage/volumes#nfs
-
nfs.readOnly (boolean)
ReadOnly here will force the NFS export to be mounted with read-only permissions. Defaults to false. More info: https://kubernetes.io/docs/concepts/storage/volumes#nfs
-
-
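A spec.volumes fragment for an NFS mount; the server and export path are placeholders:

```yaml
volumes:
  - name: shared
    nfs:
      server: nfs.example.com   # hostname or IP of the NFS server
      path: /exports/data       # path exported by the server
      readOnly: true            # mount the export read-only
```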
photonPersistentDisk (PhotonPersistentDiskVolumeSource)
PhotonPersistentDisk represents a PhotonController persistent disk attached and mounted on the kubelet's host machine
Represents a Photon Controller persistent disk resource.
-
photonPersistentDisk.pdID (string), required
ID that identifies Photon Controller persistent disk
-
photonPersistentDisk.fsType (string)
Filesystem type to mount. Must be a filesystem type supported by the host operating system. Ex. "ext4", "xfs", "ntfs". Implicitly inferred to be "ext4" if unspecified.
-
-
portworxVolume (PortworxVolumeSource)
PortworxVolume represents a Portworx volume attached and mounted on the kubelet's host machine
PortworxVolumeSource represents a Portworx volume resource.
-
portworxVolume.volumeID (string), required
VolumeID uniquely identifies a Portworx volume
-
portworxVolume.fsType (string)
FSType represents the filesystem type to mount. Must be a filesystem type supported by the host operating system. Ex. "ext4", "xfs". Implicitly inferred to be "ext4" if unspecified.
-
portworxVolume.readOnly (boolean)
Defaults to false (read/write). ReadOnly here will force the ReadOnly setting in VolumeMounts.
-
-
quobyte (QuobyteVolumeSource)
Quobyte represents a Quobyte mount on the host that shares a pod's lifetime
Represents a Quobyte mount that lasts the lifetime of a pod. Quobyte volumes do not support ownership management or SELinux relabeling.
-
quobyte.registry (string), required
Registry represents one or more Quobyte Registry services, specified as a comma-separated string of host:port pairs, which acts as the central registry for volumes
-
quobyte.volume (string), required
Volume is a string that references an already created Quobyte volume by name.
-
quobyte.group (string)
Group to map volume access to. Default is no group.
-
quobyte.readOnly (boolean)
ReadOnly here will force the Quobyte volume to be mounted with read-only permissions. Defaults to false.
-
quobyte.tenant (string)
Tenant owning the given Quobyte volume in the backend. Used with dynamically provisioned Quobyte volumes; the value is set by the plugin.
-
quobyte.user (string)
User to map volume access to. Defaults to the serviceaccount user.
-
-
rbd (RBDVolumeSource)
RBD represents a Rados Block Device mount on the host that shares a pod's lifetime. More info: https://examples.k8s.io/volumes/rbd/README.md
Represents a Rados Block Device mount that lasts the lifetime of a pod. RBD volumes support ownership management and SELinux relabeling.
-
rbd.image (string), required
The rados image name. More info: https://examples.k8s.io/volumes/rbd/README.md#how-to-use-it
-
rbd.monitors ([]string), required
A collection of Ceph monitors. More info: https://examples.k8s.io/volumes/rbd/README.md#how-to-use-it
-
rbd.fsType (string)
Filesystem type of the volume that you want to mount. Tip: Ensure that the filesystem type is supported by the host operating system. Examples: "ext4", "xfs", "ntfs". Implicitly inferred to be "ext4" if unspecified. More info: https://kubernetes.io/docs/concepts/storage/volumes#rbd
-
rbd.keyring (string)
Keyring is the path to key ring for RBDUser. Default is /etc/ceph/keyring. More info: https://examples.k8s.io/volumes/rbd/README.md#how-to-use-it
-
rbd.pool (string)
The rados pool name. Default is rbd. More info: https://examples.k8s.io/volumes/rbd/README.md#how-to-use-it
-
rbd.readOnly (boolean)
ReadOnly here will force the ReadOnly setting in VolumeMounts. Defaults to false. More info: https://examples.k8s.io/volumes/rbd/README.md#how-to-use-it
-
rbd.secretRef (LocalObjectReference)
SecretRef is name of the authentication secret for RBDUser. If provided overrides keyring. Default is nil. More info: https://examples.k8s.io/volumes/rbd/README.md#how-to-use-it
-
rbd.user (string)
The rados user name. Default is admin. More info: https://examples.k8s.io/volumes/rbd/README.md#how-to-use-it
-
-
scaleIO (ScaleIOVolumeSource)
ScaleIO represents a ScaleIO persistent volume attached and mounted on Kubernetes nodes.
ScaleIOVolumeSource represents a persistent ScaleIO volume
-
scaleIO.gateway (string), required
The host address of the ScaleIO API Gateway.
-
scaleIO.secretRef (LocalObjectReference), required
SecretRef references to the secret for ScaleIO user and other sensitive information. If this is not provided, Login operation will fail.
-
scaleIO.system (string), required
The name of the storage system as configured in ScaleIO.
-
scaleIO.fsType (string)
Filesystem type to mount. Must be a filesystem type supported by the host operating system. Ex. "ext4", "xfs", "ntfs". Default is "xfs".
-
scaleIO.protectionDomain (string)
The name of the ScaleIO Protection Domain for the configured storage.
-
scaleIO.readOnly (boolean)
Defaults to false (read/write). ReadOnly here will force the ReadOnly setting in VolumeMounts.
-
scaleIO.sslEnabled (boolean)
Flag to enable/disable SSL communication with Gateway, default false
-
scaleIO.storageMode (string)
Indicates whether the storage for a volume should be ThickProvisioned or ThinProvisioned. Default is ThinProvisioned.
-
scaleIO.storagePool (string)
The ScaleIO Storage Pool associated with the protection domain.
-
scaleIO.volumeName (string)
The name of a volume already created in the ScaleIO system that is associated with this volume source.
-
-
storageos (StorageOSVolumeSource)
StorageOS represents a StorageOS volume attached and mounted on Kubernetes nodes.
Represents a StorageOS persistent volume resource.
-
storageos.fsType (string)
Filesystem type to mount. Must be a filesystem type supported by the host operating system. Ex. "ext4", "xfs", "ntfs". Implicitly inferred to be "ext4" if unspecified.
-
storageos.readOnly (boolean)
Defaults to false (read/write). ReadOnly here will force the ReadOnly setting in VolumeMounts.
-
storageos.secretRef (LocalObjectReference)
SecretRef specifies the secret to use for obtaining the StorageOS API credentials. If not specified, default values will be attempted.
-
storageos.volumeName (string)
VolumeName is the human-readable name of the StorageOS volume. Volume names are only unique within a namespace.
-
storageos.volumeNamespace (string)
VolumeNamespace specifies the scope of the volume within StorageOS. If no namespace is specified then the Pod's namespace will be used. This allows the Kubernetes name scoping to be mirrored within StorageOS for tighter integration. Set VolumeName to any name to override the default behaviour. Set to "default" if you are not using namespaces within StorageOS. Namespaces that do not pre-exist within StorageOS will be created.
-
-
vsphereVolume (VsphereVirtualDiskVolumeSource)
VsphereVolume represents a vSphere volume attached and mounted on the kubelet's host machine
Represents a vSphere volume resource.
-
vsphereVolume.volumePath (string), required
Path that identifies vSphere volume vmdk
-
vsphereVolume.fsType (string)
Filesystem type to mount. Must be a filesystem type supported by the host operating system. Ex. "ext4", "xfs", "ntfs". Implicitly inferred to be "ext4" if unspecified.
-
vsphereVolume.storagePolicyID (string)
Storage Policy Based Management (SPBM) profile ID associated with the StoragePolicyName.
-
vsphereVolume.storagePolicyName (string)
Storage Policy Based Management (SPBM) profile name.
-
Alpha level
-
ephemeral (EphemeralVolumeSource)
Ephemeral represents a volume that is handled by a cluster storage driver. The volume's lifecycle is tied to the pod that defines it - it will be created before the pod starts, and deleted when the pod is removed.
Use this if: a) the volume is only needed while the pod runs, b) features of normal volumes like restoring from snapshot or capacity tracking are needed, c) the storage driver is specified through a storage class, and d) the storage driver supports dynamic volume provisioning through a PersistentVolumeClaim (see EphemeralVolumeSource for more information on the connection between this volume type and PersistentVolumeClaim).
Use PersistentVolumeClaim or one of the vendor-specific APIs for volumes that persist for longer than the lifecycle of an individual pod.
Use CSI for light-weight local ephemeral volumes if the CSI driver is meant to be used that way - see the documentation of the driver for more information.
A pod can use both types of ephemeral volumes and persistent volumes at the same time.
Represents an ephemeral volume that is handled by a normal storage driver.
-
ephemeral.volumeClaimTemplate (PersistentVolumeClaimTemplate)
Will be used to create a stand-alone PVC to provision the volume. The pod in which this EphemeralVolumeSource is embedded will be the owner of the PVC, i.e. the PVC will be deleted together with the pod. The name of the PVC will be <pod name>-<volume name>, where <volume name> is the name from the PodSpec.Volumes array entry. Pod validation will reject the pod if the concatenated name is not valid for a PVC (for example, too long).
An existing PVC with that name that is not owned by the pod will not be used for the pod, to avoid using an unrelated volume by mistake. Starting the pod is then blocked until the unrelated PVC is removed. If such a pre-created PVC is meant to be used by the pod, the PVC has to be updated with an owner reference to the pod once the pod exists. Normally this should not be necessary, but it may be useful when manually reconstructing a broken cluster.
This field is read-only and no changes will be made by Kubernetes to the PVC after it has been created.
Required, must not be nil.
PersistentVolumeClaimTemplate is used to produce PersistentVolumeClaim objects as part of an EphemeralVolumeSource.
-
ephemeral.volumeClaimTemplate.spec (PersistentVolumeClaimSpec), required
The specification for the PersistentVolumeClaim. The entire content is copied unchanged into the PVC that gets created from this template. The same fields as in a PersistentVolumeClaim are also valid here.
-
ephemeral.volumeClaimTemplate.metadata (ObjectMeta)
May contain labels and annotations that will be copied into the PVC when creating it. No other fields are allowed and will be rejected during validation.
-
-
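A spec.volumes fragment sketching a generic ephemeral volume; per the naming rule above, the resulting PVC would be named <pod name>-scratch, and the standard StorageClass is a placeholder:

```yaml
volumes:
  - name: scratch
    ephemeral:
      volumeClaimTemplate:
        metadata:
          labels:                      # only labels and annotations are allowed here
            type: ephemeral
        spec:                          # same fields as a PersistentVolumeClaimSpec
          accessModes: ["ReadWriteOnce"]
          storageClassName: standard   # illustrative StorageClass
          resources:
            requests:
              storage: 1Gi
```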
Deprecated
-
gitRepo (GitRepoVolumeSource)
GitRepo represents a git repository at a particular revision. DEPRECATED: GitRepo is deprecated. To provision a container with a git repo, mount an EmptyDir into an InitContainer that clones the repo using git, then mount the EmptyDir into the Pod's container.
*Represents a volume that is populated with the contents of a git repository. Git repo volumes do not support ownership management. Git repo volumes support SELinux relabeling.
DEPRECATED: GitRepo is deprecated. To provision a container with a git repo, mount an EmptyDir into an InitContainer that clones the repo using git, then mount the EmptyDir into the Pod's container.*
-
gitRepo.repository (string), required
Repository URL
-
gitRepo.directory (string)
Target directory name. Must not contain or start with '..'. If '.' is supplied, the volume directory will be the git repository. Otherwise, if specified, the volume will contain the git repository in the subdirectory with the given name.
-
gitRepo.revision (string)
Commit hash for the specified revision.
-
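Since gitRepo is deprecated, here is a sketch of the replacement pattern described above: an init container clones into an emptyDir that the main container then mounts. The image and repository URL are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: git-clone-demo              # illustrative name
spec:
  initContainers:
    - name: clone
      image: alpine/git             # illustrative git-capable image
      args: ["clone", "--single-branch", "https://example.com/repo.git", "/repo"]
      volumeMounts:
        - name: source
          mountPath: /repo
  containers:
    - name: app
      image: registry.k8s.io/pause:3.6
      volumeMounts:
        - name: source
          mountPath: /repo
  volumes:
    - name: source
      emptyDir: {}                  # holds the cloned repository
```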
DownwardAPIVolumeFile
DownwardAPIVolumeFile represents information to create the file containing the pod field
-
path (string), required
Required: Path is the relative path name of the file to be created. Must not be absolute or contain the '..' path. Must be UTF-8 encoded. The first item of the relative path must not start with '..'
-
fieldRef (ObjectFieldSelector)
Required: Selects a field of the pod: only annotations, labels, name and namespace are supported.
-
mode (int32)
Optional: mode bits used to set permissions on this file, must be an octal value between 0000 and 0777 or a decimal value between 0 and 511. YAML accepts both octal and decimal values, JSON requires decimal values for mode bits. If not specified, the volume defaultMode will be used. This might be in conflict with other options that affect the file mode, like fsGroup, and the result can be other mode bits set.
-
resourceFieldRef (ResourceFieldSelector)
Selects a resource of the container: only resources limits and requests (limits.cpu, limits.memory, requests.cpu and requests.memory) are currently supported.
KeyToPath
Maps a string key to a path within a volume.
-
key (string), required
The key to project.
-
path (string), required
The relative path of the file to map the key to. May not be an absolute path. May not contain the path element '..'. May not start with the string '..'.
-
mode (int32)
Optional: mode bits used to set permissions on this file. Must be an octal value between 0000 and 0777 or a decimal value between 0 and 511. YAML accepts both octal and decimal values, JSON requires decimal values for mode bits. If not specified, the volume defaultMode will be used. This might be in conflict with other options that affect the file mode, like fsGroup, and the result can be other mode bits set.
6.5.3.4 - PersistentVolumeClaim
apiVersion: v1
import "k8s.io/api/core/v1"
PersistentVolumeClaim
PersistentVolumeClaim is a user's request for and claim to a persistent volume
-
apiVersion: v1
-
kind: PersistentVolumeClaim
-
metadata (ObjectMeta)
Standard object's metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
spec (PersistentVolumeClaimSpec)
Spec defines the desired characteristics of a volume requested by a pod author. More info: https://kubernetes.io/docs/concepts/storage/persistent-volumes#persistentvolumeclaims
-
status (PersistentVolumeClaimStatus)
Status represents the current information/status of a persistent volume claim. Read-only. More info: https://kubernetes.io/docs/concepts/storage/persistent-volumes#persistentvolumeclaims
PersistentVolumeClaimSpec
PersistentVolumeClaimSpec describes the common attributes of storage devices and allows a Source for provider-specific attributes
-
accessModes ([]string)
AccessModes contains the desired access modes the volume should have. More info: https://kubernetes.io/docs/concepts/storage/persistent-volumes#access-modes-1
-
selector (LabelSelector)
A label query over volumes to consider for binding.
-
resources (ResourceRequirements)
Resources represents the minimum resources the volume should have. If RecoverVolumeExpansionFailure feature is enabled users are allowed to specify resource requirements that are lower than previous value but must still be higher than capacity recorded in the status field of the claim. More info: https://kubernetes.io/docs/concepts/storage/persistent-volumes#resources
ResourceRequirements describes the compute resource requirements.
-
resources.limits (map[string]Quantity)
Limits describes the maximum amount of compute resources allowed. More info: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
-
resources.requests (map[string]Quantity)
Requests describes the minimum amount of compute resources required. If Requests is omitted for a container, it defaults to Limits if that is explicitly specified, otherwise to an implementation-defined value. More info: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
-
-
volumeName (string)
VolumeName is the binding reference to the PersistentVolume backing this claim.
-
storageClassName (string)
Name of the StorageClass required by the claim. More info: https://kubernetes.io/docs/concepts/storage/persistent-volumes#class-1
-
volumeMode (string)
volumeMode defines what type of volume is required by the claim. Value of Filesystem is implied when not included in claim spec.
Alpha level
-
dataSource (TypedLocalObjectReference)
This field can be used to specify either: * An existing VolumeSnapshot object (snapshot.storage.k8s.io/VolumeSnapshot) * An existing PVC (PersistentVolumeClaim) If the provisioner or an external controller can support the specified data source, it will create a new volume based on the contents of the specified data source. If the AnyVolumeDataSource feature gate is enabled, this field will always have the same contents as the DataSourceRef field.
-
dataSourceRef (TypedLocalObjectReference)
Specifies the object from which to populate the volume with data, if a non-empty volume is desired. This may be any local object from a non-empty API group (non-core object) or a PersistentVolumeClaim object. When this field is specified, volume binding will only succeed if the type of the specified object matches some installed volume populator or dynamic provisioner. This field will replace the functionality of the DataSource field and as such if both fields are non-empty, they must have the same value. For backwards compatibility, both fields (DataSource and DataSourceRef) will be set to the same value automatically if one of them is empty and the other is non-empty. There are two important differences between DataSource and DataSourceRef:
- While DataSource only allows two specific types of objects, DataSourceRef allows any non-core object, as well as PersistentVolumeClaim objects.
- While DataSource ignores disallowed values (dropping them), DataSourceRef preserves all values, and generates an error if a disallowed value is specified.
(Alpha) Using this field requires the AnyVolumeDataSource feature gate to be enabled.
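Putting the spec fields together, a minimal PersistentVolumeClaim manifest; my-claim and the standard StorageClass are placeholders:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-claim               # illustrative name
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem       # implied when omitted
  storageClassName: standard   # illustrative StorageClass
  resources:
    requests:
      storage: 8Gi             # minimum storage the volume should have
```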
PersistentVolumeClaimStatus
PersistentVolumeClaimStatus is the current status of a persistent volume claim.
-
accessModes ([]string)
AccessModes contains the actual access modes the volume backing the PVC has. More info: https://kubernetes.io/docs/concepts/storage/persistent-volumes#access-modes-1
-
allocatedResources (map[string]Quantity)
The storage resource within AllocatedResources tracks the capacity allocated to a PVC. It may be larger than the actual capacity when a volume expansion operation is requested. For storage quota, the larger value from allocatedResources and PVC.spec.resources is used. If allocatedResources is not set, PVC.spec.resources alone is used for quota calculation. If a volume expansion capacity request is lowered, allocatedResources is only lowered if there are no expansion operations in progress and if the actual volume capacity is equal or lower than the requested capacity. This is an alpha field and requires enabling RecoverVolumeExpansionFailure feature.
-
capacity (map[string]Quantity)
Represents the actual resources of the underlying volume.
-
conditions ([]PersistentVolumeClaimCondition)
Patch strategy: merge on key type
Current Condition of persistent volume claim. If the underlying persistent volume is being resized, then the Condition will be set to 'ResizeStarted'.
PersistentVolumeClaimCondition contains details about the state of a PVC.
-
conditions.status (string), required
-
conditions.type (string), required
Possible enum values:
- "FileSystemResizePending" - controller resize is finished and a file system resize is pending on the node
- "Resizing" - a user-triggered resize of the PVC has been started
-
conditions.lastProbeTime (Time)
Last time we probed the condition.
Time is a wrapper around time.Time which supports correct marshaling to YAML and JSON. Wrappers are provided for many of the factory methods that the time package offers.
-
conditions.lastTransitionTime (Time)
Last time the condition transitioned from one status to another.
Time is a wrapper around time.Time which supports correct marshaling to YAML and JSON. Wrappers are provided for many of the factory methods that the time package offers.
-
conditions.message (string)
Human-readable message indicating details about last transition.
-
conditions.reason (string)
Unique; this should be a short, machine-understandable string that gives the reason for the condition's last transition. If it reports "ResizeStarted", that means the underlying persistent volume is being resized.
-
-
phase (string)
Phase represents the current phase of PersistentVolumeClaim.
Possible enum values:
- "Bound" - used for PersistentVolumeClaims that are bound
- "Lost" - used for PersistentVolumeClaims that lost their underlying PersistentVolume. The claim was bound to a PersistentVolume and this volume does not exist any longer and all data on it was lost.
- "Pending" - used for PersistentVolumeClaims that are not yet bound
-
resizeStatus (string)
ResizeStatus stores status of resize operation. ResizeStatus is not set by default but when expansion is complete resizeStatus is set to empty string by resize controller or kubelet. This is an alpha field and requires enabling RecoverVolumeExpansionFailure feature.
PersistentVolumeClaimList
PersistentVolumeClaimList is a list of PersistentVolumeClaim items.
-
apiVersion: v1
-
kind: PersistentVolumeClaimList
-
metadata (ListMeta)
Standard list metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds
-
items ([]PersistentVolumeClaim), required
A list of persistent volume claims. More info: https://kubernetes.io/docs/concepts/storage/persistent-volumes#persistentvolumeclaims
Operations
get
read the specified PersistentVolumeClaim
HTTP Request
GET /api/v1/namespaces/{namespace}/persistentvolumeclaims/{name}
Parameters
-
name (in path): string, required
name of the PersistentVolumeClaim
-
namespace (in path): string, required
-
pretty (in query): string
Response
200 (PersistentVolumeClaim): OK
401: Unauthorized
get
read status of the specified PersistentVolumeClaim
HTTP Request
GET /api/v1/namespaces/{namespace}/persistentvolumeclaims/{name}/status
Parameters
-
name (in path): string, required
name of the PersistentVolumeClaim
-
namespace (in path): string, required
-
pretty (in query): string
Response
200 (PersistentVolumeClaim): OK
401: Unauthorized
list
list or watch objects of kind PersistentVolumeClaim
HTTP Request
GET /api/v1/namespaces/{namespace}/persistentvolumeclaims
Parameters
-
namespace (in path): string, required
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (PersistentVolumeClaimList): OK
401: Unauthorized
list
list or watch objects of kind PersistentVolumeClaim
HTTP Request
GET /api/v1/persistentvolumeclaims
Parameters
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (PersistentVolumeClaimList): OK
401: Unauthorized
create
create a PersistentVolumeClaim
HTTP Request
POST /api/v1/namespaces/{namespace}/persistentvolumeclaims
Parameters
-
namespace (in path): string, required
-
body: PersistentVolumeClaim, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (PersistentVolumeClaim): OK
201 (PersistentVolumeClaim): Created
202 (PersistentVolumeClaim): Accepted
401: Unauthorized
update
replace the specified PersistentVolumeClaim
HTTP Request
PUT /api/v1/namespaces/{namespace}/persistentvolumeclaims/{name}
Parameters
-
name (in path): string, required
name of the PersistentVolumeClaim
-
namespace (in path): string, required
-
body: PersistentVolumeClaim, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (PersistentVolumeClaim): OK
201 (PersistentVolumeClaim): Created
401: Unauthorized
update
replace status of the specified PersistentVolumeClaim
HTTP Request
PUT /api/v1/namespaces/{namespace}/persistentvolumeclaims/{name}/status
Parameters
-
name (in path): string, required
name of the PersistentVolumeClaim
-
namespace (in path): string, required
-
body: PersistentVolumeClaim, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (PersistentVolumeClaim): OK
201 (PersistentVolumeClaim): Created
401: Unauthorized
patch
partially update the specified PersistentVolumeClaim
HTTP Request
PATCH /api/v1/namespaces/{namespace}/persistentvolumeclaims/{name}
Parameters
-
name (in path): string, required
name of the PersistentVolumeClaim
-
namespace (in path): string, required
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (PersistentVolumeClaim): OK
201 (PersistentVolumeClaim): Created
401: Unauthorized
patch
partially update status of the specified PersistentVolumeClaim
HTTP Request
PATCH /api/v1/namespaces/{namespace}/persistentvolumeclaims/{name}/status
Parameters
-
name (in path): string, required
name of the PersistentVolumeClaim
-
namespace (in path): string, required
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (PersistentVolumeClaim): OK
201 (PersistentVolumeClaim): Created
401: Unauthorized
delete
delete a PersistentVolumeClaim
HTTP Request
DELETE /api/v1/namespaces/{namespace}/persistentvolumeclaims/{name}
Parameters
-
name (in path): string, required
name of the PersistentVolumeClaim
-
namespace (in path): string, required
-
body: DeleteOptions
-
dryRun (in query): string
-
gracePeriodSeconds (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
Response
200 (PersistentVolumeClaim): OK
202 (PersistentVolumeClaim): Accepted
401: Unauthorized
deletecollection
delete collection of PersistentVolumeClaim
HTTP Request
DELETE /api/v1/namespaces/{namespace}/persistentvolumeclaims
Parameters
-
namespace (in path): string, required
-
body: DeleteOptions
-
continue (in query): string
-
dryRun (in query): string
-
fieldSelector (in query): string
-
gracePeriodSeconds (in query): integer
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
Response
200 (Status): OK
401: Unauthorized
6.5.3.5 - PersistentVolume
apiVersion: v1
import "k8s.io/api/core/v1"
PersistentVolume
PersistentVolume (PV) is a storage resource provisioned by an administrator. It is analogous to a node. More info: https://kubernetes.io/docs/concepts/storage/persistent-volumes
-
apiVersion: v1
-
kind: PersistentVolume
-
metadata (ObjectMeta)
Standard object's metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
spec (PersistentVolumeSpec)
Spec defines a specification of a persistent volume owned by the cluster. Provisioned by an administrator. More info: https://kubernetes.io/docs/concepts/storage/persistent-volumes#persistent-volumes
-
status (PersistentVolumeStatus)
Status represents the current information/status for the persistent volume. Populated by the system. Read-only. More info: https://kubernetes.io/docs/concepts/storage/persistent-volumes#persistent-volumes
PersistentVolumeSpec
PersistentVolumeSpec is the specification of a persistent volume.
-
accessModes ([]string)
AccessModes contains all ways the volume can be mounted. More info: https://kubernetes.io/docs/concepts/storage/persistent-volumes#access-modes
-
capacity (map[string]Quantity)
A description of the persistent volume's resources and capacity. More info: https://kubernetes.io/docs/concepts/storage/persistent-volumes#capacity
-
claimRef (ObjectReference)
ClaimRef is part of a bi-directional binding between PersistentVolume and PersistentVolumeClaim. Expected to be non-nil when bound. claim.VolumeName is the authoritative bind between PV and PVC. More info: https://kubernetes.io/docs/concepts/storage/persistent-volumes#binding
-
mountOptions ([]string)
A list of mount options, e.g. ["ro", "soft"]. Not validated - mount will simply fail if one is invalid. More info: https://kubernetes.io/docs/concepts/storage/persistent-volumes/#mount-options
-
nodeAffinity (VolumeNodeAffinity)
NodeAffinity defines constraints that limit what nodes this volume can be accessed from. This field influences the scheduling of pods that use this volume.
VolumeNodeAffinity defines constraints that limit what nodes this volume can be accessed from.
-
nodeAffinity.required (NodeSelector)
Required specifies hard node constraints that must be met.
A node selector represents the union of the results of one or more label queries over a set of nodes; that is, it represents the OR of the selectors represented by the node selector terms.
-
nodeAffinity.required.nodeSelectorTerms ([]NodeSelectorTerm), required
Required. A list of node selector terms. The terms are ORed.
A null or empty node selector term matches no objects. Its requirements are ANDed. The TopologySelectorTerm type implements a subset of the NodeSelectorTerm.
-
nodeAffinity.required.nodeSelectorTerms.matchExpressions ([]NodeSelectorRequirement)
A list of node selector requirements by node's labels.
-
nodeAffinity.required.nodeSelectorTerms.matchFields ([]NodeSelectorRequirement)
A list of node selector requirements by node's fields.
-
-
-
-
persistentVolumeReclaimPolicy (string)
What happens to a persistent volume when released from its claim. Valid options are Retain (default for manually created PersistentVolumes), Delete (default for dynamically provisioned PersistentVolumes), and Recycle (deprecated). Recycle must be supported by the volume plugin underlying this PersistentVolume. More info: https://kubernetes.io/docs/concepts/storage/persistent-volumes#reclaiming
Possible enum values:
- "Delete" - means the volume will be deleted from Kubernetes on release from its claim. The volume plugin must support Deletion.
- "Recycle" - means the volume will be recycled back into the pool of unbound persistent volumes on release from its claim. The volume plugin must support Recycling.
- "Retain" - means the volume will be left in its current phase (Released) for manual reclamation by the administrator. The default policy is Retain.
-
storageClassName (string)
Name of StorageClass to which this persistent volume belongs. Empty value means that this volume does not belong to any StorageClass.
-
volumeMode (string)
volumeMode defines if a volume is intended to be used with a formatted filesystem or to remain in raw block state. Value of Filesystem is implied when not included in spec.
Local
-
hostPath (HostPathVolumeSource)
HostPath represents a directory on the host. Provisioned by a developer or tester. This is useful for single-node development and testing only! On-host storage is not supported in any way and WILL NOT WORK in a multi-node cluster. More info: https://kubernetes.io/docs/concepts/storage/volumes#hostpath
Represents a host path mapped into a pod. Host path volumes do not support ownership management or SELinux relabeling.
-
hostPath.path (string), required
Path of the directory on the host. If the path is a symlink, it will follow the link to the real path. More info: https://kubernetes.io/docs/concepts/storage/volumes#hostpath
-
hostPath.type (string)
Type for HostPath volume. Defaults to "". More info: https://kubernetes.io/docs/concepts/storage/volumes#hostpath
-
-
local (LocalVolumeSource)
Local represents directly-attached storage with node affinity
Local represents directly-attached storage with node affinity (Beta feature)
-
local.path (string), required
The full path to the volume on the node. It can be either a directory or block device (disk, partition, ...).
-
local.fsType (string)
Filesystem type to mount. It applies only when the Path is a block device. Must be a filesystem type supported by the host operating system. Ex. "ext4", "xfs", "ntfs". The default value is to auto-select a filesystem if unspecified.
-
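Combining the spec fields above, a sketch of a local PersistentVolume with the node affinity that local volumes require; the node name, path, and StorageClass are placeholders:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv                      # illustrative name
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage     # illustrative StorageClass
  volumeMode: Filesystem
  local:
    path: /mnt/disks/ssd1             # directory or block device on the node
  nodeAffinity:                       # constrains which nodes can access the volume
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values: ["node-1"]      # illustrative node name
```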
Persistent volumes
-
awsElasticBlockStore (AWSElasticBlockStoreVolumeSource)
AWSElasticBlockStore represents an AWS Disk resource that is attached to a kubelet's host machine and then exposed to the pod. More info: https://kubernetes.io/docs/concepts/storage/volumes#awselasticblockstore
*Represents a Persistent Disk resource in AWS.
An AWS EBS disk must exist before mounting to a container. The disk must also be in the same AWS zone as the kubelet. An AWS EBS disk can only be mounted as read/write once. AWS EBS volumes support ownership management and SELinux relabeling.*
-
awsElasticBlockStore.volumeID (string), required
Unique ID of the persistent disk resource in AWS (Amazon EBS volume). More info: https://kubernetes.io/docs/concepts/storage/volumes#awselasticblockstore
-
awsElasticBlockStore.fsType (string)
Filesystem type of the volume that you want to mount. Tip: Ensure that the filesystem type is supported by the host operating system. Examples: "ext4", "xfs", "ntfs". Implicitly inferred to be "ext4" if unspecified. More info: https://kubernetes.io/docs/concepts/storage/volumes#awselasticblockstore
-
awsElasticBlockStore.partition (int32)
The partition in the volume that you want to mount. If omitted, the default is to mount by volume name. Examples: For volume /dev/sda1, you specify the partition as "1". Similarly, the volume partition for /dev/sda is "0" (or you can leave the property empty).
-
awsElasticBlockStore.readOnly (boolean)
Specify "true" to force and set the ReadOnly property in VolumeMounts to "true". If omitted, the default is "false". More info: https://kubernetes.io/docs/concepts/storage/volumes#awselasticblockstore
-
-
azureDisk (AzureDiskVolumeSource)
AzureDisk represents an Azure Data Disk mount on the host and bind mount to the pod.
AzureDisk represents an Azure Data Disk mount on the host and bind mount to the pod.
-
azureDisk.diskName (string), required
The Name of the data disk in the blob storage
-
azureDisk.diskURI (string), required
The URI of the data disk in the blob storage
-
azureDisk.cachingMode (string)
Host Caching mode: None, Read Only, Read Write.
-
azureDisk.fsType (string)
Filesystem type to mount. Must be a filesystem type supported by the host operating system. Ex. "ext4", "xfs", "ntfs". Implicitly inferred to be "ext4" if unspecified.
-
azureDisk.kind (string)
Expected values: Shared (multiple blob disks per storage account), Dedicated (single blob disk per storage account), Managed (Azure managed data disk, only in a managed availability set). Defaults to Shared.
-
azureDisk.readOnly (boolean)
Defaults to false (read/write). ReadOnly here will force the ReadOnly setting in VolumeMounts.
-
-
azureFile (AzureFilePersistentVolumeSource)
AzureFile represents an Azure File Service mount on the host and bind mount to the pod.
AzureFile represents an Azure File Service mount on the host and bind mount to the pod.
-
azureFile.secretName (string), required
the name of the secret that contains the Azure Storage Account Name and Key
-
azureFile.shareName (string), required
Share Name
-
azureFile.readOnly (boolean)
Defaults to false (read/write). ReadOnly here will force the ReadOnly setting in VolumeMounts.
-
azureFile.secretNamespace (string)
the namespace of the secret that contains the Azure Storage Account Name and Key; default is the same namespace as the Pod
-
-
cephfs (CephFSPersistentVolumeSource)
CephFS represents a Ceph FS mount on the host that shares a pod's lifetime
Represents a Ceph Filesystem mount that lasts the lifetime of a pod. CephFS volumes do not support ownership management or SELinux relabeling.
-
cephfs.monitors ([]string), required
Required: Monitors is a collection of Ceph monitors More info: https://examples.k8s.io/volumes/cephfs/README.md#how-to-use-it
-
cephfs.path (string)
Optional: Used as the mounted root, rather than the full Ceph tree, default is /
-
cephfs.readOnly (boolean)
Optional: Defaults to false (read/write). ReadOnly here will force the ReadOnly setting in VolumeMounts. More info: https://examples.k8s.io/volumes/cephfs/README.md#how-to-use-it
-
cephfs.secretFile (string)
Optional: SecretFile is the path to key ring for User, default is /etc/ceph/user.secret More info: https://examples.k8s.io/volumes/cephfs/README.md#how-to-use-it
-
cephfs.secretRef (SecretReference)
Optional: SecretRef is reference to the authentication secret for User, default is empty. More info: https://examples.k8s.io/volumes/cephfs/README.md#how-to-use-it
SecretReference represents a Secret Reference. It has enough information to retrieve secret in any namespace
-
cephfs.secretRef.name (string)
Name is unique within a namespace to reference a secret resource.
-
cephfs.secretRef.namespace (string)
Namespace defines the space within which the secret name must be unique.
-
-
cephfs.user (string)
Optional: User is the rados user name, default is admin More info: https://examples.k8s.io/volumes/cephfs/README.md#how-to-use-it
-
-
cinder (CinderPersistentVolumeSource)
Cinder represents a cinder volume attached and mounted on kubelets host machine. More info: https://examples.k8s.io/mysql-cinder-pd/README.md
Represents a Cinder volume resource in OpenStack. A Cinder volume must exist before mounting to a container. The volume must also be in the same region as the kubelet. Cinder volumes support ownership management and SELinux relabeling.
-
cinder.volumeID (string), required
Volume ID used to identify the volume in Cinder. More info: https://examples.k8s.io/mysql-cinder-pd/README.md
-
cinder.fsType (string)
Filesystem type to mount. Must be a filesystem type supported by the host operating system. Examples: "ext4", "xfs", "ntfs". Implicitly inferred to be "ext4" if unspecified. More info: https://examples.k8s.io/mysql-cinder-pd/README.md
-
cinder.readOnly (boolean)
Optional: Defaults to false (read/write). ReadOnly here will force the ReadOnly setting in VolumeMounts. More info: https://examples.k8s.io/mysql-cinder-pd/README.md
-
cinder.secretRef (SecretReference)
Optional: points to a secret object containing parameters used to connect to OpenStack.
SecretReference represents a Secret Reference. It has enough information to retrieve a secret in any namespace.
-
cinder.secretRef.name (string)
Name is unique within a namespace to reference a secret resource.
-
cinder.secretRef.namespace (string)
Namespace defines the space within which the secret name must be unique.
-
-
-
csi (CSIPersistentVolumeSource)
CSI represents storage that is handled by an external CSI driver (Beta feature).
Represents storage that is managed by an external CSI volume driver (Beta feature)
-
csi.driver (string), required
Driver is the name of the driver to use for this volume. Required.
-
csi.volumeHandle (string), required
VolumeHandle is the unique volume name returned by the CSI volume plugin’s CreateVolume to refer to the volume on all subsequent calls. Required.
-
csi.controllerExpandSecretRef (SecretReference)
ControllerExpandSecretRef is a reference to the secret object containing sensitive information to pass to the CSI driver to complete the CSI ControllerExpandVolume call. This is an alpha field and requires enabling ExpandCSIVolumes feature gate. This field is optional, and may be empty if no secret is required. If the secret object contains more than one secret, all secrets are passed.
SecretReference represents a Secret Reference. It has enough information to retrieve a secret in any namespace.
-
csi.controllerExpandSecretRef.name (string)
Name is unique within a namespace to reference a secret resource.
-
csi.controllerExpandSecretRef.namespace (string)
Namespace defines the space within which the secret name must be unique.
-
-
csi.controllerPublishSecretRef (SecretReference)
ControllerPublishSecretRef is a reference to the secret object containing sensitive information to pass to the CSI driver to complete the CSI ControllerPublishVolume and ControllerUnpublishVolume calls. This field is optional, and may be empty if no secret is required. If the secret object contains more than one secret, all secrets are passed.
SecretReference represents a Secret Reference. It has enough information to retrieve a secret in any namespace.
-
csi.controllerPublishSecretRef.name (string)
Name is unique within a namespace to reference a secret resource.
-
csi.controllerPublishSecretRef.namespace (string)
Namespace defines the space within which the secret name must be unique.
-
-
csi.fsType (string)
Filesystem type to mount. Must be a filesystem type supported by the host operating system. Ex. "ext4", "xfs", "ntfs".
-
csi.nodePublishSecretRef (SecretReference)
NodePublishSecretRef is a reference to the secret object containing sensitive information to pass to the CSI driver to complete the CSI NodePublishVolume and NodeUnpublishVolume calls. This field is optional, and may be empty if no secret is required. If the secret object contains more than one secret, all secrets are passed.
SecretReference represents a Secret Reference. It has enough information to retrieve a secret in any namespace.
-
csi.nodePublishSecretRef.name (string)
Name is unique within a namespace to reference a secret resource.
-
csi.nodePublishSecretRef.namespace (string)
Namespace defines the space within which the secret name must be unique.
-
-
csi.nodeStageSecretRef (SecretReference)
NodeStageSecretRef is a reference to the secret object containing sensitive information to pass to the CSI driver to complete the CSI NodeStageVolume and NodeUnstageVolume calls. This field is optional, and may be empty if no secret is required. If the secret object contains more than one secret, all secrets are passed.
SecretReference represents a Secret Reference. It has enough information to retrieve a secret in any namespace.
-
csi.nodeStageSecretRef.name (string)
Name is unique within a namespace to reference a secret resource.
-
csi.nodeStageSecretRef.namespace (string)
Namespace defines the space within which the secret name must be unique.
-
-
csi.readOnly (boolean)
Optional: The value to pass to ControllerPublishVolumeRequest. Defaults to false (read/write).
-
csi.volumeAttributes (map[string]string)
Attributes of the volume to publish.
-
-
fc (FCVolumeSource)
FC represents a Fibre Channel resource that is attached to a kubelet's host machine and then exposed to the pod.
Represents a Fibre Channel volume. Fibre Channel volumes can only be mounted as read/write once. Fibre Channel volumes support ownership management and SELinux relabeling.
-
fc.fsType (string)
Filesystem type to mount. Must be a filesystem type supported by the host operating system. Ex. "ext4", "xfs", "ntfs". Implicitly inferred to be "ext4" if unspecified.
-
fc.lun (int32)
Optional: FC target LUN number.
-
fc.readOnly (boolean)
Optional: Defaults to false (read/write). ReadOnly here will force the ReadOnly setting in VolumeMounts.
-
fc.targetWWNs ([]string)
Optional: FC target worldwide names (WWNs)
-
fc.wwids ([]string)
Optional: FC volume world wide identifiers (wwids). Either wwids or a combination of targetWWNs and lun must be set, but not both simultaneously.
-
-
flexVolume (FlexPersistentVolumeSource)
FlexVolume represents a generic volume resource that is provisioned/attached using an exec based plugin.
FlexPersistentVolumeSource represents a generic persistent volume resource that is provisioned/attached using an exec based plugin.
-
flexVolume.driver (string), required
Driver is the name of the driver to use for this volume.
-
flexVolume.fsType (string)
Filesystem type to mount. Must be a filesystem type supported by the host operating system. Ex. "ext4", "xfs", "ntfs". The default filesystem depends on FlexVolume script.
-
flexVolume.options (map[string]string)
Optional: Extra command options if any.
-
flexVolume.readOnly (boolean)
Optional: Defaults to false (read/write). ReadOnly here will force the ReadOnly setting in VolumeMounts.
-
flexVolume.secretRef (SecretReference)
Optional: SecretRef is a reference to the secret object containing sensitive information to pass to the plugin scripts. This may be empty if no secret object is specified. If the secret object contains more than one secret, all secrets are passed to the plugin scripts.
SecretReference represents a Secret Reference. It has enough information to retrieve a secret in any namespace.
-
flexVolume.secretRef.name (string)
Name is unique within a namespace to reference a secret resource.
-
flexVolume.secretRef.namespace (string)
Namespace defines the space within which the secret name must be unique.
-
-
-
flocker (FlockerVolumeSource)
Flocker represents a Flocker volume attached to a kubelet's host machine and exposed to the pod for its usage. This requires the Flocker control service to be running.
Represents a Flocker volume mounted by the Flocker agent. One and only one of datasetName and datasetUUID should be set. Flocker volumes do not support ownership management or SELinux relabeling.
-
flocker.datasetName (string)
Name of the dataset stored as metadata -> name on the dataset for Flocker; this should be considered deprecated.
-
flocker.datasetUUID (string)
UUID of the dataset. This is the unique identifier of a Flocker dataset.
-
-
gcePersistentDisk (GCEPersistentDiskVolumeSource)
GCEPersistentDisk represents a GCE Disk resource that is attached to a kubelet's host machine and then exposed to the pod. Provisioned by an admin. More info: https://kubernetes.io/docs/concepts/storage/volumes#gcepersistentdisk
Represents a Persistent Disk resource in Google Compute Engine.
A GCE PD must exist before mounting to a container. The disk must also be in the same GCE project and zone as the kubelet. A GCE PD can only be mounted as read/write once or read-only many times. GCE PDs support ownership management and SELinux relabeling.
-
gcePersistentDisk.pdName (string), required
Unique name of the PD resource in GCE. Used to identify the disk in GCE. More info: https://kubernetes.io/docs/concepts/storage/volumes#gcepersistentdisk
-
gcePersistentDisk.fsType (string)
Filesystem type of the volume that you want to mount. Tip: Ensure that the filesystem type is supported by the host operating system. Examples: "ext4", "xfs", "ntfs". Implicitly inferred to be "ext4" if unspecified. More info: https://kubernetes.io/docs/concepts/storage/volumes#gcepersistentdisk
-
gcePersistentDisk.partition (int32)
The partition in the volume that you want to mount. If omitted, the default is to mount by volume name. Examples: For volume /dev/sda1, you specify the partition as "1". Similarly, the volume partition for /dev/sda is "0" (or you can leave the property empty). More info: https://kubernetes.io/docs/concepts/storage/volumes#gcepersistentdisk
-
gcePersistentDisk.readOnly (boolean)
ReadOnly here will force the ReadOnly setting in VolumeMounts. Defaults to false. More info: https://kubernetes.io/docs/concepts/storage/volumes#gcepersistentdisk
-
-
glusterfs (GlusterfsPersistentVolumeSource)
Glusterfs represents a Glusterfs volume that is attached to a host and exposed to the pod. Provisioned by an admin. More info: https://examples.k8s.io/volumes/glusterfs/README.md
Represents a Glusterfs mount that lasts the lifetime of a pod. Glusterfs volumes do not support ownership management or SELinux relabeling.
-
glusterfs.endpoints (string), required
EndpointsName is the endpoint name that details Glusterfs topology. More info: https://examples.k8s.io/volumes/glusterfs/README.md#create-a-pod
-
glusterfs.path (string), required
Path is the Glusterfs volume path. More info: https://examples.k8s.io/volumes/glusterfs/README.md#create-a-pod
-
glusterfs.endpointsNamespace (string)
EndpointsNamespace is the namespace that contains Glusterfs endpoint. If this field is empty, the EndpointNamespace defaults to the same namespace as the bound PVC. More info: https://examples.k8s.io/volumes/glusterfs/README.md#create-a-pod
-
glusterfs.readOnly (boolean)
ReadOnly here will force the Glusterfs volume to be mounted with read-only permissions. Defaults to false. More info: https://examples.k8s.io/volumes/glusterfs/README.md#create-a-pod
-
-
iscsi (ISCSIPersistentVolumeSource)
ISCSI represents an ISCSI Disk resource that is attached to a kubelet's host machine and then exposed to the pod. Provisioned by an admin.
ISCSIPersistentVolumeSource represents an ISCSI disk. ISCSI volumes can only be mounted as read/write once. ISCSI volumes support ownership management and SELinux relabeling.
-
iscsi.iqn (string), required
Target iSCSI Qualified Name.
-
iscsi.lun (int32), required
iSCSI Target Lun number.
-
iscsi.targetPortal (string), required
iSCSI Target Portal. The Portal is either an IP or ip_addr:port if the port is other than default (typically TCP ports 860 and 3260).
-
iscsi.chapAuthDiscovery (boolean)
Whether to support iSCSI Discovery CHAP authentication.
-
iscsi.chapAuthSession (boolean)
Whether to support iSCSI Session CHAP authentication.
-
iscsi.fsType (string)
Filesystem type of the volume that you want to mount. Tip: Ensure that the filesystem type is supported by the host operating system. Examples: "ext4", "xfs", "ntfs". Implicitly inferred to be "ext4" if unspecified. More info: https://kubernetes.io/docs/concepts/storage/volumes#iscsi
-
iscsi.initiatorName (string)
Custom iSCSI Initiator Name. If initiatorName is specified with iscsiInterface simultaneously, new iSCSI interface <target portal>:<volume name> will be created for the connection.
-
iscsi.iscsiInterface (string)
iSCSI Interface Name that uses an iSCSI transport. Defaults to 'default' (tcp).
-
iscsi.portals ([]string)
iSCSI Target Portal List. The Portal is either an IP or ip_addr:port if the port is other than default (typically TCP ports 860 and 3260).
-
iscsi.readOnly (boolean)
ReadOnly here will force the ReadOnly setting in VolumeMounts. Defaults to false.
-
iscsi.secretRef (SecretReference)
CHAP Secret for iSCSI target and initiator authentication
SecretReference represents a Secret Reference. It has enough information to retrieve a secret in any namespace.
-
iscsi.secretRef.name (string)
Name is unique within a namespace to reference a secret resource.
-
iscsi.secretRef.namespace (string)
Namespace defines the space within which the secret name must be unique.
-
-
-
nfs (NFSVolumeSource)
NFS represents an NFS mount on the host. Provisioned by an admin. More info: https://kubernetes.io/docs/concepts/storage/volumes#nfs
Represents an NFS mount that lasts the lifetime of a pod. NFS volumes do not support ownership management or SELinux relabeling.
-
nfs.path (string), required
Path that is exported by the NFS server. More info: https://kubernetes.io/docs/concepts/storage/volumes#nfs
-
nfs.server (string), required
Server is the hostname or IP address of the NFS server. More info: https://kubernetes.io/docs/concepts/storage/volumes#nfs
-
nfs.readOnly (boolean)
ReadOnly here will force the NFS export to be mounted with read-only permissions. Defaults to false. More info: https://kubernetes.io/docs/concepts/storage/volumes#nfs
-
-
photonPersistentDisk (PhotonPersistentDiskVolumeSource)
PhotonPersistentDisk represents a PhotonController persistent disk attached and mounted on the kubelet's host machine.
Represents a Photon Controller persistent disk resource.
-
photonPersistentDisk.pdID (string), required
ID that identifies Photon Controller persistent disk
-
photonPersistentDisk.fsType (string)
Filesystem type to mount. Must be a filesystem type supported by the host operating system. Ex. "ext4", "xfs", "ntfs". Implicitly inferred to be "ext4" if unspecified.
-
-
portworxVolume (PortworxVolumeSource)
PortworxVolume represents a Portworx volume attached and mounted on the kubelet's host machine.
PortworxVolumeSource represents a Portworx volume resource.
-
portworxVolume.volumeID (string), required
VolumeID uniquely identifies a Portworx volume
-
portworxVolume.fsType (string)
FSType represents the filesystem type to mount. Must be a filesystem type supported by the host operating system. Ex. "ext4", "xfs". Implicitly inferred to be "ext4" if unspecified.
-
portworxVolume.readOnly (boolean)
Defaults to false (read/write). ReadOnly here will force the ReadOnly setting in VolumeMounts.
-
-
quobyte (QuobyteVolumeSource)
Quobyte represents a Quobyte mount on the host that shares a pod's lifetime
Represents a Quobyte mount that lasts the lifetime of a pod. Quobyte volumes do not support ownership management or SELinux relabeling.
-
quobyte.registry (string), required
Registry represents one or more Quobyte Registry services, specified as a string of host:port pairs (multiple entries separated by commas), which acts as the central registry for volumes.
-
quobyte.volume (string), required
Volume is a string that references an already created Quobyte volume by name.
-
quobyte.group (string)
Group to map volume access to. Default is no group.
-
quobyte.readOnly (boolean)
ReadOnly here will force the Quobyte volume to be mounted with read-only permissions. Defaults to false.
-
quobyte.tenant (string)
Tenant owning the given Quobyte volume in the backend. Used with dynamically provisioned Quobyte volumes; the value is set by the plugin.
-
quobyte.user (string)
User to map volume access to. Defaults to the service account user.
-
-
rbd (RBDPersistentVolumeSource)
RBD represents a Rados Block Device mount on the host that shares a pod's lifetime. More info: https://examples.k8s.io/volumes/rbd/README.md
Represents a Rados Block Device mount that lasts the lifetime of a pod. RBD volumes support ownership management and SELinux relabeling.
-
rbd.image (string), required
The rados image name. More info: https://examples.k8s.io/volumes/rbd/README.md#how-to-use-it
-
rbd.monitors ([]string), required
A collection of Ceph monitors. More info: https://examples.k8s.io/volumes/rbd/README.md#how-to-use-it
-
rbd.fsType (string)
Filesystem type of the volume that you want to mount. Tip: Ensure that the filesystem type is supported by the host operating system. Examples: "ext4", "xfs", "ntfs". Implicitly inferred to be "ext4" if unspecified. More info: https://kubernetes.io/docs/concepts/storage/volumes#rbd
-
rbd.keyring (string)
Keyring is the path to key ring for RBDUser. Default is /etc/ceph/keyring. More info: https://examples.k8s.io/volumes/rbd/README.md#how-to-use-it
-
rbd.pool (string)
The rados pool name. Default is rbd. More info: https://examples.k8s.io/volumes/rbd/README.md#how-to-use-it
-
rbd.readOnly (boolean)
ReadOnly here will force the ReadOnly setting in VolumeMounts. Defaults to false. More info: https://examples.k8s.io/volumes/rbd/README.md#how-to-use-it
-
rbd.secretRef (SecretReference)
SecretRef is the name of the authentication secret for RBDUser. If provided, it overrides keyring. Default is nil. More info: https://examples.k8s.io/volumes/rbd/README.md#how-to-use-it
SecretReference represents a Secret Reference. It has enough information to retrieve a secret in any namespace.
-
rbd.secretRef.name (string)
Name is unique within a namespace to reference a secret resource.
-
rbd.secretRef.namespace (string)
Namespace defines the space within which the secret name must be unique.
-
-
rbd.user (string)
The rados user name. Default is admin. More info: https://examples.k8s.io/volumes/rbd/README.md#how-to-use-it
-
-
scaleIO (ScaleIOPersistentVolumeSource)
ScaleIO represents a ScaleIO persistent volume attached and mounted on Kubernetes nodes.
ScaleIOPersistentVolumeSource represents a persistent ScaleIO volume
-
scaleIO.gateway (string), required
The host address of the ScaleIO API Gateway.
-
scaleIO.secretRef (SecretReference), required
SecretRef references the secret for the ScaleIO user and other sensitive information. If this is not provided, the Login operation will fail.
SecretReference represents a Secret Reference. It has enough information to retrieve a secret in any namespace.
-
scaleIO.secretRef.name (string)
Name is unique within a namespace to reference a secret resource.
-
scaleIO.secretRef.namespace (string)
Namespace defines the space within which the secret name must be unique.
-
-
scaleIO.system (string), required
The name of the storage system as configured in ScaleIO.
-
scaleIO.fsType (string)
Filesystem type to mount. Must be a filesystem type supported by the host operating system. Ex. "ext4", "xfs", "ntfs". Default is "xfs"
-
scaleIO.protectionDomain (string)
The name of the ScaleIO Protection Domain for the configured storage.
-
scaleIO.readOnly (boolean)
Defaults to false (read/write). ReadOnly here will force the ReadOnly setting in VolumeMounts.
-
scaleIO.sslEnabled (boolean)
Flag to enable/disable SSL communication with Gateway, default false
-
scaleIO.storageMode (string)
Indicates whether the storage for a volume should be ThickProvisioned or ThinProvisioned. Default is ThinProvisioned.
-
scaleIO.storagePool (string)
The ScaleIO Storage Pool associated with the protection domain.
-
scaleIO.volumeName (string)
The name of a volume already created in the ScaleIO system that is associated with this volume source.
-
-
storageos (StorageOSPersistentVolumeSource)
StorageOS represents a StorageOS volume that is attached to the kubelet's host machine and mounted into the pod More info: https://examples.k8s.io/volumes/storageos/README.md
Represents a StorageOS persistent volume resource.
-
storageos.fsType (string)
Filesystem type to mount. Must be a filesystem type supported by the host operating system. Ex. "ext4", "xfs", "ntfs". Implicitly inferred to be "ext4" if unspecified.
-
storageos.readOnly (boolean)
Defaults to false (read/write). ReadOnly here will force the ReadOnly setting in VolumeMounts.
-
storageos.secretRef (ObjectReference)
SecretRef specifies the secret to use for obtaining the StorageOS API credentials. If not specified, default values will be attempted.
-
storageos.volumeName (string)
VolumeName is the human-readable name of the StorageOS volume. Volume names are only unique within a namespace.
-
storageos.volumeNamespace (string)
VolumeNamespace specifies the scope of the volume within StorageOS. If no namespace is specified then the Pod's namespace will be used. This allows the Kubernetes name scoping to be mirrored within StorageOS for tighter integration. Set VolumeName to any name to override the default behaviour. Set to "default" if you are not using namespaces within StorageOS. Namespaces that do not pre-exist within StorageOS will be created.
-
-
vsphereVolume (VsphereVirtualDiskVolumeSource)
VsphereVolume represents a vSphere volume attached and mounted on the kubelet's host machine.
Represents a vSphere volume resource.
-
vsphereVolume.volumePath (string), required
Path that identifies vSphere volume vmdk
-
vsphereVolume.fsType (string)
Filesystem type to mount. Must be a filesystem type supported by the host operating system. Ex. "ext4", "xfs", "ntfs". Implicitly inferred to be "ext4" if unspecified.
-
vsphereVolume.storagePolicyID (string)
Storage Policy Based Management (SPBM) profile ID associated with the StoragePolicyName.
-
vsphereVolume.storagePolicyName (string)
Storage Policy Based Management (SPBM) profile name.
-
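To make the volume source fields above concrete, here is a minimal sketch of a PersistentVolume manifest using the csi source; the driver name, volume handle, and attribute key are hypothetical placeholders, not values taken from this reference:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: example-pv
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: example.csi.vendor.io        # hypothetical driver name
    volumeHandle: vol-0123456789abcdef   # hypothetical handle returned by CreateVolume
    fsType: ext4
    readOnly: false
    volumeAttributes:
      tier: fast                         # driver-specific attribute, illustrative only

Applying such a manifest creates the PersistentVolume through the create operation listed under Operations below.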
PersistentVolumeStatus
PersistentVolumeStatus is the current status of a persistent volume.
-
message (string)
A human-readable message indicating details about why the volume is in this state.
-
phase (string)
Phase indicates if a volume is available, bound to a claim, or released by a claim. More info: https://kubernetes.io/docs/concepts/storage/persistent-volumes#phase
Possible enum values:
- "Available": used for PersistentVolumes that are not yet bound. Available volumes are held by the binder and matched to PersistentVolumeClaims.
- "Bound": used for PersistentVolumes that are bound.
- "Failed": used for PersistentVolumes that failed to be correctly recycled or deleted after being released from a claim.
- "Pending": used for PersistentVolumes that are not available.
- "Released": used for PersistentVolumes where the bound PersistentVolumeClaim was deleted. Released volumes must be recycled before becoming available again; this phase is used by the persistent volume claim binder to signal to another process to reclaim the resource.
-
reason (string)
Reason is a brief CamelCase string that describes any failure and is meant for machine parsing and tidy display in the CLI.
PersistentVolumeList
PersistentVolumeList is a list of PersistentVolume items.
-
apiVersion: v1
-
kind: PersistentVolumeList
-
metadata (ListMeta)
Standard list metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds
-
items ([]PersistentVolume), required
List of persistent volumes. More info: https://kubernetes.io/docs/concepts/storage/persistent-volumes
Operations
get
read the specified PersistentVolume
HTTP Request
GET /api/v1/persistentvolumes/{name}
Parameters
-
name (in path): string, required
name of the PersistentVolume
-
pretty (in query): string
Response
200 (PersistentVolume): OK
401: Unauthorized
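In practice, a command such as kubectl get persistentvolume example-pv -o yaml issues exactly this GET request, with the persistent volume's name substituted into the path, and prints the returned PersistentVolume object.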
get
read status of the specified PersistentVolume
HTTP Request
GET /api/v1/persistentvolumes/{name}/status
Parameters
-
name (in path): string, required
name of the PersistentVolume
-
pretty (in query): string
Response
200 (PersistentVolume): OK
401: Unauthorized
list
list or watch objects of kind PersistentVolume
HTTP Request
GET /api/v1/persistentvolumes
Parameters
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (PersistentVolumeList): OK
401: Unauthorized
create
create a PersistentVolume
HTTP Request
POST /api/v1/persistentvolumes
Parameters
-
body: PersistentVolume, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (PersistentVolume): OK
201 (PersistentVolume): Created
202 (PersistentVolume): Accepted
401: Unauthorized
update
replace the specified PersistentVolume
HTTP Request
PUT /api/v1/persistentvolumes/{name}
Parameters
-
name (in path): string, required
name of the PersistentVolume
-
body: PersistentVolume, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (PersistentVolume): OK
201 (PersistentVolume): Created
401: Unauthorized
update
replace status of the specified PersistentVolume
HTTP Request
PUT /api/v1/persistentvolumes/{name}/status
Parameters
-
name (in path): string, required
name of the PersistentVolume
-
body: PersistentVolume, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (PersistentVolume): OK
201 (PersistentVolume): Created
401: Unauthorized
patch
partially update the specified PersistentVolume
HTTP Request
PATCH /api/v1/persistentvolumes/{name}
Parameters
-
name (in path): string, required
name of the PersistentVolume
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (PersistentVolume): OK
201 (PersistentVolume): Created
401: Unauthorized
patch
partially update status of the specified PersistentVolume
HTTP Request
PATCH /api/v1/persistentvolumes/{name}/status
Parameters
-
name (in path): string, required
name of the PersistentVolume
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (PersistentVolume): OK
201 (PersistentVolume): Created
401: Unauthorized
delete
delete a PersistentVolume
HTTP Request
DELETE /api/v1/persistentvolumes/{name}
Parameters
-
name (in path): string, required
name of the PersistentVolume
-
body: DeleteOptions
-
dryRun (in query): string
-
gracePeriodSeconds (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
Response
200 (PersistentVolume): OK
202 (PersistentVolume): Accepted
401: Unauthorized
deletecollection
delete collection of PersistentVolume
HTTP Request
DELETE /api/v1/persistentvolumes
Parameters
-
body: DeleteOptions
-
continue (in query): string
-
dryRun (in query): string
-
fieldSelector (in query): string
-
gracePeriodSeconds (in query): integer
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
Response
200 (Status): OK
401: Unauthorized
6.5.3.6 - StorageClass
apiVersion: storage.k8s.io/v1
import "k8s.io/api/storage/v1"
StorageClass
StorageClass describes the parameters for a class of storage for which PersistentVolumes can be dynamically provisioned.
StorageClasses are non-namespaced; the name of the storage class according to etcd is in ObjectMeta.Name.
-
apiVersion: storage.k8s.io/v1
-
kind: StorageClass
-
metadata (ObjectMeta)
Standard object's metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
provisioner (string), required
Provisioner indicates the type of the provisioner.
-
allowVolumeExpansion (boolean)
AllowVolumeExpansion shows whether the storage class allows volume expansion.
-
allowedTopologies ([]TopologySelectorTerm)
Atomic: will be replaced during a merge
Restrict the node topologies where volumes can be dynamically provisioned. Each volume plugin defines its own supported topology specifications. An empty TopologySelectorTerm list means there is no topology restriction. This field is only honored by servers that enable the VolumeScheduling feature.
A topology selector term represents the result of label queries. A null or empty topology selector term matches no objects. Their requirements are ANDed. It provides a subset of the functionality of NodeSelectorTerm. This is an alpha feature and may change in the future.
-
allowedTopologies.matchLabelExpressions ([]TopologySelectorLabelRequirement)
A list of topology selector requirements by labels.
A topology selector requirement is a selector that matches given label. This is an alpha feature and may change in the future.
-
allowedTopologies.matchLabelExpressions.key (string), required
The label key that the selector applies to.
-
allowedTopologies.matchLabelExpressions.values ([]string), required
An array of string values. One value must match the label to be selected. Each entry in Values is ORed.
-
-
-
mountOptions ([]string)
Dynamically provisioned PersistentVolumes of this storage class are created with these mountOptions, e.g. ["ro", "soft"]. Not validated - mount of the PVs will simply fail if one is invalid.
-
parameters (map[string]string)
Parameters holds the parameters for the provisioner that should create volumes of this storage class.
-
reclaimPolicy (string)
Dynamically provisioned PersistentVolumes of this storage class are created with this reclaimPolicy. Defaults to Delete.
-
volumeBindingMode (string)
VolumeBindingMode indicates how PersistentVolumeClaims should be provisioned and bound. When unset, VolumeBindingImmediate is used. This field is only honored by servers that enable the VolumeScheduling feature.
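As a sketch of how these fields combine, the following StorageClass manifest is illustrative; the provisioner name and the parameters key/value are hypothetical and depend entirely on the storage driver in use:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast
provisioner: example.com/fast-provisioner   # hypothetical provisioner name
parameters:
  type: ssd                                 # provisioner-specific, illustrative only
reclaimPolicy: Retain
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
mountOptions:
  - ro
allowedTopologies:
  - matchLabelExpressions:
      - key: topology.kubernetes.io/zone
        values:
          - us-east-1a

With volumeBindingMode set to WaitForFirstConsumer, provisioning and binding are delayed until a Pod using the claim is scheduled; leaving the field unset gives the VolumeBindingImmediate behavior described above.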
StorageClassList
StorageClassList is a collection of storage classes.
-
apiVersion: storage.k8s.io/v1
-
kind: StorageClassList
-
metadata (ListMeta)
Standard list metadata More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
items ([]StorageClass), required
Items is the list of StorageClasses
Operations
get
read the specified StorageClass
HTTP Request
GET /apis/storage.k8s.io/v1/storageclasses/{name}
Parameters
-
name (in path): string, required
name of the StorageClass
-
pretty (in query): string
Response
200 (StorageClass): OK
401: Unauthorized
list
list or watch objects of kind StorageClass
HTTP Request
GET /apis/storage.k8s.io/v1/storageclasses
Parameters
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (StorageClassList): OK
401: Unauthorized
create
create a StorageClass
HTTP Request
POST /apis/storage.k8s.io/v1/storageclasses
Parameters
-
body: StorageClass, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (StorageClass): OK
201 (StorageClass): Created
202 (StorageClass): Accepted
401: Unauthorized
update
replace the specified StorageClass
HTTP Request
PUT /apis/storage.k8s.io/v1/storageclasses/{name}
Parameters
-
name (in path): string, required
name of the StorageClass
-
body: StorageClass, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (StorageClass): OK
201 (StorageClass): Created
401: Unauthorized
patch
partially update the specified StorageClass
HTTP Request
PATCH /apis/storage.k8s.io/v1/storageclasses/{name}
Parameters
-
name (in path): string, required
name of the StorageClass
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (StorageClass): OK
201 (StorageClass): Created
401: Unauthorized
delete
delete a StorageClass
HTTP Request
DELETE /apis/storage.k8s.io/v1/storageclasses/{name}
Parameters
-
name (in path): string, required
name of the StorageClass
-
body: DeleteOptions
-
dryRun (in query): string
-
gracePeriodSeconds (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
Response
200 (StorageClass): OK
202 (StorageClass): Accepted
401: Unauthorized
deletecollection
delete collection of StorageClass
HTTP Request
DELETE /apis/storage.k8s.io/v1/storageclasses
Parameters
-
body: DeleteOptions
-
continue (in query): string
-
dryRun (in query): string
-
fieldSelector (in query): string
-
gracePeriodSeconds (in query): integer
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
Response
200 (Status): OK
401: Unauthorized
6.5.3.7 - VolumeAttachment
apiVersion: storage.k8s.io/v1
import "k8s.io/api/storage/v1"
VolumeAttachment
VolumeAttachment captures the intent to attach or detach the specified volume to/from the specified node.
VolumeAttachment objects are non-namespaced.
-
apiVersion: storage.k8s.io/v1
-
kind: VolumeAttachment
-
metadata (ObjectMeta)
Standard object metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
spec (VolumeAttachmentSpec), required
Specification of the desired attach/detach volume behavior. Populated by the Kubernetes system.
-
status (VolumeAttachmentStatus)
Status of the VolumeAttachment request. Populated by the entity completing the attach or detach operation, i.e. the external-attacher.
VolumeAttachmentSpec
VolumeAttachmentSpec is the specification of a VolumeAttachment request.
-
attacher (string), required
Attacher indicates the name of the volume driver that MUST handle this request. This is the name returned by GetPluginName().
-
nodeName (string), required
The node that the volume should be attached to.
-
source (VolumeAttachmentSource), required
Source represents the volume that should be attached.
VolumeAttachmentSource represents a volume that should be attached. Right now only PersistentVolumes can be attached via external attacher; in the future we may also allow inline volumes in pods. Exactly one member can be set.
-
source.inlineVolumeSpec (PersistentVolumeSpec)
inlineVolumeSpec contains all the information necessary to attach a persistent volume defined by a pod's inline VolumeSource. This field is populated only for the CSIMigration feature. It contains translated fields from a pod's inline VolumeSource to a PersistentVolumeSpec. This field is beta-level and is only honored by servers that enabled the CSIMigration feature.
-
source.persistentVolumeName (string)
Name of the persistent volume to attach.
-
VolumeAttachmentStatus
VolumeAttachmentStatus is the status of a VolumeAttachment request.
-
attached (boolean), required
Indicates the volume is successfully attached. This field must only be set by the entity completing the attach operation, i.e. the external-attacher.
-
attachError (VolumeError)
The last error encountered during attach operation, if any. This field must only be set by the entity completing the attach operation, i.e. the external-attacher.
VolumeError captures an error encountered during a volume operation.
-
attachError.message (string)
String detailing the error encountered during Attach or Detach operation. This string may be logged, so it should not contain sensitive information.
-
attachError.time (Time)
Time the error was encountered.
Time is a wrapper around time.Time which supports correct marshaling to YAML and JSON. Wrappers are provided for many of the factory methods that the time package offers.
-
-
attachmentMetadata (map[string]string)
Upon successful attach, this field is populated with any information returned by the attach operation that must be passed into subsequent WaitForAttach or Mount calls. This field must only be set by the entity completing the attach operation, i.e. the external-attacher.
-
detachError (VolumeError)
The last error encountered during detach operation, if any. This field must only be set by the entity completing the detach operation, i.e. the external-attacher.
VolumeError captures an error encountered during a volume operation.
-
detachError.message (string)
String detailing the error encountered during Attach or Detach operation. This string may be logged, so it should not contain sensitive information.
-
detachError.time (Time)
Time the error was encountered.
Time is a wrapper around time.Time which supports correct marshaling to YAML and JSON. Wrappers are provided for many of the factory methods that the time package offers.
-
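VolumeAttachment objects are normally created by the attach/detach controller rather than written by hand, but a sketch makes the spec fields concrete; the driver, node, and volume names below are hypothetical:

apiVersion: storage.k8s.io/v1
kind: VolumeAttachment
metadata:
  name: example-attachment
spec:
  attacher: example.csi.vendor.io   # hypothetical; must match the driver's GetPluginName()
  nodeName: node-1                  # node the volume should be attached to
  source:
    persistentVolumeName: example-pv

The status stanza, including attached, attachmentMetadata, and any attachError or detachError, is then filled in by the external-attacher as described above.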
VolumeAttachmentList
VolumeAttachmentList is a collection of VolumeAttachment objects.
-
apiVersion: storage.k8s.io/v1
-
kind: VolumeAttachmentList
-
metadata (ListMeta)
Standard list metadata More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
items ([]VolumeAttachment), required
Items is the list of VolumeAttachments
Operations
get
read the specified VolumeAttachment
HTTP Request
GET /apis/storage.k8s.io/v1/volumeattachments/{name}
Parameters
-
name (in path): string, required
name of the VolumeAttachment
-
pretty (in query): string
Response
200 (VolumeAttachment): OK
401: Unauthorized
get
read status of the specified VolumeAttachment
HTTP Request
GET /apis/storage.k8s.io/v1/volumeattachments/{name}/status
Parameters
-
name (in path): string, required
name of the VolumeAttachment
-
pretty (in query): string
Response
200 (VolumeAttachment): OK
401: Unauthorized
list
list or watch objects of kind VolumeAttachment
HTTP Request
GET /apis/storage.k8s.io/v1/volumeattachments
Parameters
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (VolumeAttachmentList): OK
401: Unauthorized
create
create a VolumeAttachment
HTTP Request
POST /apis/storage.k8s.io/v1/volumeattachments
Parameters
-
body: VolumeAttachment, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (VolumeAttachment): OK
201 (VolumeAttachment): Created
202 (VolumeAttachment): Accepted
401: Unauthorized
update
replace the specified VolumeAttachment
HTTP Request
PUT /apis/storage.k8s.io/v1/volumeattachments/{name}
Parameters
-
name (in path): string, required
name of the VolumeAttachment
-
body: VolumeAttachment, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (VolumeAttachment): OK
201 (VolumeAttachment): Created
401: Unauthorized
update
replace status of the specified VolumeAttachment
HTTP Request
PUT /apis/storage.k8s.io/v1/volumeattachments/{name}/status
Parameters
-
name (in path): string, required
name of the VolumeAttachment
-
body: VolumeAttachment, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (VolumeAttachment): OK
201 (VolumeAttachment): Created
401: Unauthorized
patch
partially update the specified VolumeAttachment
HTTP Request
PATCH /apis/storage.k8s.io/v1/volumeattachments/{name}
Parameters
-
name (in path): string, required
name of the VolumeAttachment
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (VolumeAttachment): OK
201 (VolumeAttachment): Created
401: Unauthorized
patch
partially update status of the specified VolumeAttachment
HTTP Request
PATCH /apis/storage.k8s.io/v1/volumeattachments/{name}/status
Parameters
-
name (in path): string, required
name of the VolumeAttachment
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (VolumeAttachment): OK
201 (VolumeAttachment): Created
401: Unauthorized
delete
delete a VolumeAttachment
HTTP Request
DELETE /apis/storage.k8s.io/v1/volumeattachments/{name}
Parameters
-
name (in path): string, required
name of the VolumeAttachment
-
body: DeleteOptions
-
dryRun (in query): string
-
gracePeriodSeconds (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
Response
200 (VolumeAttachment): OK
202 (VolumeAttachment): Accepted
401: Unauthorized
deletecollection
delete collection of VolumeAttachment
HTTP Request
DELETE /apis/storage.k8s.io/v1/volumeattachments
Parameters
-
body: DeleteOptions
-
continue (in query): string
-
dryRun (in query): string
-
fieldSelector (in query): string
-
gracePeriodSeconds (in query): integer
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
Response
200 (Status): OK
401: Unauthorized
6.5.3.8 - CSIDriver
apiVersion: storage.k8s.io/v1
import "k8s.io/api/storage/v1"
CSIDriver
CSIDriver captures information about a Container Storage Interface (CSI) volume driver deployed on the cluster. Kubernetes attach detach controller uses this object to determine whether attach is required. Kubelet uses this object to determine whether pod information needs to be passed on mount. CSIDriver objects are non-namespaced.
-
apiVersion: storage.k8s.io/v1
-
kind: CSIDriver
-
metadata (ObjectMeta)
Standard object metadata. metadata.Name indicates the name of the CSI driver that this object refers to; it MUST be the same name returned by the CSI GetPluginName() call for that driver. The driver name must be 63 characters or less, beginning and ending with an alphanumeric character ([a-z0-9A-Z]) with dashes (-), dots (.), and alphanumerics between. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
spec (CSIDriverSpec), required
Specification of the CSI Driver.
CSIDriverSpec
CSIDriverSpec is the specification of a CSIDriver.
-
attachRequired (boolean)
attachRequired indicates this CSI volume driver requires an attach operation (because it implements the CSI ControllerPublishVolume() method), and that the Kubernetes attach detach controller should call the attach volume interface which checks the volumeattachment status and waits until the volume is attached before proceeding to mounting. The CSI external-attacher coordinates with CSI volume driver and updates the volumeattachment status when the attach operation is complete. If the CSIDriverRegistry feature gate is enabled and the value is specified to false, the attach operation will be skipped. Otherwise the attach operation will be called.
This field is immutable.
-
fsGroupPolicy (string)
Defines if the underlying volume supports changing ownership and permission of the volume before being mounted. Refer to the specific FSGroupPolicy values for additional details.
This field is immutable.
Defaults to ReadWriteOnceWithFSType, which will examine each volume to determine if Kubernetes should modify ownership and permissions of the volume. With the default policy the defined fsGroup will only be applied if a fstype is defined and the volume's access mode contains ReadWriteOnce.
-
podInfoOnMount (boolean)
If set to true, podInfoOnMount indicates this CSI volume driver requires additional pod information (like podName, podUID, etc.) during mount operations. If set to false, pod information will not be passed on mount. Default is false. The CSI driver specifies podInfoOnMount as part of driver deployment. If true, Kubelet will pass pod information as VolumeContext in the CSI NodePublishVolume() calls. The CSI driver is responsible for parsing and validating the information passed in as VolumeContext. The following VolumeContext will be passed if podInfoOnMount is set to true. This list might grow, but the prefix will be used. "csi.storage.k8s.io/pod.name": pod.Name "csi.storage.k8s.io/pod.namespace": pod.Namespace "csi.storage.k8s.io/pod.uid": string(pod.UID) "csi.storage.k8s.io/ephemeral": "true" if the volume is an ephemeral inline volume defined by a CSIVolumeSource, otherwise "false"
"csi.storage.k8s.io/ephemeral" is a new feature in Kubernetes 1.16. It is only required for drivers which support both the "Persistent" and "Ephemeral" VolumeLifecycleMode. Other drivers can leave pod info disabled and/or ignore this field. As Kubernetes 1.15 doesn't support this field, drivers can only support one mode when deployed on such a cluster and the deployment determines which mode that is, for example via a command line parameter of the driver.
This field is immutable.
-
requiresRepublish (boolean)
RequiresRepublish indicates the CSI driver wants NodePublishVolume to be called periodically to reflect any possible change in the mounted volume. This field defaults to false. Note: After a successful initial NodePublishVolume call, subsequent calls to NodePublishVolume should only update the contents of the volume. New mount points will not be seen by a running container.
-
storageCapacity (boolean)
If set to true, storageCapacity indicates that the CSI volume driver wants pod scheduling to consider the storage capacity that the driver deployment will report by creating CSIStorageCapacity objects with capacity information.
The check can be enabled immediately when deploying a driver. In that case, provisioning new volumes with late binding will pause until the driver deployment has published some suitable CSIStorageCapacity object.
Alternatively, the driver can be deployed with the field unset or false and it can be flipped later when storage capacity information has been published.
This field was immutable in Kubernetes <= 1.22 and now is mutable.
This is a beta field and only available when the CSIStorageCapacity feature is enabled. The default is false.
-
tokenRequests ([]TokenRequest)
Atomic: will be replaced during a merge
TokenRequests indicates the CSI driver needs pods' service account tokens it is mounting volume for to do necessary authentication. Kubelet will pass the tokens in VolumeContext in the CSI NodePublishVolume calls. The CSI driver should parse and validate the following VolumeContext: "csi.storage.k8s.io/serviceAccount.tokens": { "<audience>": { "token": <token>, "expirationTimestamp": <expiration timestamp in RFC3339>, }, ... }
Note: Audience in each TokenRequest should be different and at most one token is empty string. To receive a new token after expiry, RequiresRepublish can be used to trigger NodePublishVolume periodically.
TokenRequest contains parameters of a service account token.
-
tokenRequests.audience (string), required
Audience is the intended audience of the token in "TokenRequestSpec". It will default to the audiences of kube apiserver.
-
tokenRequests.expirationSeconds (int64)
ExpirationSeconds is the duration of validity of the token in "TokenRequestSpec". It has the same default value of "ExpirationSeconds" in "TokenRequestSpec".
-
-
volumeLifecycleModes ([]string)
Set: unique values will be kept during a merge
volumeLifecycleModes defines what kind of volumes this CSI volume driver supports. The default if the list is empty is "Persistent", which is the usage defined by the CSI specification and implemented in Kubernetes via the usual PV/PVC mechanism. The other mode is "Ephemeral". In this mode, volumes are defined inline inside the pod spec with CSIVolumeSource and their lifecycle is tied to the lifecycle of that pod. A driver has to be aware of this because it is only going to get a NodePublishVolume call for such a volume. For more information about implementing this mode, see https://kubernetes-csi.github.io/docs/ephemeral-local-volumes.html A driver can support one or more of these modes and more modes may be added in the future. This field is beta.
This field is immutable.
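Tying the spec fields together, a sketch of a CSIDriver object for a driver that implements ControllerPublishVolume and wants pod information on mount might look like the following; the driver name is a hypothetical placeholder:

apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: example.csi.vendor.io   # must equal the name returned by GetPluginName()
spec:
  attachRequired: true          # driver implements ControllerPublishVolume()
  podInfoOnMount: true          # kubelet passes pod info as VolumeContext on NodePublishVolume
  fsGroupPolicy: ReadWriteOnceWithFSType
  storageCapacity: false
  volumeLifecycleModes:
    - Persistent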
CSIDriverList
CSIDriverList is a collection of CSIDriver objects.
-
apiVersion: storage.k8s.io/v1
-
kind: CSIDriverList
-
metadata (ListMeta)
Standard list metadata More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
items ([]CSIDriver), required
items is the list of CSIDriver
Operations
get
read the specified CSIDriver
HTTP Request
GET /apis/storage.k8s.io/v1/csidrivers/{name}
Parameters
-
name (in path): string, required
name of the CSIDriver
-
pretty (in query): string
Response
200 (CSIDriver): OK
401: Unauthorized
list
list or watch objects of kind CSIDriver
HTTP Request
GET /apis/storage.k8s.io/v1/csidrivers
Parameters
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (CSIDriverList): OK
401: Unauthorized
create
create a CSIDriver
HTTP Request
POST /apis/storage.k8s.io/v1/csidrivers
Parameters
-
body: CSIDriver, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (CSIDriver): OK
201 (CSIDriver): Created
202 (CSIDriver): Accepted
401: Unauthorized
update
replace the specified CSIDriver
HTTP Request
PUT /apis/storage.k8s.io/v1/csidrivers/{name}
Parameters
-
name (in path): string, required
name of the CSIDriver
-
body: CSIDriver, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (CSIDriver): OK
201 (CSIDriver): Created
401: Unauthorized
patch
partially update the specified CSIDriver
HTTP Request
PATCH /apis/storage.k8s.io/v1/csidrivers/{name}
Parameters
-
name (in path): string, required
name of the CSIDriver
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (CSIDriver): OK
201 (CSIDriver): Created
401: Unauthorized
delete
delete a CSIDriver
HTTP Request
DELETE /apis/storage.k8s.io/v1/csidrivers/{name}
Parameters
-
name (in path): string, required
name of the CSIDriver
-
body: DeleteOptions
-
dryRun (in query): string
-
gracePeriodSeconds (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
Response
200 (CSIDriver): OK
202 (CSIDriver): Accepted
401: Unauthorized
deletecollection
delete collection of CSIDriver
HTTP Request
DELETE /apis/storage.k8s.io/v1/csidrivers
Parameters
-
body: DeleteOptions
-
continue (in query): string
-
dryRun (in query): string
-
fieldSelector (in query): string
-
gracePeriodSeconds (in query): integer
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
Response
200 (Status): OK
401: Unauthorized
6.5.3.9 - CSINode
apiVersion: storage.k8s.io/v1
import "k8s.io/api/storage/v1"
CSINode
CSINode holds information about all CSI drivers installed on a node. CSI drivers do not need to create the CSINode object directly. As long as they use the node-driver-registrar sidecar container, the kubelet will automatically populate the CSINode object for the CSI driver as part of kubelet plugin registration. CSINode has the same name as a node. If the object is missing, it means either there are no CSI Drivers available on the node, or the Kubelet version is low enough that it doesn't create this object. CSINode has an OwnerReference that points to the corresponding node object.
-
apiVersion: storage.k8s.io/v1
-
kind: CSINode
-
metadata (ObjectMeta)
metadata.name must be the Kubernetes node name.
-
spec (CSINodeSpec), required
spec is the specification of CSINode
CSINodeSpec
CSINodeSpec holds information about the specification of all CSI drivers installed on a node
-
drivers ([]CSINodeDriver), required
Patch strategy: merge on key name
drivers is a list of information of all CSI Drivers existing on a node. If all drivers in the list are uninstalled, this can become empty.
CSINodeDriver holds information about the specification of one CSI driver installed on a node
-
drivers.name (string), required
This is the name of the CSI driver that this object refers to. This MUST be the same name returned by the CSI GetPluginName() call for that driver.
-
drivers.nodeID (string), required
nodeID of the node from the driver point of view. This field enables Kubernetes to communicate with storage systems that do not share the same nomenclature for nodes. For example, Kubernetes may refer to a given node as "node1", but the storage system may refer to the same node as "nodeA". When Kubernetes issues a command to the storage system to attach a volume to a specific node, it can use this field to refer to the node name using the ID that the storage system will understand, e.g. "nodeA" instead of "node1". This field is required.
-
drivers.allocatable (VolumeNodeResources)
allocatable represents the volume resources of a node that are available for scheduling. This field is beta.
VolumeNodeResources is a set of resource limits for scheduling of volumes.
-
drivers.allocatable.count (int32)
Maximum number of unique volumes managed by the CSI driver that can be used on a node. A volume that is both attached and mounted on a node is considered to be used once, not twice. The same rule applies for a unique volume that is shared among multiple pods on the same node. If this field is not specified, then the supported number of volumes on this node is unbounded.
-
-
drivers.topologyKeys ([]string)
topologyKeys is the list of keys supported by the driver. When a driver is initialized on a cluster, it provides a set of topology keys that it understands (e.g. "company.com/zone", "company.com/region"). When a driver is initialized on a node, it provides the same topology keys along with values. Kubelet will expose these topology keys as labels on its own node object. When Kubernetes does topology aware provisioning, it can use this list to determine which labels it should retrieve from the node object and pass back to the driver. It is possible for different nodes to use different topology keys. This can be empty if the driver does not support topology.
-
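Because the kubelet populates CSINode objects during plugin registration, you would not normally create one yourself; the following sketch only shows the shape of the object, with hypothetical driver and node identifiers:

apiVersion: storage.k8s.io/v1
kind: CSINode
metadata:
  name: node-1                        # must equal the Kubernetes node name
spec:
  drivers:
    - name: example.csi.vendor.io     # hypothetical; name returned by GetPluginName()
      nodeID: nodeA                   # the storage system's identifier for this node
      allocatable:
        count: 25                     # max unique volumes this driver can use on the node
      topologyKeys:
        - topology.kubernetes.io/zone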
CSINodeList
CSINodeList is a collection of CSINode objects.
-
apiVersion: storage.k8s.io/v1
-
kind: CSINodeList
-
metadata (ListMeta)
Standard list metadata More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
items ([]CSINode), required
items is the list of CSINode
Operations
get
read the specified CSINode
HTTP Request
GET /apis/storage.k8s.io/v1/csinodes/{name}
Parameters
-
name (in path): string, required
name of the CSINode
-
pretty (in query): string
Response
200 (CSINode): OK
401: Unauthorized
list
list or watch objects of kind CSINode
HTTP Request
GET /apis/storage.k8s.io/v1/csinodes
Parameters
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (CSINodeList): OK
401: Unauthorized
create
create a CSINode
HTTP Request
POST /apis/storage.k8s.io/v1/csinodes
Parameters
-
body: CSINode, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (CSINode): OK
201 (CSINode): Created
202 (CSINode): Accepted
401: Unauthorized
update
replace the specified CSINode
HTTP Request
PUT /apis/storage.k8s.io/v1/csinodes/{name}
Parameters
-
name (in path): string, required
name of the CSINode
-
body: CSINode, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (CSINode): OK
201 (CSINode): Created
401: Unauthorized
patch
partially update the specified CSINode
HTTP Request
PATCH /apis/storage.k8s.io/v1/csinodes/{name}
Parameters
-
name (in path): string, required
name of the CSINode
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (CSINode): OK
201 (CSINode): Created
401: Unauthorized
delete
delete a CSINode
HTTP Request
DELETE /apis/storage.k8s.io/v1/csinodes/{name}
Parameters
-
name (in path): string, required
name of the CSINode
-
body: DeleteOptions
-
dryRun (in query): string
-
gracePeriodSeconds (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
Response
200 (CSINode): OK
202 (CSINode): Accepted
401: Unauthorized
deletecollection
delete collection of CSINode
HTTP Request
DELETE /apis/storage.k8s.io/v1/csinodes
Parameters
-
body: DeleteOptions
-
continue (in query): string
-
dryRun (in query): string
-
fieldSelector (in query): string
-
gracePeriodSeconds (in query): integer
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
Response
200 (Status): OK
401: Unauthorized
6.5.3.10 - CSIStorageCapacity v1beta1
apiVersion: storage.k8s.io/v1beta1
import "k8s.io/api/storage/v1beta1"
CSIStorageCapacity
CSIStorageCapacity stores the result of one CSI GetCapacity call. For a given StorageClass, this describes the available capacity in a particular topology segment. This can be used when considering where to instantiate new PersistentVolumes.
For example this can express things like:
- StorageClass "standard" has "1234 GiB" available in "topology.kubernetes.io/zone=us-east1"
- StorageClass "localssd" has "10 GiB" available in "kubernetes.io/hostname=knode-abc123"
The following three cases all imply that no capacity is available for a certain combination:
- no object exists with suitable topology and storage class name
- such an object exists, but the capacity is unset
- such an object exists, but the capacity is zero
The producer of these objects can decide which approach is more suitable.
They are consumed by the kube-scheduler if the CSIStorageCapacity beta feature gate is enabled there and a CSI driver opts into capacity-aware scheduling with CSIDriver.StorageCapacity.
-
apiVersion: storage.k8s.io/v1beta1
-
kind: CSIStorageCapacity
-
metadata (ObjectMeta)
Standard object's metadata. The name has no particular meaning. It must be a DNS subdomain (dots allowed, 253 characters). To ensure that there are no conflicts with other CSI drivers on the cluster, the recommendation is to use csisc-<uuid>, a generated name, or a reverse-domain name which ends with the unique CSI driver name.
Objects are namespaced.
More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
storageClassName (string), required
The name of the StorageClass that the reported capacity applies to. It must meet the same requirements as the name of a StorageClass object (non-empty, DNS subdomain). If that object no longer exists, the CSIStorageCapacity object is obsolete and should be removed by its creator. This field is immutable.
-
capacity (Quantity)
Capacity is the value reported by the CSI driver in its GetCapacityResponse for a GetCapacityRequest with topology and parameters that match the previous fields.
The semantic is currently (CSI spec 1.2) defined as: The available capacity, in bytes, of the storage that can be used to provision volumes. If not set, that information is currently unavailable and treated like zero capacity.
-
maximumVolumeSize (Quantity)
MaximumVolumeSize is the value reported by the CSI driver in its GetCapacityResponse for a GetCapacityRequest with topology and parameters that match the previous fields.
This is defined since CSI spec 1.4.0 as the largest size that may be used in a CreateVolumeRequest.capacity_range.required_bytes field to create a volume with the same parameters as those in GetCapacityRequest. The corresponding value in the Kubernetes API is ResourceRequirements.Requests in a volume claim.
-
nodeTopology (LabelSelector)
NodeTopology defines which nodes have access to the storage for which capacity was reported. If not set, the storage is not accessible from any node in the cluster. If empty, the storage is accessible from all nodes. This field is immutable.
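As a sketch, a CSI driver's external provisioner might publish an object like the one below for the "standard" class in one zone; the object name, namespace, and quantities are illustrative:

```yaml
apiVersion: storage.k8s.io/v1beta1
kind: CSIStorageCapacity
metadata:
  name: csisc-3f2a9c1e              # hypothetical generated name
  namespace: kube-system            # objects are namespaced
storageClassName: standard
capacity: 1234Gi                    # available capacity from GetCapacityResponse
maximumVolumeSize: 512Gi            # largest single volume that can be created
nodeTopology:                       # which nodes have access to this storage
  matchLabels:
    topology.kubernetes.io/zone: us-east1
```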
CSIStorageCapacityList
CSIStorageCapacityList is a collection of CSIStorageCapacity objects.
-
apiVersion: storage.k8s.io/v1beta1
-
kind: CSIStorageCapacityList
-
metadata (ListMeta)
Standard list metadata More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
items ([]CSIStorageCapacity), required
Map: unique values on key name will be kept during a merge
Items is the list of CSIStorageCapacity objects.
Operations
get
read the specified CSIStorageCapacity
HTTP Request
GET /apis/storage.k8s.io/v1beta1/namespaces/{namespace}/csistoragecapacities/{name}
Parameters
-
name (in path): string, required
name of the CSIStorageCapacity
-
namespace (in path): string, required
-
pretty (in query): string
Response
200 (CSIStorageCapacity): OK
401: Unauthorized
list
list or watch objects of kind CSIStorageCapacity
HTTP Request
GET /apis/storage.k8s.io/v1beta1/namespaces/{namespace}/csistoragecapacities
Parameters
-
namespace (in path): string, required
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (CSIStorageCapacityList): OK
401: Unauthorized
list
list or watch objects of kind CSIStorageCapacity
HTTP Request
GET /apis/storage.k8s.io/v1beta1/csistoragecapacities
Parameters
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (CSIStorageCapacityList): OK
401: Unauthorized
create
create a CSIStorageCapacity
HTTP Request
POST /apis/storage.k8s.io/v1beta1/namespaces/{namespace}/csistoragecapacities
Parameters
-
namespace (in path): string, required
-
body: CSIStorageCapacity, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (CSIStorageCapacity): OK
201 (CSIStorageCapacity): Created
202 (CSIStorageCapacity): Accepted
401: Unauthorized
update
replace the specified CSIStorageCapacity
HTTP Request
PUT /apis/storage.k8s.io/v1beta1/namespaces/{namespace}/csistoragecapacities/{name}
Parameters
-
name (in path): string, required
name of the CSIStorageCapacity
-
namespace (in path): string, required
-
body: CSIStorageCapacity, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (CSIStorageCapacity): OK
201 (CSIStorageCapacity): Created
401: Unauthorized
patch
partially update the specified CSIStorageCapacity
HTTP Request
PATCH /apis/storage.k8s.io/v1beta1/namespaces/{namespace}/csistoragecapacities/{name}
Parameters
-
name (in path): string, required
name of the CSIStorageCapacity
-
namespace (in path): string, required
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (CSIStorageCapacity): OK
201 (CSIStorageCapacity): Created
401: Unauthorized
delete
delete a CSIStorageCapacity
HTTP Request
DELETE /apis/storage.k8s.io/v1beta1/namespaces/{namespace}/csistoragecapacities/{name}
Parameters
-
name (in path): string, required
name of the CSIStorageCapacity
-
namespace (in path): string, required
-
body: DeleteOptions
-
dryRun (in query): string
-
gracePeriodSeconds (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
Response
200 (Status): OK
202 (Status): Accepted
401: Unauthorized
deletecollection
delete collection of CSIStorageCapacity
HTTP Request
DELETE /apis/storage.k8s.io/v1beta1/namespaces/{namespace}/csistoragecapacities
Parameters
-
namespace (in path): string, required
-
body: DeleteOptions
-
continue (in query): string
-
dryRun (in query): string
-
fieldSelector (in query): string
-
gracePeriodSeconds (in query): integer
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
Response
200 (Status): OK
401: Unauthorized
6.5.4 - Authentication Resources
6.5.4.1 - ServiceAccount
apiVersion: v1
import "k8s.io/api/core/v1"
ServiceAccount
ServiceAccount binds together:
- a name, understood by users, and perhaps by peripheral systems, for an identity
- a principal that can be authenticated and authorized
- a set of secrets
-
apiVersion: v1
-
kind: ServiceAccount
-
metadata (ObjectMeta)
Standard object's metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
automountServiceAccountToken (boolean)
AutomountServiceAccountToken indicates whether pods running as this service account should have an API token automatically mounted. Can be overridden at the pod level.
-
imagePullSecrets ([]LocalObjectReference)
ImagePullSecrets is a list of references to secrets in the same namespace to use for pulling any images in pods that reference this ServiceAccount. ImagePullSecrets are distinct from Secrets because Secrets can be mounted in the pod, but ImagePullSecrets are only accessed by the kubelet. More info: https://kubernetes.io/docs/concepts/containers/images/#specifying-imagepullsecrets-on-a-pod
-
secrets ([]ObjectReference)
Patch strategy: merge on key name
Secrets is the list of secrets allowed to be used by pods running using this ServiceAccount. More info: https://kubernetes.io/docs/concepts/configuration/secret
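A minimal sketch combining these fields; the account and Secret names are placeholders:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: build-robot                 # hypothetical name
  namespace: default
automountServiceAccountToken: false # pods using this account must opt in to token mounting
imagePullSecrets:
- name: registry-credentials        # hypothetical Secret in the same namespace
```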
ServiceAccountList
ServiceAccountList is a list of ServiceAccount objects
-
apiVersion: v1
-
kind: ServiceAccountList
-
metadata (ListMeta)
Standard list metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds
-
items ([]ServiceAccount), required
List of ServiceAccounts. More info: https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/
Operations
get
read the specified ServiceAccount
HTTP Request
GET /api/v1/namespaces/{namespace}/serviceaccounts/{name}
Parameters
-
name (in path): string, required
name of the ServiceAccount
-
namespace (in path): string, required
-
pretty (in query): string
Response
200 (ServiceAccount): OK
401: Unauthorized
list
list or watch objects of kind ServiceAccount
HTTP Request
GET /api/v1/namespaces/{namespace}/serviceaccounts
Parameters
-
namespace (in path): string, required
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (ServiceAccountList): OK
401: Unauthorized
list
list or watch objects of kind ServiceAccount
HTTP Request
GET /api/v1/serviceaccounts
Parameters
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (ServiceAccountList): OK
401: Unauthorized
create
create a ServiceAccount
HTTP Request
POST /api/v1/namespaces/{namespace}/serviceaccounts
Parameters
-
namespace (in path): string, required
-
body: ServiceAccount, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (ServiceAccount): OK
201 (ServiceAccount): Created
202 (ServiceAccount): Accepted
401: Unauthorized
update
replace the specified ServiceAccount
HTTP Request
PUT /api/v1/namespaces/{namespace}/serviceaccounts/{name}
Parameters
-
name (in path): string, required
name of the ServiceAccount
-
namespace (in path): string, required
-
body: ServiceAccount, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (ServiceAccount): OK
201 (ServiceAccount): Created
401: Unauthorized
patch
partially update the specified ServiceAccount
HTTP Request
PATCH /api/v1/namespaces/{namespace}/serviceaccounts/{name}
Parameters
-
name (in path): string, required
name of the ServiceAccount
-
namespace (in path): string, required
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (ServiceAccount): OK
201 (ServiceAccount): Created
401: Unauthorized
delete
delete a ServiceAccount
HTTP Request
DELETE /api/v1/namespaces/{namespace}/serviceaccounts/{name}
Parameters
-
name (in path): string, required
name of the ServiceAccount
-
namespace (in path): string, required
-
body: DeleteOptions
-
dryRun (in query): string
-
gracePeriodSeconds (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
Response
200 (ServiceAccount): OK
202 (ServiceAccount): Accepted
401: Unauthorized
deletecollection
delete collection of ServiceAccount
HTTP Request
DELETE /api/v1/namespaces/{namespace}/serviceaccounts
Parameters
-
namespace (in path): string, required
-
body: DeleteOptions
-
continue (in query): string
-
dryRun (in query): string
-
fieldSelector (in query): string
-
gracePeriodSeconds (in query): integer
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
Response
200 (Status): OK
401: Unauthorized
6.5.4.2 - TokenRequest
apiVersion: authentication.k8s.io/v1
import "k8s.io/api/authentication/v1"
TokenRequest
TokenRequest requests a token for a given service account.
-
apiVersion: authentication.k8s.io/v1
-
kind: TokenRequest
-
metadata (ObjectMeta)
Standard object's metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
spec (TokenRequestSpec), required
Spec holds information about the request being evaluated
-
status (TokenRequestStatus)
Status is filled in by the server and indicates whether the token can be authenticated.
TokenRequestSpec
TokenRequestSpec contains client provided parameters of a token request.
-
audiences ([]string), required
Audiences are the intended audiences of the token. A recipient of a token must identify itself with an identifier in the list of audiences of the token, and otherwise should reject the token. A token issued for multiple audiences may be used to authenticate against any of the audiences listed but implies a high degree of trust between the target audiences.
-
boundObjectRef (BoundObjectReference)
BoundObjectRef is a reference to an object that the token will be bound to. The token will only be valid for as long as the bound object exists. NOTE: The API server's TokenReview endpoint will validate the BoundObjectRef, but other audiences may not. Keep ExpirationSeconds small if you want prompt revocation.
BoundObjectReference is a reference to an object that a token is bound to.
-
boundObjectRef.apiVersion (string)
API version of the referent.
-
boundObjectRef.kind (string)
Kind of the referent. Valid kinds are 'Pod' and 'Secret'.
-
boundObjectRef.name (string)
Name of the referent.
-
boundObjectRef.uid (string)
UID of the referent.
-
-
expirationSeconds (int64)
ExpirationSeconds is the requested duration of validity of the request. The token issuer may return a token with a different validity duration so a client needs to check the 'expiration' field in a response.
TokenRequestStatus
TokenRequestStatus is the result of a token request.
-
expirationTimestamp (Time), required
ExpirationTimestamp is the time of expiration of the returned token.
Time is a wrapper around time.Time which supports correct marshaling to YAML and JSON. Wrappers are provided for many of the factory methods that the time package offers.
-
token (string), required
Token is the opaque bearer token.
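A TokenRequest is created against the token subresource of a ServiceAccount rather than as a standalone object (see the create operation below). As a sketch, a request body bound to a Pod could look like this; the audience and Pod name are placeholders:

```yaml
apiVersion: authentication.k8s.io/v1
kind: TokenRequest
spec:
  audiences:
  - https://kubernetes.default.svc  # hypothetical audience
  expirationSeconds: 3600           # the issuer may return a different validity duration
  boundObjectRef:
    apiVersion: v1
    kind: Pod                       # valid kinds are Pod and Secret
    name: my-pod                    # hypothetical Pod; the token stops being valid when it is deleted
```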
Operations
create
create token of a ServiceAccount
HTTP Request
POST /api/v1/namespaces/{namespace}/serviceaccounts/{name}/token
Parameters
-
name (in path): string, required
name of the TokenRequest
-
namespace (in path): string, required
-
body: TokenRequest, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (TokenRequest): OK
201 (TokenRequest): Created
202 (TokenRequest): Accepted
401: Unauthorized
6.5.4.3 - TokenReview
apiVersion: authentication.k8s.io/v1
import "k8s.io/api/authentication/v1"
TokenReview
TokenReview attempts to authenticate a token to a known user. Note: TokenReview requests may be cached by the webhook token authenticator plugin in the kube-apiserver.
-
apiVersion: authentication.k8s.io/v1
-
kind: TokenReview
-
metadata (ObjectMeta)
Standard object's metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
spec (TokenReviewSpec), required
Spec holds information about the request being evaluated
-
status (TokenReviewStatus)
Status is filled in by the server and indicates whether the request can be authenticated.
TokenReviewSpec
TokenReviewSpec is a description of the token authentication request.
-
audiences ([]string)
Audiences is a list of the identifiers that the resource server presented with the token identifies as. Audience-aware token authenticators will verify that the token was intended for at least one of the audiences in this list. If no audiences are provided, the audience will default to the audience of the Kubernetes apiserver.
-
token (string)
Token is the opaque bearer token.
TokenReviewStatus
TokenReviewStatus is the result of the token authentication request.
-
audiences ([]string)
Audiences are audience identifiers chosen by the authenticator that are compatible with both the TokenReview and token. An identifier is any identifier in the intersection of the TokenReviewSpec audiences and the token's audiences. A client of the TokenReview API that sets the spec.audiences field should validate that a compatible audience identifier is returned in the status.audiences field to ensure that the TokenReview server is audience aware. If a TokenReview returns an empty status.audience field where status.authenticated is "true", the token is valid against the audience of the Kubernetes API server.
-
authenticated (boolean)
Authenticated indicates that the token was associated with a known user.
-
error (string)
Error indicates that the token couldn't be checked
-
user (UserInfo)
User is the UserInfo associated with the provided token.
UserInfo holds the information about the user needed to implement the user.Info interface.
-
user.extra (map[string][]string)
Any additional information provided by the authenticator.
-
user.groups ([]string)
The names of groups this user is a part of.
-
user.uid (string)
A unique value that identifies this user across time. If this user is deleted and another user by the same name is added, they will have different UIDs.
-
user.username (string)
The name that uniquely identifies this user among all active users.
-
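As a sketch, a client (for example, an aggregated API server) can check a bearer token like this; the token value is a placeholder, and the server fills in status with the authentication result:

```yaml
apiVersion: authentication.k8s.io/v1
kind: TokenReview
spec:
  token: <opaque-bearer-token>      # placeholder for the token under review
  audiences:
  - https://kubernetes.default.svc  # hypothetical audience the resource server presented with the token
```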
Operations
create
create a TokenReview
HTTP Request
POST /apis/authentication.k8s.io/v1/tokenreviews
Parameters
-
body: TokenReview, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (TokenReview): OK
201 (TokenReview): Created
202 (TokenReview): Accepted
401: Unauthorized
6.5.4.4 - CertificateSigningRequest
apiVersion: certificates.k8s.io/v1
import "k8s.io/api/certificates/v1"
CertificateSigningRequest
CertificateSigningRequest objects provide a mechanism to obtain x509 certificates by submitting a certificate signing request, and having it asynchronously approved and issued.
Kubelets use this API to obtain:
- client certificates to authenticate to kube-apiserver (with the "kubernetes.io/kube-apiserver-client-kubelet" signerName).
- serving certificates for TLS endpoints kube-apiserver can connect to securely (with the "kubernetes.io/kubelet-serving" signerName).
This API can be used to request client certificates to authenticate to kube-apiserver (with the "kubernetes.io/kube-apiserver-client" signerName), or to obtain certificates from custom non-Kubernetes signers.
-
apiVersion: certificates.k8s.io/v1
-
kind: CertificateSigningRequest
-
metadata (ObjectMeta)
-
spec (CertificateSigningRequestSpec), required
spec contains the certificate request, and is immutable after creation. Only the request, signerName, expirationSeconds, and usages fields can be set on creation. Other fields are derived by Kubernetes and cannot be modified by users.
-
status (CertificateSigningRequestStatus)
status contains information about whether the request is approved or denied, and the certificate issued by the signer, or the failure condition indicating signer failure.
CertificateSigningRequestSpec
CertificateSigningRequestSpec contains the certificate request.
-
request ([]byte), required
Atomic: will be replaced during a merge
request contains an x509 certificate signing request encoded in a "CERTIFICATE REQUEST" PEM block. When serialized as JSON or YAML, the data is additionally base64-encoded.
-
signerName (string), required
signerName indicates the requested signer, and is a qualified name.
List/watch requests for CertificateSigningRequests can filter on this field using a "spec.signerName=NAME" fieldSelector.
Well-known Kubernetes signers are:
- "kubernetes.io/kube-apiserver-client": issues client certificates that can be used to authenticate to kube-apiserver. Requests for this signer are never auto-approved by kube-controller-manager, can be issued by the "csrsigning" controller in kube-controller-manager.
- "kubernetes.io/kube-apiserver-client-kubelet": issues client certificates that kubelets use to authenticate to kube-apiserver. Requests for this signer can be auto-approved by the "csrapproving" controller in kube-controller-manager, and can be issued by the "csrsigning" controller in kube-controller-manager.
- "kubernetes.io/kubelet-serving" issues serving certificates that kubelets use to serve TLS endpoints, which kube-apiserver can connect to securely. Requests for this signer are never auto-approved by kube-controller-manager, and can be issued by the "csrsigning" controller in kube-controller-manager.
More details are available at https://k8s.io/docs/reference/access-authn-authz/certificate-signing-requests/#kubernetes-signers
Custom signerNames can also be specified. The signer defines:
- Trust distribution: how trust (CA bundles) are distributed.
- Permitted subjects, and behavior when a disallowed subject is requested.
- Required, permitted, or forbidden x509 extensions in the request (including whether subjectAltNames are allowed, which types, restrictions on allowed values) and behavior when a disallowed extension is requested.
- Required, permitted, or forbidden key usages / extended key usages.
- Expiration/certificate lifetime: whether it is fixed by the signer, configurable by the admin.
- Whether or not requests for CA certificates are allowed.
-
expirationSeconds (int32)
expirationSeconds is the requested duration of validity of the issued certificate. The certificate signer may issue a certificate with a different validity duration so a client must check the delta between the notBefore and notAfter fields in the issued certificate to determine the actual duration.
The v1.22+ in-tree implementations of the well-known Kubernetes signers will honor this field as long as the requested duration is not greater than the maximum duration they will honor per the --cluster-signing-duration CLI flag to the Kubernetes controller manager.
Certificate signers may not honor this field for various reasons:
- Old signer that is unaware of the field (such as the in-tree implementations prior to v1.22)
- Signer whose configured maximum is shorter than the requested duration
- Signer whose configured minimum is longer than the requested duration
The minimum valid value for expirationSeconds is 600, i.e. 10 minutes.
As of v1.22, this field is beta and is controlled via the CSRDuration feature gate.
-
extra (map[string][]string)
extra contains extra attributes of the user that created the CertificateSigningRequest. Populated by the API server on creation and immutable.
-
groups ([]string)
Atomic: will be replaced during a merge
groups contains group membership of the user that created the CertificateSigningRequest. Populated by the API server on creation and immutable.
-
uid (string)
uid contains the uid of the user that created the CertificateSigningRequest. Populated by the API server on creation and immutable.
-
usages ([]string)
Atomic: will be replaced during a merge
usages specifies a set of key usages requested in the issued certificate.
Requests for TLS client certificates typically request: "digital signature", "key encipherment", "client auth".
Requests for TLS serving certificates typically request: "key encipherment", "digital signature", "server auth".
Valid values are: "signing", "digital signature", "content commitment", "key encipherment", "key agreement", "data encipherment", "cert sign", "crl sign", "encipher only", "decipher only", "any", "server auth", "client auth", "code signing", "email protection", "s/mime", "ipsec end system", "ipsec tunnel", "ipsec user", "timestamping", "ocsp signing", "microsoft sgc", "netscape sgc"
-
username (string)
username contains the name of the user that created the CertificateSigningRequest. Populated by the API server on creation and immutable.
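A sketch of a CertificateSigningRequest for a client certificate; the object name is a placeholder, and the request value stands in for a real base64-encoded PEM "CERTIFICATE REQUEST" block:

```yaml
apiVersion: certificates.k8s.io/v1
kind: CertificateSigningRequest
metadata:
  name: jane-client-cert            # hypothetical name
spec:
  request: LS0tLS1CRUdJTiBDRVJU...  # truncated placeholder for the base64-encoded CSR PEM
  signerName: kubernetes.io/kube-apiserver-client
  expirationSeconds: 86400          # one day, subject to the signer's configured maximum
  usages:
  - digital signature
  - key encipherment
  - client auth
```

Requests for this signer are never auto-approved, so an administrator would still have to approve it, for example with kubectl certificate approve jane-client-cert.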
CertificateSigningRequestStatus
CertificateSigningRequestStatus contains conditions used to indicate approved/denied/failed status of the request, and the issued certificate.
-
certificate ([]byte)
Atomic: will be replaced during a merge
certificate is populated with an issued certificate by the signer after an Approved condition is present. This field is set via the /status subresource. Once populated, this field is immutable.
If the certificate signing request is denied, a condition of type "Denied" is added and this field remains empty. If the signer cannot issue the certificate, a condition of type "Failed" is added and this field remains empty.
Validation requirements:
- certificate must contain one or more PEM blocks.
- All PEM blocks must have the "CERTIFICATE" label, contain no headers, and the encoded data must be a BER-encoded ASN.1 Certificate structure as described in section 4 of RFC5280.
- Non-PEM content may appear before or after the "CERTIFICATE" PEM blocks and is unvalidated, to allow for explanatory text as described in section 5.2 of RFC7468.
If more than one PEM block is present, and the definition of the requested spec.signerName does not indicate otherwise, the first block is the issued certificate, and subsequent blocks should be treated as intermediate certificates and presented in TLS handshakes.
The certificate is encoded in PEM format.
When serialized as JSON or YAML, the data is additionally base64-encoded, so it consists of:
base64( -----BEGIN CERTIFICATE----- ... -----END CERTIFICATE----- )
-
conditions ([]CertificateSigningRequestCondition)
Map: unique values on key type will be kept during a merge
conditions applied to the request. Known conditions are "Approved", "Denied", and "Failed".
CertificateSigningRequestCondition describes a condition of a CertificateSigningRequest object
-
conditions.status (string), required
status of the condition, one of True, False, Unknown. Approved, Denied, and Failed conditions may not be "False" or "Unknown".
-
conditions.type (string), required
type of the condition. Known conditions are "Approved", "Denied", and "Failed".
An "Approved" condition is added via the /approval subresource, indicating the request was approved and should be issued by the signer.
A "Denied" condition is added via the /approval subresource, indicating the request was denied and should not be issued by the signer.
A "Failed" condition is added via the /status subresource, indicating the signer failed to issue the certificate.
Approved and Denied conditions are mutually exclusive. Approved, Denied, and Failed conditions cannot be removed once added.
Only one condition of a given type is allowed.
Possible enum values:
"Approved"
Approved indicates the request was approved and should be issued by the signer."Denied"
Denied indicates the request was denied and should not be issued by the signer."Failed"
Failed indicates the signer failed to issue the certificate.
-
conditions.lastTransitionTime (Time)
lastTransitionTime is the time the condition last transitioned from one status to another. If unset, when a new condition type is added or an existing condition's status is changed, the server defaults this to the current time.
Time is a wrapper around time.Time which supports correct marshaling to YAML and JSON. Wrappers are provided for many of the factory methods that the time package offers.
-
conditions.lastUpdateTime (Time)
lastUpdateTime is the time of the last update to this condition
Time is a wrapper around time.Time which supports correct marshaling to YAML and JSON. Wrappers are provided for many of the factory methods that the time package offers.
-
conditions.message (string)
message contains a human readable message with details about the request state
-
conditions.reason (string)
reason indicates a brief reason for the request state
-
CertificateSigningRequestList
CertificateSigningRequestList is a collection of CertificateSigningRequest objects
-
apiVersion: certificates.k8s.io/v1
-
kind: CertificateSigningRequestList
-
metadata (ListMeta)
-
items ([]CertificateSigningRequest), required
items is a collection of CertificateSigningRequest objects
Operations
get
read the specified CertificateSigningRequest
HTTP Request
GET /apis/certificates.k8s.io/v1/certificatesigningrequests/{name}
Parameters
-
name (in path): string, required
name of the CertificateSigningRequest
-
pretty (in query): string
Response
200 (CertificateSigningRequest): OK
401: Unauthorized
get
read approval of the specified CertificateSigningRequest
HTTP Request
GET /apis/certificates.k8s.io/v1/certificatesigningrequests/{name}/approval
Parameters
-
name (in path): string, required
name of the CertificateSigningRequest
-
pretty (in query): string
Response
200 (CertificateSigningRequest): OK
401: Unauthorized
get
read status of the specified CertificateSigningRequest
HTTP Request
GET /apis/certificates.k8s.io/v1/certificatesigningrequests/{name}/status
Parameters
-
name (in path): string, required
name of the CertificateSigningRequest
-
pretty (in query): string
Response
200 (CertificateSigningRequest): OK
401: Unauthorized
list
list or watch objects of kind CertificateSigningRequest
HTTP Request
GET /apis/certificates.k8s.io/v1/certificatesigningrequests
Parameters
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (CertificateSigningRequestList): OK
401: Unauthorized
create
create a CertificateSigningRequest
HTTP Request
POST /apis/certificates.k8s.io/v1/certificatesigningrequests
Parameters
-
body: CertificateSigningRequest, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (CertificateSigningRequest): OK
201 (CertificateSigningRequest): Created
202 (CertificateSigningRequest): Accepted
401: Unauthorized
update
replace the specified CertificateSigningRequest
HTTP Request
PUT /apis/certificates.k8s.io/v1/certificatesigningrequests/{name}
Parameters
-
name (in path): string, required
name of the CertificateSigningRequest
-
body: CertificateSigningRequest, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (CertificateSigningRequest): OK
201 (CertificateSigningRequest): Created
401: Unauthorized
update
replace approval of the specified CertificateSigningRequest
HTTP Request
PUT /apis/certificates.k8s.io/v1/certificatesigningrequests/{name}/approval
Parameters
-
name (in path): string, required
name of the CertificateSigningRequest
-
body: CertificateSigningRequest, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (CertificateSigningRequest): OK
201 (CertificateSigningRequest): Created
401: Unauthorized
update
replace status of the specified CertificateSigningRequest
HTTP Request
PUT /apis/certificates.k8s.io/v1/certificatesigningrequests/{name}/status
Parameters
-
name (in path): string, required
name of the CertificateSigningRequest
-
body: CertificateSigningRequest, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (CertificateSigningRequest): OK
201 (CertificateSigningRequest): Created
401: Unauthorized
patch
partially update the specified CertificateSigningRequest
HTTP Request
PATCH /apis/certificates.k8s.io/v1/certificatesigningrequests/{name}
Parameters
-
name (in path): string, required
name of the CertificateSigningRequest
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (CertificateSigningRequest): OK
201 (CertificateSigningRequest): Created
401: Unauthorized
patch
partially update approval of the specified CertificateSigningRequest
HTTP Request
PATCH /apis/certificates.k8s.io/v1/certificatesigningrequests/{name}/approval
Parameters
-
name (in path): string, required
name of the CertificateSigningRequest
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (CertificateSigningRequest): OK
201 (CertificateSigningRequest): Created
401: Unauthorized
patch
partially update status of the specified CertificateSigningRequest
HTTP Request
PATCH /apis/certificates.k8s.io/v1/certificatesigningrequests/{name}/status
Parameters
-
name (in path): string, required
name of the CertificateSigningRequest
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (CertificateSigningRequest): OK
201 (CertificateSigningRequest): Created
401: Unauthorized
delete
delete a CertificateSigningRequest
HTTP Request
DELETE /apis/certificates.k8s.io/v1/certificatesigningrequests/{name}
Parameters
-
name (in path): string, required
name of the CertificateSigningRequest
-
body: DeleteOptions
-
dryRun (in query): string
-
gracePeriodSeconds (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
Response
200 (Status): OK
202 (Status): Accepted
401: Unauthorized
deletecollection
delete collection of CertificateSigningRequest
HTTP Request
DELETE /apis/certificates.k8s.io/v1/certificatesigningrequests
Parameters
-
body: DeleteOptions
-
continue (in query): string
-
dryRun (in query): string
-
fieldSelector (in query): string
-
gracePeriodSeconds (in query): integer
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
Response
200 (Status): OK
401: Unauthorized
6.5.5 - Authorization Resources
6.5.5.1 - LocalSubjectAccessReview
apiVersion: authorization.k8s.io/v1
import "k8s.io/api/authorization/v1"
LocalSubjectAccessReview
LocalSubjectAccessReview checks whether or not a user or group can perform an action in a given namespace. Having a namespace scoped resource makes it much easier to grant namespace scoped policy that includes permissions checking.
-
apiVersion: authorization.k8s.io/v1
-
kind: LocalSubjectAccessReview
-
metadata (ObjectMeta)
Standard list metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
spec (SubjectAccessReviewSpec), required
Spec holds information about the request being evaluated. spec.namespace must be equal to the namespace you made the request against. If empty, it is defaulted.
-
status (SubjectAccessReviewStatus)
Status is filled in by the server and indicates whether the request is allowed or not
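As a sketch, checking whether a hypothetical user "jane" can list ConfigMaps in the default namespace; note that spec.resourceAttributes.namespace must match the namespace the review is created in:

```yaml
apiVersion: authorization.k8s.io/v1
kind: LocalSubjectAccessReview
metadata:
  namespace: default
spec:
  user: jane                        # hypothetical user being tested
  resourceAttributes:
    namespace: default              # must equal the namespace of the request
    verb: list
    group: ""                       # core API group
    resource: configmaps
```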
Operations
create
create a LocalSubjectAccessReview
HTTP Request
POST /apis/authorization.k8s.io/v1/namespaces/{namespace}/localsubjectaccessreviews
Parameters
-
namespace (in path): string, required
-
body: LocalSubjectAccessReview, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (LocalSubjectAccessReview): OK
201 (LocalSubjectAccessReview): Created
202 (LocalSubjectAccessReview): Accepted
401: Unauthorized
6.5.5.2 - SelfSubjectAccessReview
apiVersion: authorization.k8s.io/v1
import "k8s.io/api/authorization/v1"
SelfSubjectAccessReview
SelfSubjectAccessReview checks whether or not the current user can perform an action. Not filling in a spec.namespace means "in all namespaces". Self is a special case, because users should always be able to check whether they can perform an action.
-
apiVersion: authorization.k8s.io/v1
-
kind: SelfSubjectAccessReview
-
metadata (ObjectMeta)
Standard list metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
spec (SelfSubjectAccessReviewSpec), required
Spec holds information about the request being evaluated. user and groups must be empty
-
status (SubjectAccessReviewStatus)
Status is filled in by the server and indicates whether the request is allowed or not
SelfSubjectAccessReviewSpec
SelfSubjectAccessReviewSpec is a description of the access request. Exactly one of ResourceAuthorizationAttributes and NonResourceAuthorizationAttributes must be set
-
nonResourceAttributes (NonResourceAttributes)
NonResourceAttributes describes information for a non-resource access request
NonResourceAttributes includes the authorization attributes available for non-resource requests to the Authorizer interface
-
nonResourceAttributes.path (string)
Path is the URL path of the request
-
nonResourceAttributes.verb (string)
Verb is the standard HTTP verb
-
-
resourceAttributes (ResourceAttributes)
ResourceAttributes describes information for a resource access request
ResourceAttributes includes the authorization attributes available for resource requests to the Authorizer interface
-
resourceAttributes.group (string)
Group is the API Group of the Resource. "*" means all.
-
resourceAttributes.name (string)
Name is the name of the resource being requested for a "get" or deleted for a "delete". "" (empty) means all.
-
resourceAttributes.namespace (string)
Namespace is the namespace of the action being requested. Currently, there is no distinction between no namespace and all namespaces. "" (empty) is defaulted for LocalSubjectAccessReviews; "" (empty) is empty for cluster-scoped resources; "" (empty) means "all" for namespace scoped resources from a SubjectAccessReview or SelfSubjectAccessReview.
-
resourceAttributes.resource (string)
Resource is one of the existing resource types. "*" means all.
-
resourceAttributes.subresource (string)
Subresource is one of the existing resource types. "" means none.
-
resourceAttributes.verb (string)
Verb is a kubernetes resource API verb, like: get, list, watch, create, update, delete, proxy. "*" means all.
-
resourceAttributes.version (string)
Version is the API Version of the Resource. "*" means all.
-
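A sketch of a review asking whether the current user may create Deployments in the default namespace:

```yaml
apiVersion: authorization.k8s.io/v1
kind: SelfSubjectAccessReview
spec:
  resourceAttributes:
    namespace: default
    verb: create
    group: apps
    resource: deployments
```

kubectl auth can-i create deployments --namespace default issues this kind of review and prints the answer from status.allowed.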
Operations
create
create a SelfSubjectAccessReview
HTTP Request
POST /apis/authorization.k8s.io/v1/selfsubjectaccessreviews
Parameters
-
body: SelfSubjectAccessReview, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (SelfSubjectAccessReview): OK
201 (SelfSubjectAccessReview): Created
202 (SelfSubjectAccessReview): Accepted
401: Unauthorized
6.5.5.3 - SelfSubjectRulesReview
apiVersion: authorization.k8s.io/v1
import "k8s.io/api/authorization/v1"
SelfSubjectRulesReview
SelfSubjectRulesReview enumerates the set of actions the current user can perform within a namespace. The returned list of actions may be incomplete depending on the server's authorization mode, and any errors experienced during the evaluation. SelfSubjectRulesReview should be used by UIs to show/hide actions, or to quickly let an end user reason about their permissions. It should NOT be used by external systems to drive authorization decisions as this raises confused deputy, cache lifetime/revocation, and correctness concerns. SubjectAccessReview and LocalSubjectAccessReview are the correct way to defer authorization decisions to the API server.
-
apiVersion: authorization.k8s.io/v1
-
kind: SelfSubjectRulesReview
-
metadata (ObjectMeta)
Standard list metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
spec (SelfSubjectRulesReviewSpec), required
Spec holds information about the request being evaluated.
-
status (SubjectRulesReviewStatus)
Status is filled in by the server and indicates the set of actions a user can perform.
SubjectRulesReviewStatus contains the result of a rules check. This check can be incomplete depending on the set of authorizers the server is configured with and any errors experienced during evaluation. Because authorization rules are additive, if a rule appears in a list it's safe to assume the subject has that permission, even if that list is incomplete.
-
status.incomplete (boolean), required
Incomplete is true when the rules returned by this call are incomplete. This is most commonly encountered when an authorizer, such as an external authorizer, doesn't support rules evaluation.
-
status.nonResourceRules ([]NonResourceRule), required
NonResourceRules is the list of actions the subject is allowed to perform on non-resources. The list ordering isn't significant, may contain duplicates, and possibly be incomplete.
NonResourceRule holds information that describes a rule for the non-resource
-
status.nonResourceRules.verbs ([]string), required
Verb is a list of kubernetes non-resource API verbs, like: get, post, put, delete, patch, head, options. "*" means all.
-
status.nonResourceRules.nonResourceURLs ([]string)
NonResourceURLs is a set of partial urls that a user should have access to. *s are allowed, but only as the full, final step in the path. "*" means all.
-
-
status.resourceRules ([]ResourceRule), required
ResourceRules is the list of actions the subject is allowed to perform on resources. The list ordering isn't significant, may contain duplicates, and possibly be incomplete.
ResourceRule is the list of actions the subject is allowed to perform on resources. The list ordering isn't significant, may contain duplicates, and possibly be incomplete.
-
status.resourceRules.verbs ([]string), required
Verb is a list of kubernetes resource API verbs, like: get, list, watch, create, update, delete, proxy. "*" means all.
-
status.resourceRules.apiGroups ([]string)
APIGroups is the name of the APIGroup that contains the resources. If multiple API groups are specified, any action requested against one of the enumerated resources in any API group will be allowed. "*" means all.
-
status.resourceRules.resourceNames ([]string)
ResourceNames is an optional white list of names that the rule applies to. An empty set means that everything is allowed. "*" means all.
-
status.resourceRules.resources ([]string)
Resources is a list of resources this rule applies to. "*" means all in the specified apiGroups. "*/foo" represents the subresource 'foo' for all resources in the specified apiGroups.
-
-
status.evaluationError (string)
EvaluationError can appear in combination with Rules. It indicates an error occurred during rule evaluation, such as an authorizer that doesn't support rule evaluation, and that ResourceRules and/or NonResourceRules may be incomplete.
-
SelfSubjectRulesReviewSpec
SelfSubjectRulesReviewSpec defines the specification for SelfSubjectRulesReview.
-
namespace (string)
Namespace to evaluate rules for. Required.
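A sketch of the spec; this is the review that kubectl auth can-i --list --namespace default performs on your behalf:

```yaml
apiVersion: authorization.k8s.io/v1
kind: SelfSubjectRulesReview
spec:
  namespace: default                # namespace whose rules are enumerated
```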
Operations
create
create a SelfSubjectRulesReview
HTTP Request
POST /apis/authorization.k8s.io/v1/selfsubjectrulesreviews
Parameters
-
body: SelfSubjectRulesReview, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (SelfSubjectRulesReview): OK
201 (SelfSubjectRulesReview): Created
202 (SelfSubjectRulesReview): Accepted
401: Unauthorized
6.5.5.4 - SubjectAccessReview
apiVersion: authorization.k8s.io/v1
import "k8s.io/api/authorization/v1"
SubjectAccessReview
SubjectAccessReview checks whether or not a user or group can perform an action.
-
apiVersion: authorization.k8s.io/v1
-
kind: SubjectAccessReview
-
metadata (ObjectMeta)
Standard list metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
spec (SubjectAccessReviewSpec), required
Spec holds information about the request being evaluated
-
status (SubjectAccessReviewStatus)
Status is filled in by the server and indicates whether the request is allowed or not
SubjectAccessReviewSpec
SubjectAccessReviewSpec is a description of the access request. Exactly one of ResourceAuthorizationAttributes and NonResourceAuthorizationAttributes must be set
-
extra (map[string][]string)
Extra corresponds to the user.Info.GetExtra() method from the authenticator. Since that is input to the authorizer it needs a reflection here.
-
groups ([]string)
Groups is the groups you're testing for.
-
nonResourceAttributes (NonResourceAttributes)
NonResourceAttributes describes information for a non-resource access request
NonResourceAttributes includes the authorization attributes available for non-resource requests to the Authorizer interface
-
nonResourceAttributes.path (string)
Path is the URL path of the request
-
nonResourceAttributes.verb (string)
Verb is the standard HTTP verb
-
-
resourceAttributes (ResourceAttributes)
ResourceAttributes describes information for a resource access request
ResourceAttributes includes the authorization attributes available for resource requests to the Authorizer interface
-
resourceAttributes.group (string)
Group is the API Group of the Resource. "*" means all.
-
resourceAttributes.name (string)
Name is the name of the resource being requested for a "get" or deleted for a "delete". "" (empty) means all.
-
resourceAttributes.namespace (string)
Namespace is the namespace of the action being requested. Currently, there is no distinction between no namespace and all namespaces. "" (empty) is defaulted for LocalSubjectAccessReviews; "" (empty) is empty for cluster-scoped resources; "" (empty) means "all" for namespace scoped resources from a SubjectAccessReview or SelfSubjectAccessReview.
-
resourceAttributes.resource (string)
Resource is one of the existing resource types. "*" means all.
-
resourceAttributes.subresource (string)
Subresource is one of the existing resource types. "" means none.
-
resourceAttributes.verb (string)
Verb is a kubernetes resource API verb, like: get, list, watch, create, update, delete, proxy. "*" means all.
-
resourceAttributes.version (string)
Version is the API Version of the Resource. "*" means all.
-
-
uid (string)
UID information about the requesting user.
-
user (string)
User is the user you're testing for. If you specify "User" but not "Groups", then it is interpreted as "What if User were not a member of any groups?".
SubjectAccessReviewStatus
SubjectAccessReviewStatus
-
allowed (boolean), required
Allowed is required. True if the action would be allowed, false otherwise.
-
denied (boolean)
Denied is optional. True if the action would be denied, otherwise false. If both allowed is false and denied is false, then the authorizer has no opinion on whether to authorize the action. Denied may not be true if Allowed is true.
-
evaluationError (string)
EvaluationError is an indication that some error occurred during the authorization check. It is entirely possible to get an error and still be able to determine the authorization status in spite of it. For instance, RBAC can be missing a role, but enough roles are still present and bound to reason about the request.
-
reason (string)
Reason is optional. It indicates why a request was allowed or denied.
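A sketch of a cluster-scoped review testing a hypothetical user and group against a resource request; the server answers in status.allowed (and optionally status.denied and status.reason):

```yaml
apiVersion: authorization.k8s.io/v1
kind: SubjectAccessReview
spec:
  user: jane                        # hypothetical user
  groups:
  - developers                      # hypothetical group membership to test with
  resourceAttributes:
    namespace: default
    verb: get
    group: ""                       # core API group
    resource: pods
```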
Operations
create
create a SubjectAccessReview
HTTP Request
POST /apis/authorization.k8s.io/v1/subjectaccessreviews
Parameters
-
body: SubjectAccessReview, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (SubjectAccessReview): OK
201 (SubjectAccessReview): Created
202 (SubjectAccessReview): Accepted
401: Unauthorized
6.5.5.5 - ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
import "k8s.io/api/rbac/v1"
ClusterRole
ClusterRole is a cluster level, logical grouping of PolicyRules that can be referenced as a unit by a RoleBinding or ClusterRoleBinding.
-
apiVersion: rbac.authorization.k8s.io/v1
-
kind: ClusterRole
-
metadata (ObjectMeta)
Standard object's metadata.
-
aggregationRule (AggregationRule)
AggregationRule is an optional field that describes how to build the Rules for this ClusterRole. If AggregationRule is set, then the Rules are controller managed and direct changes to Rules will be stomped by the controller.
AggregationRule describes how to locate ClusterRoles to aggregate into the ClusterRole
-
aggregationRule.clusterRoleSelectors ([]LabelSelector)
ClusterRoleSelectors holds a list of selectors which will be used to find ClusterRoles and create the rules. If any of the selectors match, then the ClusterRole's permissions will be added
-
-
rules ([]PolicyRule)
Rules holds all the PolicyRules for this ClusterRole
PolicyRule holds information that describes a policy rule, but does not contain information about who the rule applies to or which namespace the rule applies to.
-
rules.apiGroups ([]string)
APIGroups is the name of the APIGroup that contains the resources. If multiple API groups are specified, any action requested against one of the enumerated resources in any API group will be allowed.
-
rules.resources ([]string)
Resources is a list of resources this rule applies to. '*' represents all resources.
-
rules.verbs ([]string), required
Verbs is a list of Verbs that apply to ALL the ResourceKinds contained in this rule. '*' represents all verbs.
-
rules.resourceNames ([]string)
ResourceNames is an optional white list of names that the rule applies to. An empty set means that everything is allowed.
-
rules.nonResourceURLs ([]string)
NonResourceURLs is a set of partial urls that a user should have access to. *s are allowed, but only as the full, final step in the path. Since non-resource URLs are not namespaced, this field is only applicable for ClusterRoles referenced from a ClusterRoleBinding. Rules can either apply to API resources (such as "pods" or "secrets") or non-resource URL paths (such as "/api"), but not both.
-
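A sketch of a ClusterRole combining a resource rule with a non-resource rule; the role name is a placeholder:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: pod-reader-global           # hypothetical name
rules:
- apiGroups: [""]                   # core API group
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/healthz", "/healthz/*"]  # '*' only as the full, final step
  verbs: ["get"]
```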
ClusterRoleList
ClusterRoleList is a collection of ClusterRoles
-
apiVersion: rbac.authorization.k8s.io/v1
-
kind: ClusterRoleList
-
metadata (ListMeta)
Standard object's metadata.
-
items ([]ClusterRole), required
Items is a list of ClusterRoles
Operations
get
read the specified ClusterRole
HTTP Request
GET /apis/rbac.authorization.k8s.io/v1/clusterroles/{name}
Parameters
-
name (in path): string, required
name of the ClusterRole
-
pretty (in query): string
Response
200 (ClusterRole): OK
401: Unauthorized
list
list or watch objects of kind ClusterRole
HTTP Request
GET /apis/rbac.authorization.k8s.io/v1/clusterroles
Parameters
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (ClusterRoleList): OK
401: Unauthorized
create
create a ClusterRole
HTTP Request
POST /apis/rbac.authorization.k8s.io/v1/clusterroles
Parameters
-
body: ClusterRole, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (ClusterRole): OK
201 (ClusterRole): Created
202 (ClusterRole): Accepted
401: Unauthorized
update
replace the specified ClusterRole
HTTP Request
PUT /apis/rbac.authorization.k8s.io/v1/clusterroles/{name}
Parameters
-
name (in path): string, required
name of the ClusterRole
-
body: ClusterRole, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (ClusterRole): OK
201 (ClusterRole): Created
401: Unauthorized
patch
partially update the specified ClusterRole
HTTP Request
PATCH /apis/rbac.authorization.k8s.io/v1/clusterroles/{name}
Parameters
-
name (in path): string, required
name of the ClusterRole
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (ClusterRole): OK
201 (ClusterRole): Created
401: Unauthorized
delete
delete a ClusterRole
HTTP Request
DELETE /apis/rbac.authorization.k8s.io/v1/clusterroles/{name}
Parameters
-
name (in path): string, required
name of the ClusterRole
-
body: DeleteOptions
-
dryRun (in query): string
-
gracePeriodSeconds (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
Response
200 (Status): OK
202 (Status): Accepted
401: Unauthorized
deletecollection
delete collection of ClusterRole
HTTP Request
DELETE /apis/rbac.authorization.k8s.io/v1/clusterroles
Parameters
-
body: DeleteOptions
-
continue (in query): string
-
dryRun (in query): string
-
fieldSelector (in query): string
-
gracePeriodSeconds (in query): integer
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
Response
200 (Status): OK
401: Unauthorized
6.5.5.6 - ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
import "k8s.io/api/rbac/v1"
ClusterRoleBinding
ClusterRoleBinding references a ClusterRole, but does not contain it. It can reference a ClusterRole in the global namespace, and adds who information via Subject.
-
apiVersion: rbac.authorization.k8s.io/v1
-
kind: ClusterRoleBinding
-
metadata (ObjectMeta)
Standard object's metadata.
-
roleRef (RoleRef), required
RoleRef can only reference a ClusterRole in the global namespace. If the RoleRef cannot be resolved, the Authorizer must return an error.
RoleRef contains information that points to the role being used
-
roleRef.apiGroup (string), required
APIGroup is the group for the resource being referenced
-
roleRef.kind (string), required
Kind is the type of resource being referenced
-
roleRef.name (string), required
Name is the name of resource being referenced
-
-
subjects ([]Subject)
Subjects holds references to the objects the role applies to.
Subject contains a reference to the object or user identities a role binding applies to. This can either hold a direct API object reference, or a value for non-objects such as user and group names.
-
subjects.kind (string), required
Kind of object being referenced. Values defined by this API group are "User", "Group", and "ServiceAccount". If the Authorizer does not recognize the kind value, the Authorizer should report an error.
-
subjects.name (string), required
Name of the object being referenced.
-
subjects.apiGroup (string)
APIGroup holds the API group of the referenced subject. Defaults to "" for ServiceAccount subjects. Defaults to "rbac.authorization.k8s.io" for User and Group subjects.
-
subjects.namespace (string)
Namespace of the referenced object. If the object kind is non-namespaced, such as "User" or "Group", and this value is not empty, the Authorizer should report an error.
-
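As a sketch of how these fields fit together, a minimal ClusterRoleBinding; the subject and the referenced role (reusing the hypothetical ClusterRole sketched earlier) are illustrative only:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: example-read-secrets-global # hypothetical name
subjects:
- kind: Group
  name: manager                     # hypothetical group name
  apiGroup: rbac.authorization.k8s.io
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: example-secret-reader       # the hypothetical ClusterRole sketched earlier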
ClusterRoleBindingList
ClusterRoleBindingList is a collection of ClusterRoleBindings
-
apiVersion: rbac.authorization.k8s.io/v1
-
kind: ClusterRoleBindingList
-
metadata (ListMeta)
Standard object's metadata.
-
items ([]ClusterRoleBinding), required
Items is a list of ClusterRoleBindings
Operations
get
read the specified ClusterRoleBinding
HTTP Request
GET /apis/rbac.authorization.k8s.io/v1/clusterrolebindings/{name}
Parameters
-
name (in path): string, required
name of the ClusterRoleBinding
-
pretty (in query): string
Response
200 (ClusterRoleBinding): OK
401: Unauthorized
list
list or watch objects of kind ClusterRoleBinding
HTTP Request
GET /apis/rbac.authorization.k8s.io/v1/clusterrolebindings
Parameters
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (ClusterRoleBindingList): OK
401: Unauthorized
create
create a ClusterRoleBinding
HTTP Request
POST /apis/rbac.authorization.k8s.io/v1/clusterrolebindings
Parameters
-
body: ClusterRoleBinding, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (ClusterRoleBinding): OK
201 (ClusterRoleBinding): Created
202 (ClusterRoleBinding): Accepted
401: Unauthorized
update
replace the specified ClusterRoleBinding
HTTP Request
PUT /apis/rbac.authorization.k8s.io/v1/clusterrolebindings/{name}
Parameters
-
name (in path): string, required
name of the ClusterRoleBinding
-
body: ClusterRoleBinding, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (ClusterRoleBinding): OK
201 (ClusterRoleBinding): Created
401: Unauthorized
patch
partially update the specified ClusterRoleBinding
HTTP Request
PATCH /apis/rbac.authorization.k8s.io/v1/clusterrolebindings/{name}
Parameters
-
name (in path): string, required
name of the ClusterRoleBinding
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (ClusterRoleBinding): OK
201 (ClusterRoleBinding): Created
401: Unauthorized
delete
delete a ClusterRoleBinding
HTTP Request
DELETE /apis/rbac.authorization.k8s.io/v1/clusterrolebindings/{name}
Parameters
-
name (in path): string, required
name of the ClusterRoleBinding
-
body: DeleteOptions
-
dryRun (in query): string
-
gracePeriodSeconds (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
Response
200 (Status): OK
202 (Status): Accepted
401: Unauthorized
deletecollection
delete collection of ClusterRoleBinding
HTTP Request
DELETE /apis/rbac.authorization.k8s.io/v1/clusterrolebindings
Parameters
-
body: DeleteOptions
-
continue (in query): string
-
dryRun (in query): string
-
fieldSelector (in query): string
-
gracePeriodSeconds (in query): integer
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
Response
200 (Status): OK
401: Unauthorized
6.5.5.7 - Role
apiVersion: rbac.authorization.k8s.io/v1
import "k8s.io/api/rbac/v1"
Role
Role is a namespaced, logical grouping of PolicyRules that can be referenced as a unit by a RoleBinding.
-
apiVersion: rbac.authorization.k8s.io/v1
-
kind: Role
-
metadata (ObjectMeta)
Standard object's metadata.
-
rules ([]PolicyRule)
Rules holds all the PolicyRules for this Role
PolicyRule holds information that describes a policy rule, but does not contain information about who the rule applies to or which namespace the rule applies to.
-
rules.apiGroups ([]string)
APIGroups is the name of the APIGroup that contains the resources. If multiple API groups are specified, any action requested against one of the enumerated resources in any API group will be allowed.
-
rules.resources ([]string)
Resources is a list of resources this rule applies to. '*' represents all resources.
-
rules.verbs ([]string), required
Verbs is a list of Verbs that apply to ALL the ResourceKinds contained in this rule. '*' represents all verbs.
-
rules.resourceNames ([]string)
ResourceNames is an optional allowlist of names that the rule applies to. An empty set means that everything is allowed.
-
rules.nonResourceURLs ([]string)
NonResourceURLs is a set of partial URLs that a user should have access to. The wildcard '*' is allowed, but only as the full, final step in the path. Since non-resource URLs are not namespaced, this field is only applicable for ClusterRoles referenced from a ClusterRoleBinding. Rules can either apply to API resources (such as "pods" or "secrets") or non-resource URL paths (such as "/api"), but not both.
-
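For illustration, a hypothetical namespaced Role granting read access to Pods in the default namespace (the name is illustrative only):
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: default
  name: example-pod-reader          # hypothetical name
rules:
- apiGroups: [""]                   # "" indicates the core API group
  resources: ["pods"]
  verbs: ["get", "watch", "list"]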
RoleList
RoleList is a collection of Roles
-
apiVersion: rbac.authorization.k8s.io/v1
-
kind: RoleList
-
metadata (ListMeta)
Standard object's metadata.
-
items ([]Role), required
Items is a list of Roles
Operations
get
read the specified Role
HTTP Request
GET /apis/rbac.authorization.k8s.io/v1/namespaces/{namespace}/roles/{name}
Parameters
-
name (in path): string, required
name of the Role
-
namespace (in path): string, required
-
pretty (in query): string
Response
200 (Role): OK
401: Unauthorized
list
list or watch objects of kind Role
HTTP Request
GET /apis/rbac.authorization.k8s.io/v1/namespaces/{namespace}/roles
Parameters
-
namespace (in path): string, required
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (RoleList): OK
401: Unauthorized
list
list or watch objects of kind Role
HTTP Request
GET /apis/rbac.authorization.k8s.io/v1/roles
Parameters
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (RoleList): OK
401: Unauthorized
create
create a Role
HTTP Request
POST /apis/rbac.authorization.k8s.io/v1/namespaces/{namespace}/roles
Parameters
-
namespace (in path): string, required
-
body: Role, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (Role): OK
201 (Role): Created
202 (Role): Accepted
401: Unauthorized
update
replace the specified Role
HTTP Request
PUT /apis/rbac.authorization.k8s.io/v1/namespaces/{namespace}/roles/{name}
Parameters
-
name (in path): string, required
name of the Role
-
namespace (in path): string, required
-
body: Role, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (Role): OK
201 (Role): Created
401: Unauthorized
patch
partially update the specified Role
HTTP Request
PATCH /apis/rbac.authorization.k8s.io/v1/namespaces/{namespace}/roles/{name}
Parameters
-
name (in path): string, required
name of the Role
-
namespace (in path): string, required
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (Role): OK
201 (Role): Created
401: Unauthorized
delete
delete a Role
HTTP Request
DELETE /apis/rbac.authorization.k8s.io/v1/namespaces/{namespace}/roles/{name}
Parameters
-
name (in path): string, required
name of the Role
-
namespace (in path): string, required
-
body: DeleteOptions
-
dryRun (in query): string
-
gracePeriodSeconds (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
Response
200 (Status): OK
202 (Status): Accepted
401: Unauthorized
deletecollection
delete collection of Role
HTTP Request
DELETE /apis/rbac.authorization.k8s.io/v1/namespaces/{namespace}/roles
Parameters
-
namespace (in path): string, required
-
body: DeleteOptions
-
continue (in query): string
-
dryRun (in query): string
-
fieldSelector (in query): string
-
gracePeriodSeconds (in query): integer
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
Response
200 (Status): OK
401: Unauthorized
6.5.5.8 - RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
import "k8s.io/api/rbac/v1"
RoleBinding
RoleBinding references a role, but does not contain it. It can reference a Role in the same namespace or a ClusterRole in the global namespace. It adds who information via Subjects, and namespace information via the namespace in which it exists. RoleBindings in a given namespace only have effect in that namespace.
-
apiVersion: rbac.authorization.k8s.io/v1
-
kind: RoleBinding
-
metadata (ObjectMeta)
Standard object's metadata.
-
roleRef (RoleRef), required
RoleRef can reference a Role in the current namespace or a ClusterRole in the global namespace. If the RoleRef cannot be resolved, the Authorizer must return an error.
RoleRef contains information that points to the role being used
-
roleRef.apiGroup (string), required
APIGroup is the group for the resource being referenced
-
roleRef.kind (string), required
Kind is the type of resource being referenced
-
roleRef.name (string), required
Name is the name of resource being referenced
-
-
subjects ([]Subject)
Subjects holds references to the objects the role applies to.
Subject contains a reference to the object or user identities a role binding applies to. This can either hold a direct API object reference, or a value for non-objects such as user and group names.
-
subjects.kind (string), required
Kind of object being referenced. Values defined by this API group are "User", "Group", and "ServiceAccount". If the Authorizer does not recognize the kind value, the Authorizer should report an error.
-
subjects.name (string), required
Name of the object being referenced.
-
subjects.apiGroup (string)
APIGroup holds the API group of the referenced subject. Defaults to "" for ServiceAccount subjects. Defaults to "rbac.authorization.k8s.io" for User and Group subjects.
-
subjects.namespace (string)
Namespace of the referenced object. If the object kind is non-namespaced, such as "User" or "Group", and this value is not empty, the Authorizer should report an error.
-
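A minimal RoleBinding sketch that binds the hypothetical Role above to a hypothetical ServiceAccount; roleRef could equally name a ClusterRole, whose permissions would then apply only within this namespace:
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: example-read-pods           # hypothetical name
  namespace: default
subjects:
- kind: ServiceAccount
  name: example-sa                  # hypothetical ServiceAccount
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role                        # a ClusterRole could be referenced instead
  name: example-pod-reader          # the hypothetical Role sketched above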
RoleBindingList
RoleBindingList is a collection of RoleBindings
-
apiVersion: rbac.authorization.k8s.io/v1
-
kind: RoleBindingList
-
metadata (ListMeta)
Standard object's metadata.
-
items ([]RoleBinding), required
Items is a list of RoleBindings
Operations
get
read the specified RoleBinding
HTTP Request
GET /apis/rbac.authorization.k8s.io/v1/namespaces/{namespace}/rolebindings/{name}
Parameters
-
name (in path): string, required
name of the RoleBinding
-
namespace (in path): string, required
-
pretty (in query): string
Response
200 (RoleBinding): OK
401: Unauthorized
list
list or watch objects of kind RoleBinding
HTTP Request
GET /apis/rbac.authorization.k8s.io/v1/namespaces/{namespace}/rolebindings
Parameters
-
namespace (in path): string, required
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (RoleBindingList): OK
401: Unauthorized
list
list or watch objects of kind RoleBinding
HTTP Request
GET /apis/rbac.authorization.k8s.io/v1/rolebindings
Parameters
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (RoleBindingList): OK
401: Unauthorized
create
create a RoleBinding
HTTP Request
POST /apis/rbac.authorization.k8s.io/v1/namespaces/{namespace}/rolebindings
Parameters
-
namespace (in path): string, required
-
body: RoleBinding, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (RoleBinding): OK
201 (RoleBinding): Created
202 (RoleBinding): Accepted
401: Unauthorized
update
replace the specified RoleBinding
HTTP Request
PUT /apis/rbac.authorization.k8s.io/v1/namespaces/{namespace}/rolebindings/{name}
Parameters
-
name (in path): string, required
name of the RoleBinding
-
namespace (in path): string, required
-
body: RoleBinding, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (RoleBinding): OK
201 (RoleBinding): Created
401: Unauthorized
patch
partially update the specified RoleBinding
HTTP Request
PATCH /apis/rbac.authorization.k8s.io/v1/namespaces/{namespace}/rolebindings/{name}
Parameters
-
name (in path): string, required
name of the RoleBinding
-
namespace (in path): string, required
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (RoleBinding): OK
201 (RoleBinding): Created
401: Unauthorized
delete
delete a RoleBinding
HTTP Request
DELETE /apis/rbac.authorization.k8s.io/v1/namespaces/{namespace}/rolebindings/{name}
Parameters
-
name (in path): string, required
name of the RoleBinding
-
namespace (in path): string, required
-
body: DeleteOptions
-
dryRun (in query): string
-
gracePeriodSeconds (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
Response
200 (Status): OK
202 (Status): Accepted
401: Unauthorized
deletecollection
delete collection of RoleBinding
HTTP Request
DELETE /apis/rbac.authorization.k8s.io/v1/namespaces/{namespace}/rolebindings
Parameters
-
namespace (in path): string, required
-
body: DeleteOptions
-
continue (in query): string
-
dryRun (in query): string
-
fieldSelector (in query): string
-
gracePeriodSeconds (in query): integer
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
Response
200 (Status): OK
401: Unauthorized
6.5.6 - Policy Resources
6.5.6.1 - LimitRange
apiVersion: v1
import "k8s.io/api/core/v1"
LimitRange
LimitRange sets resource usage limits for each kind of resource in a Namespace.
-
apiVersion: v1
-
kind: LimitRange
-
metadata (ObjectMeta)
Standard object's metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
spec (LimitRangeSpec)
Spec defines the limits enforced. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#spec-and-status
LimitRangeSpec
LimitRangeSpec defines a min/max usage limit for resources that match on kind.
-
limits ([]LimitRangeItem), required
Limits is the list of LimitRangeItem objects that are enforced.
LimitRangeItem defines a min/max usage limit for any resource that matches on kind.
-
limits.type (string), required
Type of resource that this limit applies to.
Possible enum values:
"Container"
Limit that applies to all containers in a namespace"PersistentVolumeClaim"
Limit that applies to all persistent volume claims in a namespace"Pod"
Limit that applies to all pods in a namespace
-
limits.default (map[string]Quantity)
Default resource requirement limit value by resource name if resource limit is omitted.
-
limits.defaultRequest (map[string]Quantity)
DefaultRequest is the default resource requirement request value by resource name if resource request is omitted.
-
limits.max (map[string]Quantity)
Max usage constraints on this kind by resource name.
-
limits.maxLimitRequestRatio (map[string]Quantity)
If MaxLimitRequestRatio is specified, the named resource must have a request and limit that are both non-zero, where limit divided by request is less than or equal to the enumerated value; this represents the max burst for the named resource.
-
limits.min (map[string]Quantity)
Min usage constraints on this kind by resource name.
-
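A LimitRange sketch combining the fields above for the Container type; all names and quantities are illustrative, not prescribed values:
apiVersion: v1
kind: LimitRange
metadata:
  name: example-limits              # hypothetical name
  namespace: default
spec:
  limits:
  - type: Container
    default:                        # limit applied when a container omits its own
      cpu: 500m
      memory: 256Mi
    defaultRequest:                 # request applied when a container omits its own
      cpu: 250m
      memory: 128Mi
    max:
      cpu: "1"
      memory: 1Gi
    min:
      cpu: 100m
      memory: 64Mi
    maxLimitRequestRatio:
      cpu: "4"                      # limit may be at most 4x the request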
LimitRangeList
LimitRangeList is a list of LimitRange items.
-
apiVersion: v1
-
kind: LimitRangeList
-
metadata (ListMeta)
Standard list metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds
-
items ([]LimitRange), required
Items is a list of LimitRange objects. More info: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
Operations
get
read the specified LimitRange
HTTP Request
GET /api/v1/namespaces/{namespace}/limitranges/{name}
Parameters
-
name (in path): string, required
name of the LimitRange
-
namespace (in path): string, required
-
pretty (in query): string
Response
200 (LimitRange): OK
401: Unauthorized
list
list or watch objects of kind LimitRange
HTTP Request
GET /api/v1/namespaces/{namespace}/limitranges
Parameters
-
namespace (in path): string, required
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (LimitRangeList): OK
401: Unauthorized
list
list or watch objects of kind LimitRange
HTTP Request
GET /api/v1/limitranges
Parameters
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (LimitRangeList): OK
401: Unauthorized
create
create a LimitRange
HTTP Request
POST /api/v1/namespaces/{namespace}/limitranges
Parameters
-
namespace (in path): string, required
-
body: LimitRange, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (LimitRange): OK
201 (LimitRange): Created
202 (LimitRange): Accepted
401: Unauthorized
update
replace the specified LimitRange
HTTP Request
PUT /api/v1/namespaces/{namespace}/limitranges/{name}
Parameters
-
name (in path): string, required
name of the LimitRange
-
namespace (in path): string, required
-
body: LimitRange, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (LimitRange): OK
201 (LimitRange): Created
401: Unauthorized
patch
partially update the specified LimitRange
HTTP Request
PATCH /api/v1/namespaces/{namespace}/limitranges/{name}
Parameters
-
name (in path): string, required
name of the LimitRange
-
namespace (in path): string, required
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (LimitRange): OK
201 (LimitRange): Created
401: Unauthorized
delete
delete a LimitRange
HTTP Request
DELETE /api/v1/namespaces/{namespace}/limitranges/{name}
Parameters
-
name (in path): string, required
name of the LimitRange
-
namespace (in path): string, required
-
body: DeleteOptions
-
dryRun (in query): string
-
gracePeriodSeconds (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
Response
200 (Status): OK
202 (Status): Accepted
401: Unauthorized
deletecollection
delete collection of LimitRange
HTTP Request
DELETE /api/v1/namespaces/{namespace}/limitranges
Parameters
-
namespace (in path): string, required
-
body: DeleteOptions
-
continue (in query): string
-
dryRun (in query): string
-
fieldSelector (in query): string
-
gracePeriodSeconds (in query): integer
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
Response
200 (Status): OK
401: Unauthorized
6.5.6.2 - ResourceQuota
apiVersion: v1
import "k8s.io/api/core/v1"
ResourceQuota
ResourceQuota sets aggregate quota restrictions enforced per namespace
-
apiVersion: v1
-
kind: ResourceQuota
-
metadata (ObjectMeta)
Standard object's metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
spec (ResourceQuotaSpec)
Spec defines the desired quota. https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#spec-and-status
-
status (ResourceQuotaStatus)
Status defines the actual enforced quota and its current usage. https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#spec-and-status
ResourceQuotaSpec
ResourceQuotaSpec defines the desired hard limits to enforce for Quota.
-
hard (map[string]Quantity)
hard is the set of desired hard limits for each named resource. More info: https://kubernetes.io/docs/concepts/policy/resource-quotas/
-
scopeSelector (ScopeSelector)
scopeSelector is also a collection of filters, like scopes, that must match each object tracked by a quota, but expressed using ScopeSelectorOperator in combination with possible values. For a resource to match, both scopes AND scopeSelector (if specified in spec) must be matched.
A scope selector represents the AND of the selectors represented by the scoped-resource selector requirements.
-
scopeSelector.matchExpressions ([]ScopedResourceSelectorRequirement)
A list of scope selector requirements by scope of the resources.
A scoped-resource selector requirement is a selector that contains values, a scope name, and an operator that relates the scope name and values.
-
scopeSelector.matchExpressions.operator (string), required
Represents a scope's relationship to a set of values. Valid operators are In, NotIn, Exists, DoesNotExist.
Possible enum values:
"DoesNotExist"
"Exists"
"In"
"NotIn"
-
scopeSelector.matchExpressions.scopeName (string), required
The name of the scope that the selector applies to.
Possible enum values:
"BestEffort"
Match all pod objects that have best effort quality of service"CrossNamespacePodAffinity"
Match all pod objects that have cross-namespace pod (anti)affinity mentioned. This is a beta feature enabled by the PodAffinityNamespaceSelector feature flag."NotBestEffort"
Match all pod objects that do not have best effort quality of service"NotTerminating"
Match all pod objects where spec.activeDeadlineSeconds is nil"PriorityClass"
Match all pod objects that have priority class mentioned"Terminating"
Match all pod objects where spec.activeDeadlineSeconds >=0
-
scopeSelector.matchExpressions.values ([]string)
An array of string values. If the operator is In or NotIn, the values array must be non-empty. If the operator is Exists or DoesNotExist, the values array must be empty. This array is replaced during a strategic merge patch.
-
-
-
scopes ([]string)
A collection of filters that must match each object tracked by a quota. If not specified, the quota matches all objects.
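A ResourceQuota sketch showing hard limits plus a scopeSelector that restricts the quota to pods of a particular PriorityClass; the quota name, quantities, and PriorityClass value are hypothetical:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: example-quota               # hypothetical name
  namespace: default
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    pods: "10"
  scopeSelector:
    matchExpressions:
    - scopeName: PriorityClass
      operator: In
      values: ["high"]              # hypothetical PriorityClass name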
ResourceQuotaStatus
ResourceQuotaStatus defines the enforced hard limits and observed use.
-
hard (map[string]Quantity)
Hard is the set of enforced hard limits for each named resource. More info: https://kubernetes.io/docs/concepts/policy/resource-quotas/
-
used (map[string]Quantity)
Used is the current observed total usage of the resource in the namespace.
ResourceQuotaList
ResourceQuotaList is a list of ResourceQuota items.
-
apiVersion: v1
-
kind: ResourceQuotaList
-
metadata (ListMeta)
Standard list metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds
-
items ([]ResourceQuota), required
Items is a list of ResourceQuota objects. More info: https://kubernetes.io/docs/concepts/policy/resource-quotas/
Operations
get
read the specified ResourceQuota
HTTP Request
GET /api/v1/namespaces/{namespace}/resourcequotas/{name}
Parameters
-
name (in path): string, required
name of the ResourceQuota
-
namespace (in path): string, required
-
pretty (in query): string
Response
200 (ResourceQuota): OK
401: Unauthorized
get
read status of the specified ResourceQuota
HTTP Request
GET /api/v1/namespaces/{namespace}/resourcequotas/{name}/status
Parameters
-
name (in path): string, required
name of the ResourceQuota
-
namespace (in path): string, required
-
pretty (in query): string
Response
200 (ResourceQuota): OK
401: Unauthorized
list
list or watch objects of kind ResourceQuota
HTTP Request
GET /api/v1/namespaces/{namespace}/resourcequotas
Parameters
-
namespace (in path): string, required
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (ResourceQuotaList): OK
401: Unauthorized
list
list or watch objects of kind ResourceQuota
HTTP Request
GET /api/v1/resourcequotas
Parameters
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (ResourceQuotaList): OK
401: Unauthorized
create
create a ResourceQuota
HTTP Request
POST /api/v1/namespaces/{namespace}/resourcequotas
Parameters
-
namespace (in path): string, required
-
body: ResourceQuota, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (ResourceQuota): OK
201 (ResourceQuota): Created
202 (ResourceQuota): Accepted
401: Unauthorized
update
replace the specified ResourceQuota
HTTP Request
PUT /api/v1/namespaces/{namespace}/resourcequotas/{name}
Parameters
-
name (in path): string, required
name of the ResourceQuota
-
namespace (in path): string, required
-
body: ResourceQuota, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (ResourceQuota): OK
201 (ResourceQuota): Created
401: Unauthorized
update
replace status of the specified ResourceQuota
HTTP Request
PUT /api/v1/namespaces/{namespace}/resourcequotas/{name}/status
Parameters
-
name (in path): string, required
name of the ResourceQuota
-
namespace (in path): string, required
-
body: ResourceQuota, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (ResourceQuota): OK
201 (ResourceQuota): Created
401: Unauthorized
patch
partially update the specified ResourceQuota
HTTP Request
PATCH /api/v1/namespaces/{namespace}/resourcequotas/{name}
Parameters
-
name (in path): string, required
name of the ResourceQuota
-
namespace (in path): string, required
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (ResourceQuota): OK
201 (ResourceQuota): Created
401: Unauthorized
patch
partially update status of the specified ResourceQuota
HTTP Request
PATCH /api/v1/namespaces/{namespace}/resourcequotas/{name}/status
Parameters
-
name (in path): string, required
name of the ResourceQuota
-
namespace (in path): string, required
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (ResourceQuota): OK
201 (ResourceQuota): Created
401: Unauthorized
delete
delete a ResourceQuota
HTTP Request
DELETE /api/v1/namespaces/{namespace}/resourcequotas/{name}
Parameters
-
name (in path): string, required
name of the ResourceQuota
-
namespace (in path): string, required
-
body: DeleteOptions
-
dryRun (in query): string
-
gracePeriodSeconds (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
Response
200 (ResourceQuota): OK
202 (ResourceQuota): Accepted
401: Unauthorized
deletecollection
delete collection of ResourceQuota
HTTP Request
DELETE /api/v1/namespaces/{namespace}/resourcequotas
Parameters
-
namespace (in path): string, required
-
body: DeleteOptions
-
continue (in query): string
-
dryRun (in query): string
-
fieldSelector (in query): string
-
gracePeriodSeconds (in query): integer
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
Response
200 (Status): OK
401: Unauthorized
6.5.6.3 - NetworkPolicy
apiVersion: networking.k8s.io/v1
import "k8s.io/api/networking/v1"
NetworkPolicy
NetworkPolicy describes what network traffic is allowed for a set of Pods
-
apiVersion: networking.k8s.io/v1
-
kind: NetworkPolicy
-
metadata (ObjectMeta)
Standard object's metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
spec (NetworkPolicySpec)
Specification of the desired behavior for this NetworkPolicy.
NetworkPolicySpec
NetworkPolicySpec provides the specification of a NetworkPolicy
-
podSelector (LabelSelector), required
Selects the pods to which this NetworkPolicy object applies. The array of ingress rules is applied to any pods selected by this field. Multiple network policies can select the same set of pods. In this case, the ingress rules for each are combined additively. This field is NOT optional and follows standard label selector semantics. An empty podSelector matches all pods in this namespace.
-
policyTypes ([]string)
List of rule types that the NetworkPolicy relates to. Valid options are ["Ingress"], ["Egress"], or ["Ingress", "Egress"]. If this field is not specified, it will default based on the existence of Ingress or Egress rules; policies that contain an Egress section are assumed to affect Egress, and all policies (whether or not they contain an Ingress section) are assumed to affect Ingress. If you want to write an egress-only policy, you must explicitly specify policyTypes [ "Egress" ]. Likewise, if you want to write a policy that specifies that no egress is allowed, you must specify a policyTypes value that includes "Egress" (since such a policy would not include an Egress section and would otherwise default to just [ "Ingress" ]). This field is beta-level in 1.8.
-
ingress ([]NetworkPolicyIngressRule)
List of ingress rules to be applied to the selected pods. Traffic is allowed to a pod if there are no NetworkPolicies selecting the pod (and cluster policy otherwise allows the traffic), OR if the traffic source is the pod's local node, OR if the traffic matches at least one ingress rule across all of the NetworkPolicy objects whose podSelector matches the pod. If this field is empty then this NetworkPolicy does not allow any traffic (and serves solely to ensure that the pods it selects are isolated by default)
NetworkPolicyIngressRule describes a particular set of traffic that is allowed to the pods matched by a NetworkPolicySpec's podSelector. The traffic must match both ports and from.
-
ingress.from ([]NetworkPolicyPeer)
List of sources which should be able to access the pods selected for this rule. Items in this list are combined using a logical OR operation. If this field is empty or missing, this rule matches all sources (traffic not restricted by source). If this field is present and contains at least one item, this rule allows traffic only if the traffic matches at least one item in the from list.
NetworkPolicyPeer describes a peer to allow traffic to/from. Only certain combinations of fields are allowed
-
ingress.from.ipBlock (IPBlock)
IPBlock defines policy on a particular IPBlock. If this field is set then neither of the other fields can be.
IPBlock describes a particular CIDR (Ex. "192.168.1.1/24","2001:db9::/64") that is allowed to the pods matched by a NetworkPolicySpec's podSelector. The except entry describes CIDRs that should not be included within this rule.
-
ingress.from.ipBlock.cidr (string), required
CIDR is a string representing the IP block. Valid examples are "192.168.1.1/24" or "2001:db9::/64".
-
ingress.from.ipBlock.except ([]string)
Except is a slice of CIDRs that should not be included within an IP block. Valid examples are "192.168.1.1/24" or "2001:db9::/64". Except values will be rejected if they are outside the CIDR range.
-
-
ingress.from.namespaceSelector (LabelSelector)
Selects Namespaces using cluster-scoped labels. This field follows standard label selector semantics; if present but empty, it selects all namespaces.
If PodSelector is also set, then the NetworkPolicyPeer as a whole selects the Pods matching PodSelector in the Namespaces selected by NamespaceSelector. Otherwise it selects all Pods in the Namespaces selected by NamespaceSelector.
-
ingress.from.podSelector (LabelSelector)
This is a label selector which selects Pods. This field follows standard label selector semantics; if present but empty, it selects all pods.
If NamespaceSelector is also set, then the NetworkPolicyPeer as a whole selects the Pods matching PodSelector in the Namespaces selected by NamespaceSelector. Otherwise it selects the Pods matching PodSelector in the policy's own Namespace.
-
-
ingress.ports ([]NetworkPolicyPort)
List of ports which should be made accessible on the pods selected for this rule. Each item in this list is combined using a logical OR. If this field is empty or missing, this rule matches all ports (traffic not restricted by port). If this field is present and contains at least one item, then this rule allows traffic only if the traffic matches at least one port in the list.
NetworkPolicyPort describes a port to allow traffic on
-
ingress.ports.port (IntOrString)
The port on the given protocol. This can either be a numerical or named port on a pod. If this field is not provided, this matches all port names and numbers. If present, only traffic on the specified protocol AND port will be matched.
IntOrString is a type that can hold an int32 or a string. When used in JSON or YAML marshalling and unmarshalling, it produces or consumes the inner type. This allows you to have, for example, a JSON field that can accept a name or number.
-
ingress.ports.endPort (int32)
If set, indicates that the range of ports from port to endPort, inclusive, should be allowed by the policy. This field cannot be defined if the port field is not defined or if the port field is defined as a named (string) port. The endPort must be equal to or greater than port. This feature is in Beta state and is enabled by default. It can be disabled using the Feature Gate "NetworkPolicyEndPort".
-
ingress.ports.protocol (string)
The protocol (TCP, UDP, or SCTP) which traffic must match. If not specified, this field defaults to TCP.
-
-
-
egress ([]NetworkPolicyEgressRule)
List of egress rules to be applied to the selected pods. Outgoing traffic is allowed if there are no NetworkPolicies selecting the pod (and cluster policy otherwise allows the traffic), OR if the traffic matches at least one egress rule across all of the NetworkPolicy objects whose podSelector matches the pod. If this field is empty then this NetworkPolicy limits all outgoing traffic (and serves solely to ensure that the pods it selects are isolated by default). This field is beta-level in 1.8
NetworkPolicyEgressRule describes a particular set of traffic that is allowed out of pods matched by a NetworkPolicySpec's podSelector. The traffic must match both ports and to. This type is beta-level in 1.8
-
egress.to ([]NetworkPolicyPeer)
List of destinations for outgoing traffic of pods selected for this rule. Items in this list are combined using a logical OR operation. If this field is empty or missing, this rule matches all destinations (traffic not restricted by destination). If this field is present and contains at least one item, this rule allows traffic only if the traffic matches at least one item in the to list.
NetworkPolicyPeer describes a peer to allow traffic to/from. Only certain combinations of fields are allowed
-
egress.to.ipBlock (IPBlock)
IPBlock defines policy on a particular IPBlock. If this field is set then neither of the other fields can be.
IPBlock describes a particular CIDR (Ex. "192.168.1.1/24","2001:db9::/64") that is allowed to the pods matched by a NetworkPolicySpec's podSelector. The except entry describes CIDRs that should not be included within this rule.
-
egress.to.ipBlock.cidr (string), required
CIDR is a string representing the IP block. Valid examples are "192.168.1.1/24" or "2001:db9::/64".
-
egress.to.ipBlock.except ([]string)
Except is a slice of CIDRs that should not be included within an IP block. Valid examples are "192.168.1.1/24" or "2001:db9::/64". Except values will be rejected if they are outside the CIDR range.
-
-
egress.to.namespaceSelector (LabelSelector)
Selects Namespaces using cluster-scoped labels. This field follows standard label selector semantics; if present but empty, it selects all namespaces.
If PodSelector is also set, then the NetworkPolicyPeer as a whole selects the Pods matching PodSelector in the Namespaces selected by NamespaceSelector. Otherwise it selects all Pods in the Namespaces selected by NamespaceSelector.
-
egress.to.podSelector (LabelSelector)
This is a label selector which selects Pods. This field follows standard label selector semantics; if present but empty, it selects all pods.
If NamespaceSelector is also set, then the NetworkPolicyPeer as a whole selects the Pods matching PodSelector in the Namespaces selected by NamespaceSelector. Otherwise it selects the Pods matching PodSelector in the policy's own Namespace.
-
-
egress.ports ([]NetworkPolicyPort)
List of destination ports for outgoing traffic. Each item in this list is combined using a logical OR. If this field is empty or missing, this rule matches all ports (traffic not restricted by port). If this field is present and contains at least one item, then this rule allows traffic only if the traffic matches at least one port in the list.
NetworkPolicyPort describes a port to allow traffic on
-
egress.ports.port (IntOrString)
The port on the given protocol. This can either be a numerical or named port on a pod. If this field is not provided, this matches all port names and numbers. If present, only traffic on the specified protocol AND port will be matched.
IntOrString is a type that can hold an int32 or a string. When used in JSON or YAML marshalling and unmarshalling, it produces or consumes the inner type. This allows you to have, for example, a JSON field that can accept a name or number.
-
egress.ports.endPort (int32)
If set, indicates that the range of ports from port to endPort, inclusive, should be allowed by the policy. This field cannot be defined if the port field is not defined or if the port field is defined as a named (string) port. The endPort must be equal to or greater than port. This feature is in Beta state and is enabled by default. It can be disabled using the Feature Gate "NetworkPolicyEndPort".
-
egress.ports.protocol (string)
The protocol (TCP, UDP, or SCTP) which traffic must match. If not specified, this field defaults to TCP.
-
-
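A NetworkPolicy sketch combining podSelector, policyTypes, an ingress rule using all three peer kinds, and an egress rule; all names, labels, CIDRs, and ports are illustrative:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: example-policy              # hypothetical name
  namespace: default
spec:
  podSelector:
    matchLabels:
      role: db                      # hypothetical label
  policyTypes: ["Ingress", "Egress"]
  ingress:
  - from:                           # the three peers below are ORed together
    - ipBlock:
        cidr: 172.17.0.0/16
        except: ["172.17.1.0/24"]
    - namespaceSelector:
        matchLabels:
          project: myproject        # hypothetical label
    - podSelector:
        matchLabels:
          role: frontend            # hypothetical label
    ports:
    - protocol: TCP
      port: 6379
  egress:
  - to:
    - ipBlock:
        cidr: 10.0.0.0/24
    ports:
    - protocol: TCP
      port: 5978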
NetworkPolicyList
NetworkPolicyList is a list of NetworkPolicy objects.
-
apiVersion: networking.k8s.io/v1
-
kind: NetworkPolicyList
-
metadata (ListMeta)
Standard list metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
items ([]NetworkPolicy), required
Items is a list of schema objects.
Operations
get
read the specified NetworkPolicy
HTTP Request
GET /apis/networking.k8s.io/v1/namespaces/{namespace}/networkpolicies/{name}
Parameters
-
name (in path): string, required
name of the NetworkPolicy
-
namespace (in path): string, required
-
pretty (in query): string
Response
200 (NetworkPolicy): OK
401: Unauthorized
list
list or watch objects of kind NetworkPolicy
HTTP Request
GET /apis/networking.k8s.io/v1/namespaces/{namespace}/networkpolicies
Parameters
-
namespace (in path): string, required
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (NetworkPolicyList): OK
401: Unauthorized
list
list or watch objects of kind NetworkPolicy
HTTP Request
GET /apis/networking.k8s.io/v1/networkpolicies
Parameters
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (NetworkPolicyList): OK
401: Unauthorized
create
create a NetworkPolicy
HTTP Request
POST /apis/networking.k8s.io/v1/namespaces/{namespace}/networkpolicies
Parameters
-
namespace (in path): string, required
-
body: NetworkPolicy, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (NetworkPolicy): OK
201 (NetworkPolicy): Created
202 (NetworkPolicy): Accepted
401: Unauthorized
update
replace the specified NetworkPolicy
HTTP Request
PUT /apis/networking.k8s.io/v1/namespaces/{namespace}/networkpolicies/{name}
Parameters
-
name (in path): string, required
name of the NetworkPolicy
-
namespace (in path): string, required
-
body: NetworkPolicy, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (NetworkPolicy): OK
201 (NetworkPolicy): Created
401: Unauthorized
patch
partially update the specified NetworkPolicy
HTTP Request
PATCH /apis/networking.k8s.io/v1/namespaces/{namespace}/networkpolicies/{name}
Parameters
-
name (in path): string, required
name of the NetworkPolicy
-
namespace (in path): string, required
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (NetworkPolicy): OK
201 (NetworkPolicy): Created
401: Unauthorized
delete
delete a NetworkPolicy
HTTP Request
DELETE /apis/networking.k8s.io/v1/namespaces/{namespace}/networkpolicies/{name}
Parameters
-
name (in path): string, required
name of the NetworkPolicy
-
namespace (in path): string, required
-
body: DeleteOptions
-
dryRun (in query): string
-
gracePeriodSeconds (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
Response
200 (Status): OK
202 (Status): Accepted
401: Unauthorized
deletecollection
delete collection of NetworkPolicy
HTTP Request
DELETE /apis/networking.k8s.io/v1/namespaces/{namespace}/networkpolicies
Parameters
-
namespace (in path): string, required
-
body: DeleteOptions
-
continue (in query): string
-
dryRun (in query): string
-
fieldSelector (in query): string
-
gracePeriodSeconds (in query): integer
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
Response
200 (Status): OK
401: Unauthorized
6.5.6.4 - PodDisruptionBudget
apiVersion: policy/v1
import "k8s.io/api/policy/v1"
PodDisruptionBudget
PodDisruptionBudget is an object to define the max disruption that can be caused to a collection of pods
-
apiVersion: policy/v1
-
kind: PodDisruptionBudget
-
metadata (ObjectMeta)
Standard object's metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
spec (PodDisruptionBudgetSpec)
Specification of the desired behavior of the PodDisruptionBudget.
-
status (PodDisruptionBudgetStatus)
Most recently observed status of the PodDisruptionBudget.
PodDisruptionBudgetSpec
PodDisruptionBudgetSpec is a description of a PodDisruptionBudget.
-
maxUnavailable (IntOrString)
An eviction is allowed if at most "maxUnavailable" pods selected by "selector" are unavailable after the eviction, i.e. even in the absence of the evicted pod. For example, one can prevent all voluntary evictions by specifying 0. This is a mutually exclusive setting with "minAvailable".
IntOrString is a type that can hold an int32 or a string. When used in JSON or YAML marshalling and unmarshalling, it produces or consumes the inner type. This allows you to have, for example, a JSON field that can accept a name or number.
-
minAvailable (IntOrString)
An eviction is allowed if at least "minAvailable" pods selected by "selector" will still be available after the eviction, i.e. even in the absence of the evicted pod. So for example you can prevent all voluntary evictions by specifying "100%".
IntOrString is a type that can hold an int32 or a string. When used in JSON or YAML marshalling and unmarshalling, it produces or consumes the inner type. This allows you to have, for example, a JSON field that can accept a name or number.
-
selector (LabelSelector)
Label query over pods whose evictions are managed by the disruption budget. A null selector will match no pods, while an empty ({}) selector will select all pods within the namespace.
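A minimal PodDisruptionBudget sketch using minAvailable and a selector; the name and label are hypothetical:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-pdb                 # hypothetical name
  namespace: default
spec:
  minAvailable: 2                   # mutually exclusive with maxUnavailable
  selector:
    matchLabels:
      app: zookeeper                # hypothetical label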
PodDisruptionBudgetStatus
PodDisruptionBudgetStatus represents information about the status of a PodDisruptionBudget. Status may trail the actual state of a system.
-
currentHealthy (int32), required
current number of healthy pods
-
desiredHealthy (int32), required
minimum desired number of healthy pods
-
disruptionsAllowed (int32), required
Number of pod disruptions that are currently allowed.
-
expectedPods (int32), required
total number of pods counted by this disruption budget
-
conditions ([]Condition)
Patch strategy: merge on key type
Map: unique values on key type will be kept during a merge
Conditions contain conditions for PDB. The disruption controller sets the DisruptionAllowed condition. The following are known values for the reason field (additional reasons could be added in the future):
- SyncFailed: The controller encountered an error and wasn't able to compute the number of allowed disruptions. Therefore no disruptions are allowed and the status of the condition will be False.
- InsufficientPods: The number of pods is either at or below the number required by the PodDisruptionBudget. No disruptions are allowed and the status of the condition will be False.
- SufficientPods: There are more pods than required by the PodDisruptionBudget. The condition will be True, and the number of allowed disruptions is provided by the disruptionsAllowed property.
Condition contains details for one aspect of the current state of this API Resource.
-
conditions.lastTransitionTime (Time), required
lastTransitionTime is the last time the condition transitioned from one status to another. This should be when the underlying condition changed. If that is not known, then using the time when the API field changed is acceptable.
Time is a wrapper around time.Time which supports correct marshaling to YAML and JSON. Wrappers are provided for many of the factory methods that the time package offers.
-
conditions.message (string), required
message is a human readable message indicating details about the transition. This may be an empty string.
-
conditions.reason (string), required
reason contains a programmatic identifier indicating the reason for the condition's last transition. Producers of specific condition types may define expected values and meanings for this field, and whether the values are considered a guaranteed API. The value should be a CamelCase string. This field may not be empty.
-
conditions.status (string), required
status of the condition, one of True, False, Unknown.
-
conditions.type (string), required
type of condition in CamelCase or in foo.example.com/CamelCase.
-
conditions.observedGeneration (int64)
observedGeneration represents the .metadata.generation that the condition was set based upon. For instance, if .metadata.generation is currently 12, but the .status.conditions[x].observedGeneration is 9, the condition is out of date with respect to the current state of the instance.
-
disruptedPods (map[string]Time)
DisruptedPods contains information about pods whose eviction was processed by the API server eviction subresource handler but has not yet been observed by the PodDisruptionBudget controller. A pod will be in this map from the time when the API server processed the eviction request to the time when the pod is seen by the PDB controller as having been marked for deletion (or after a timeout). The key in the map is the name of the pod and the value is the time when the API server processed the eviction request. If the deletion didn't occur and a pod is still there, it will be removed from the list automatically by the PodDisruptionBudget controller after some time. If everything goes smoothly, this map should be empty most of the time. A large number of entries in the map may indicate problems with pod deletions.
Time is a wrapper around time.Time which supports correct marshaling to YAML and JSON. Wrappers are provided for many of the factory methods that the time package offers.
-
observedGeneration (int64)
Most recent generation observed when updating this PDB status. DisruptionsAllowed and other status information is valid only if observedGeneration equals to PDB's object generation.
PodDisruptionBudgetList
PodDisruptionBudgetList is a collection of PodDisruptionBudgets.
-
apiVersion: policy/v1
-
kind: PodDisruptionBudgetList
-
metadata (ListMeta)
Standard object's metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
items ([]PodDisruptionBudget), required
Items is a list of PodDisruptionBudgets
Operations
get
read the specified PodDisruptionBudget
HTTP Request
GET /apis/policy/v1/namespaces/{namespace}/poddisruptionbudgets/{name}
Parameters
-
name (in path): string, required
name of the PodDisruptionBudget
-
namespace (in path): string, required
-
pretty (in query): string
Response
200 (PodDisruptionBudget): OK
401: Unauthorized
get
read status of the specified PodDisruptionBudget
HTTP Request
GET /apis/policy/v1/namespaces/{namespace}/poddisruptionbudgets/{name}/status
Parameters
-
name (in path): string, required
name of the PodDisruptionBudget
-
namespace (in path): string, required
-
pretty (in query): string
Response
200 (PodDisruptionBudget): OK
401: Unauthorized
list
list or watch objects of kind PodDisruptionBudget
HTTP Request
GET /apis/policy/v1/namespaces/{namespace}/poddisruptionbudgets
Parameters
-
namespace (in path): string, required
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (PodDisruptionBudgetList): OK
401: Unauthorized
list
list or watch objects of kind PodDisruptionBudget
HTTP Request
GET /apis/policy/v1/poddisruptionbudgets
Parameters
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (PodDisruptionBudgetList): OK
401: Unauthorized
create
create a PodDisruptionBudget
HTTP Request
POST /apis/policy/v1/namespaces/{namespace}/poddisruptionbudgets
Parameters
-
namespace (in path): string, required
-
body: PodDisruptionBudget, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (PodDisruptionBudget): OK
201 (PodDisruptionBudget): Created
202 (PodDisruptionBudget): Accepted
401: Unauthorized
update
replace the specified PodDisruptionBudget
HTTP Request
PUT /apis/policy/v1/namespaces/{namespace}/poddisruptionbudgets/{name}
Parameters
-
name (in path): string, required
name of the PodDisruptionBudget
-
namespace (in path): string, required
-
body: PodDisruptionBudget, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (PodDisruptionBudget): OK
201 (PodDisruptionBudget): Created
401: Unauthorized
update
replace status of the specified PodDisruptionBudget
HTTP Request
PUT /apis/policy/v1/namespaces/{namespace}/poddisruptionbudgets/{name}/status
Parameters
-
name (in path): string, required
name of the PodDisruptionBudget
-
namespace (in path): string, required
-
body: PodDisruptionBudget, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (PodDisruptionBudget): OK
201 (PodDisruptionBudget): Created
401: Unauthorized
patch
partially update the specified PodDisruptionBudget
HTTP Request
PATCH /apis/policy/v1/namespaces/{namespace}/poddisruptionbudgets/{name}
Parameters
-
name (in path): string, required
name of the PodDisruptionBudget
-
namespace (in path): string, required
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (PodDisruptionBudget): OK
201 (PodDisruptionBudget): Created
401: Unauthorized
patch
partially update status of the specified PodDisruptionBudget
HTTP Request
PATCH /apis/policy/v1/namespaces/{namespace}/poddisruptionbudgets/{name}/status
Parameters
-
name (in path): string, required
name of the PodDisruptionBudget
-
namespace (in path): string, required
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (PodDisruptionBudget): OK
201 (PodDisruptionBudget): Created
401: Unauthorized
delete
delete a PodDisruptionBudget
HTTP Request
DELETE /apis/policy/v1/namespaces/{namespace}/poddisruptionbudgets/{name}
Parameters
-
name (in path): string, required
name of the PodDisruptionBudget
-
namespace (in path): string, required
-
body: DeleteOptions
-
dryRun (in query): string
-
gracePeriodSeconds (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
Response
200 (Status): OK
202 (Status): Accepted
401: Unauthorized
deletecollection
delete collection of PodDisruptionBudget
HTTP Request
DELETE /apis/policy/v1/namespaces/{namespace}/poddisruptionbudgets
Parameters
-
namespace (in path): string, required
-
body: DeleteOptions
-
continue (in query): string
-
dryRun (in query): string
-
fieldSelector (in query): string
-
gracePeriodSeconds (in query): integer
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
Response
200 (Status): OK
401: Unauthorized
6.5.6.5 - PodSecurityPolicy v1beta1
apiVersion: policy/v1beta1
import "k8s.io/api/policy/v1beta1"
PodSecurityPolicy
PodSecurityPolicy governs the ability to make requests that affect the Security Context that will be applied to a pod and container. Deprecated in 1.21.
-
apiVersion: policy/v1beta1
-
kind: PodSecurityPolicy
-
metadata (ObjectMeta)
Standard object's metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
spec (PodSecurityPolicySpec)
spec defines the policy enforced.
PodSecurityPolicySpec
PodSecurityPolicySpec defines the policy enforced.
-
runAsUser (RunAsUserStrategyOptions), required
runAsUser is the strategy that will dictate the allowable RunAsUser values that may be set.
RunAsUserStrategyOptions defines the strategy type and any options used to create the strategy.
-
runAsUser.rule (string), required
rule is the strategy that will dictate the allowable RunAsUser values that may be set.
-
runAsUser.ranges ([]IDRange)
ranges are the allowed ranges of uids that may be used. If you would like to force a single uid then supply a single range with the same start and end. Required for MustRunAs.
IDRange provides a min/max of an allowed range of IDs.
-
runAsUser.ranges.max (int64), required
max is the end of the range, inclusive.
-
runAsUser.ranges.min (int64), required
min is the start of the range, inclusive.
-
-
-
runAsGroup (RunAsGroupStrategyOptions)
RunAsGroup is the strategy that will dictate the allowable RunAsGroup values that may be set. If this field is omitted, the pod's RunAsGroup can take any value. This field requires the RunAsGroup feature gate to be enabled.
RunAsGroupStrategyOptions defines the strategy type and any options used to create the strategy.
-
runAsGroup.rule (string), required
rule is the strategy that will dictate the allowable RunAsGroup values that may be set.
-
runAsGroup.ranges ([]IDRange)
ranges are the allowed ranges of gids that may be used. If you would like to force a single gid then supply a single range with the same start and end. Required for MustRunAs.
IDRange provides a min/max of an allowed range of IDs.
-
runAsGroup.ranges.max (int64), required
max is the end of the range, inclusive.
-
runAsGroup.ranges.min (int64), required
min is the start of the range, inclusive.
-
-
-
fsGroup (FSGroupStrategyOptions), required
fsGroup is the strategy that will dictate what fs group is used by the SecurityContext.
FSGroupStrategyOptions defines the strategy type and options used to create the strategy.
-
fsGroup.ranges ([]IDRange)
ranges are the allowed ranges of fs groups. If you would like to force a single fs group then supply a single range with the same start and end. Required for MustRunAs.
IDRange provides a min/max of an allowed range of IDs.
-
fsGroup.ranges.max (int64), required
max is the end of the range, inclusive.
-
fsGroup.ranges.min (int64), required
min is the start of the range, inclusive.
-
-
fsGroup.rule (string)
rule is the strategy that will dictate what FSGroup is used in the SecurityContext.
-
-
supplementalGroups (SupplementalGroupsStrategyOptions), required
supplementalGroups is the strategy that will dictate what supplemental groups are used by the SecurityContext.
SupplementalGroupsStrategyOptions defines the strategy type and options used to create the strategy.
-
supplementalGroups.ranges ([]IDRange)
ranges are the allowed ranges of supplemental groups. If you would like to force a single supplemental group then supply a single range with the same start and end. Required for MustRunAs.
IDRange provides a min/max of an allowed range of IDs.
-
supplementalGroups.ranges.max (int64), required
max is the end of the range, inclusive.
-
supplementalGroups.ranges.min (int64), required
min is the start of the range, inclusive.
-
-
supplementalGroups.rule (string)
rule is the strategy that will dictate what supplemental groups is used in the SecurityContext.
-
-
seLinux (SELinuxStrategyOptions), required
seLinux is the strategy that will dictate the allowable labels that may be set.
SELinuxStrategyOptions defines the strategy type and any options used to create the strategy.
-
seLinux.rule (string), required
rule is the strategy that will dictate the allowable labels that may be set.
-
seLinux.seLinuxOptions (SELinuxOptions)
seLinuxOptions required to run as; required for MustRunAs More info: https://kubernetes.io/docs/tasks/configure-pod-container/security-context/
SELinuxOptions are the labels to be applied to the container
-
seLinux.seLinuxOptions.level (string)
Level is SELinux level label that applies to the container.
-
seLinux.seLinuxOptions.role (string)
Role is a SELinux role label that applies to the container.
-
seLinux.seLinuxOptions.type (string)
Type is a SELinux type label that applies to the container.
-
seLinux.seLinuxOptions.user (string)
User is a SELinux user label that applies to the container.
-
-
-
readOnlyRootFilesystem (boolean)
readOnlyRootFilesystem when set to true will force containers to run with a read only root file system. If the container specifically requests to run with a non-read only root file system the PSP should deny the pod. If set to false the container may run with a read only root file system if it wishes but it will not be forced to.
-
privileged (boolean)
privileged determines if a pod can request to be run as privileged.
-
allowPrivilegeEscalation (boolean)
allowPrivilegeEscalation determines if a pod can request to allow privilege escalation. If unspecified, defaults to true.
-
defaultAllowPrivilegeEscalation (boolean)
defaultAllowPrivilegeEscalation controls the default setting for whether a process can gain more privileges than its parent process.
-
allowedCSIDrivers ([]AllowedCSIDriver)
AllowedCSIDrivers is an allowlist of inline CSI drivers that must be explicitly set to be embedded within a pod spec. An empty value indicates that any CSI driver can be used for inline ephemeral volumes. This is a beta field, and is only honored if the API server enables the CSIInlineVolume feature gate.
AllowedCSIDriver represents a single inline CSI Driver that is allowed to be used.
-
allowedCSIDrivers.name (string), required
Name is the registered name of the CSI driver
-
-
allowedCapabilities ([]string)
allowedCapabilities is a list of capabilities that can be requested to add to the container. Capabilities in this field may be added at the pod author's discretion. You must not list a capability in both allowedCapabilities and requiredDropCapabilities.
-
requiredDropCapabilities ([]string)
requiredDropCapabilities are the capabilities that will be dropped from the container. These are required to be dropped and cannot be added.
-
defaultAddCapabilities ([]string)
defaultAddCapabilities is the default set of capabilities that will be added to the container unless the pod spec specifically drops the capability. You may not list a capability in both defaultAddCapabilities and requiredDropCapabilities. Capabilities added here are implicitly allowed, and need not be included in the allowedCapabilities list.
-
allowedFlexVolumes ([]AllowedFlexVolume)
allowedFlexVolumes is an allowlist of Flexvolumes. Empty or nil indicates that all Flexvolumes may be used. This parameter is effective only when the usage of the Flexvolumes is allowed in the "volumes" field.
AllowedFlexVolume represents a single Flexvolume that is allowed to be used.
-
allowedFlexVolumes.driver (string), required
driver is the name of the Flexvolume driver.
-
-
allowedHostPaths ([]AllowedHostPath)
allowedHostPaths is an allowlist of host paths. Empty indicates that all host paths may be used.
AllowedHostPath defines the host volume conditions that will be enabled by a policy for pods to use. It requires the path prefix to be defined.
-
allowedHostPaths.pathPrefix (string)
pathPrefix is the path prefix that the host volume must match. It does not support '*'. Trailing slashes are trimmed when validating the path prefix with a host path. Examples: /foo would allow /foo, /foo/ and /foo/bar; /foo would not allow /food or /etc/foo.
-
allowedHostPaths.readOnly (boolean)
when set to true, will allow host volumes matching the pathPrefix only if all volume mounts are readOnly.
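As a sketch of how the two fields above combine, reusing the /foo prefix from the example:

allowedHostPaths:
- pathPrefix: /foo             # host volumes must live under /foo
  readOnly: true               # and every mount of such a volume must be readOnly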
-
-
allowedProcMountTypes ([]string)
AllowedProcMountTypes is an allowlist of allowed ProcMountTypes. Empty or nil indicates that only the DefaultProcMountType may be used. This requires the ProcMountType feature flag to be enabled.
-
allowedUnsafeSysctls ([]string)
allowedUnsafeSysctls is a list of explicitly allowed unsafe sysctls, defaults to none. Each entry is either a plain sysctl name or ends in "*" in which case it is considered as a prefix of allowed sysctls. Single * means all unsafe sysctls are allowed. Kubelet has to allowlist all allowed unsafe sysctls explicitly to avoid rejection.
Examples: "foo/*" allows "foo/bar", "foo/baz", etc.; "foo.*" allows "foo.bar", "foo.baz", etc.
-
forbiddenSysctls ([]string)
forbiddenSysctls is a list of explicitly forbidden sysctls, defaults to none. Each entry is either a plain sysctl name or ends in "*" in which case it is considered as a prefix of forbidden sysctls. Single * means all sysctls are forbidden.
Examples: "foo/*" forbids "foo/bar", "foo/baz", etc.; "foo.*" forbids "foo.bar", "foo.baz", etc.
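A sketch of how these two sysctl lists might appear in a policy spec; the sysctl names are illustrative only:

allowedUnsafeSysctls:
- kernel.msg*                  # prefix match: allows kernel.msgmax, kernel.msgmnb, etc.
- net.core.somaxconn           # exact match
forbiddenSysctls:
- kernel.shm_rmid_forced       # explicitly forbidden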
-
hostIPC (boolean)
hostIPC determines if the policy allows the use of HostIPC in the pod spec.
-
hostNetwork (boolean)
hostNetwork determines if the policy allows the use of HostNetwork in the pod spec.
-
hostPID (boolean)
hostPID determines if the policy allows the use of HostPID in the pod spec.
-
hostPorts ([]HostPortRange)
hostPorts determines which host port ranges are allowed to be exposed.
HostPortRange defines a range of host ports that will be enabled by a policy for pods to use. It requires both the start and end to be defined.
-
hostPorts.max (int32), required
max is the end of the range, inclusive.
-
hostPorts.min (int32), required
min is the start of the range, inclusive.
-
-
runtimeClass (RuntimeClassStrategyOptions)
runtimeClass is the strategy that will dictate the allowable RuntimeClasses for a pod. If this field is omitted, the pod's runtimeClassName field is unrestricted. Enforcement of this field depends on the RuntimeClass feature gate being enabled.
RuntimeClassStrategyOptions define the strategy that will dictate the allowable RuntimeClasses for a pod.
-
runtimeClass.allowedRuntimeClassNames ([]string), required
allowedRuntimeClassNames is an allowlist of RuntimeClass names that may be specified on a pod. A value of "*" means that any RuntimeClass name is allowed, and must be the only item in the list. An empty list requires the RuntimeClassName field to be unset.
-
runtimeClass.defaultRuntimeClassName (string)
defaultRuntimeClassName is the default RuntimeClassName to set on the pod. The default MUST be allowed by the allowedRuntimeClassNames list. A value of nil does not mutate the Pod.
-
-
volumes ([]string)
volumes is an allowlist of volume plugins. Empty indicates that no volumes may be used. To allow all volumes you may use '*'.
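Pulling several of the fields above together, a restrictive PodSecurityPolicy might look roughly like the following sketch; the name, ID ranges, and volume list are illustrative assumptions, not a recommended baseline:

apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: example-restricted     # hypothetical name
spec:
  privileged: false
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  runAsUser:
    rule: MustRunAs            # forbid running as root
    ranges:
    - min: 1000
      max: 65535
  runAsGroup:
    rule: MustRunAs
    ranges:
    - min: 1000
      max: 65535
  fsGroup:
    rule: MustRunAs
    ranges:
    - min: 1000
      max: 65535
  supplementalGroups:
    rule: MustRunAs
    ranges:
    - min: 1000
      max: 65535
  seLinux:
    rule: RunAsAny
  volumes:                     # allowlist of volume plugins; '*' would allow all
  - configMap
  - secret
  - emptyDir
  - persistentVolumeClaim

Since PodSecurityPolicy is deprecated (see above), prefer Pod Security Admission for new clusters.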
PodSecurityPolicyList
PodSecurityPolicyList is a list of PodSecurityPolicy objects.
-
apiVersion: policy/v1beta1
-
kind: PodSecurityPolicyList
-
metadata (ListMeta)
Standard list metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
items ([]PodSecurityPolicy), required
items is a list of schema objects.
Operations
get
read the specified PodSecurityPolicy
HTTP Request
GET /apis/policy/v1beta1/podsecuritypolicies/{name}
Parameters
-
name (in path): string, required
name of the PodSecurityPolicy
-
pretty (in query): string
Response
200 (PodSecurityPolicy): OK
401: Unauthorized
list
list or watch objects of kind PodSecurityPolicy
HTTP Request
GET /apis/policy/v1beta1/podsecuritypolicies
Parameters
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (PodSecurityPolicyList): OK
401: Unauthorized
create
create a PodSecurityPolicy
HTTP Request
POST /apis/policy/v1beta1/podsecuritypolicies
Parameters
-
body: PodSecurityPolicy, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (PodSecurityPolicy): OK
201 (PodSecurityPolicy): Created
202 (PodSecurityPolicy): Accepted
401: Unauthorized
update
replace the specified PodSecurityPolicy
HTTP Request
PUT /apis/policy/v1beta1/podsecuritypolicies/{name}
Parameters
-
name (in path): string, required
name of the PodSecurityPolicy
-
body: PodSecurityPolicy, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (PodSecurityPolicy): OK
201 (PodSecurityPolicy): Created
401: Unauthorized
patch
partially update the specified PodSecurityPolicy
HTTP Request
PATCH /apis/policy/v1beta1/podsecuritypolicies/{name}
Parameters
-
name (in path): string, required
name of the PodSecurityPolicy
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (PodSecurityPolicy): OK
201 (PodSecurityPolicy): Created
401: Unauthorized
delete
delete a PodSecurityPolicy
HTTP Request
DELETE /apis/policy/v1beta1/podsecuritypolicies/{name}
Parameters
-
name (in path): string, required
name of the PodSecurityPolicy
-
body: DeleteOptions
-
dryRun (in query): string
-
gracePeriodSeconds (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
Response
200 (PodSecurityPolicy): OK
202 (PodSecurityPolicy): Accepted
401: Unauthorized
deletecollection
delete collection of PodSecurityPolicy
HTTP Request
DELETE /apis/policy/v1beta1/podsecuritypolicies
Parameters
-
body: DeleteOptions
-
continue (in query): string
-
dryRun (in query): string
-
fieldSelector (in query): string
-
gracePeriodSeconds (in query): integer
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
Response
200 (Status): OK
401: Unauthorized
6.5.7 - Extend Resources
6.5.7.1 - CustomResourceDefinition
apiVersion: apiextensions.k8s.io/v1
import "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
CustomResourceDefinition
CustomResourceDefinition represents a resource that should be exposed on the API server. Its name MUST be in the format <.spec.name>.<.spec.group>.
-
apiVersion: apiextensions.k8s.io/v1
-
kind: CustomResourceDefinition
-
metadata (ObjectMeta)
Standard object's metadata More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
spec (CustomResourceDefinitionSpec), required
spec describes how the user wants the resources to appear
-
status (CustomResourceDefinitionStatus)
status indicates the actual state of the CustomResourceDefinition
CustomResourceDefinitionSpec
CustomResourceDefinitionSpec describes how a user wants their resource to appear
-
group (string), required
group is the API group of the defined custom resource. The custom resources are served under /apis/&lt;group&gt;/.... Must match the name of the CustomResourceDefinition (in the form &lt;names.plural&gt;.&lt;group&gt;).
-
names (CustomResourceDefinitionNames), required
names specify the resource and kind names for the custom resource.
CustomResourceDefinitionNames indicates the names to serve this CustomResourceDefinition
-
names.kind (string), required
kind is the serialized kind of the resource. It is normally CamelCase and singular. Custom resource instances will use this value as the kind attribute in API calls.
-
names.plural (string), required
plural is the plural name of the resource to serve. The custom resources are served under /apis/&lt;group&gt;/&lt;version&gt;/.../&lt;plural&gt;. Must match the name of the CustomResourceDefinition (in the form &lt;names.plural&gt;.&lt;group&gt;). Must be all lowercase.
-
names.categories ([]string)
categories is a list of grouped resources this custom resource belongs to (e.g. 'all'). This is published in API discovery documents, and used by clients to support invocations like kubectl get all.
-
names.listKind (string)
listKind is the serialized kind of the list for this resource. Defaults to "kindList".
-
names.shortNames ([]string)
shortNames are short names for the resource, exposed in API discovery documents, and used by clients to support invocations like kubectl get &lt;shortname&gt;. It must be all lowercase.
-
names.singular (string)
singular is the singular name of the resource. It must be all lowercase. Defaults to lowercased kind.
-
-
scope (string), required
scope indicates whether the defined custom resource is cluster- or namespace-scoped. Allowed values are Cluster and Namespaced.
-
versions ([]CustomResourceDefinitionVersion), required
versions is the list of all API versions of the defined custom resource. Version names are used to compute the order in which served versions are listed in API discovery. If the version string is "kube-like", it will sort above non "kube-like" version strings, which are ordered lexicographically. "Kube-like" versions start with a "v", then are followed by a number (the major version), then optionally the string "alpha" or "beta" and another number (the minor version). These are sorted first by GA > beta > alpha (where GA is a version with no suffix such as beta or alpha), and then by comparing major version, then minor version. An example sorted list of versions: v10, v2, v1, v11beta2, v10beta3, v3beta1, v12alpha1, v11alpha2, foo1, foo10.
CustomResourceDefinitionVersion describes a version for CRD.
-
versions.name (string), required
name is the version name, e.g. "v1", "v2beta1", etc. The custom resources are served under this version at /apis/&lt;group&gt;/&lt;version&gt;/... if served is true.
-
versions.served (boolean), required
served is a flag enabling/disabling this version from being served via REST APIs
-
versions.storage (boolean), required
storage indicates this version should be used when persisting custom resources to storage. There must be exactly one version with storage=true.
-
versions.additionalPrinterColumns ([]CustomResourceColumnDefinition)
additionalPrinterColumns specifies additional columns returned in Table output. See https://kubernetes.io/docs/reference/using-api/api-concepts/#receiving-resources-as-tables for details. If no columns are specified, a single column displaying the age of the custom resource is used.
CustomResourceColumnDefinition specifies a column for server side printing.
-
versions.additionalPrinterColumns.jsonPath (string), required
jsonPath is a simple JSON path (i.e. with array notation) which is evaluated against each custom resource to produce the value for this column.
-
versions.additionalPrinterColumns.name (string), required
name is a human readable name for the column.
-
versions.additionalPrinterColumns.type (string), required
type is an OpenAPI type definition for this column. See https://github.com/OAI/OpenAPI-Specification/blob/master/versions/2.0.md#data-types for details.
-
versions.additionalPrinterColumns.description (string)
description is a human readable description of this column.
-
versions.additionalPrinterColumns.format (string)
format is an optional OpenAPI type definition for this column. The 'name' format is applied to the primary identifier column to assist clients in identifying the column that is the resource name. See https://github.com/OAI/OpenAPI-Specification/blob/master/versions/2.0.md#data-types for details.
-
versions.additionalPrinterColumns.priority (int32)
priority is an integer defining the relative importance of this column compared to others. Lower numbers are considered higher priority. Columns that may be omitted in limited space scenarios should be given a priority greater than 0.
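For example, a version entry might declare extra printer columns like this sketch; the Spec column assumes a hypothetical .spec.cronSpec field:

additionalPrinterColumns:
- name: Spec
  type: string
  description: The cron spec defining the interval   # hypothetical field
  jsonPath: .spec.cronSpec
- name: Age
  type: date
  jsonPath: .metadata.creationTimestamp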
-
-
versions.deprecated (boolean)
deprecated indicates this version of the custom resource API is deprecated. When set to true, API requests to this version receive a warning header in the server response. Defaults to false.
-
versions.deprecationWarning (string)
deprecationWarning overrides the default warning returned to API clients. May only be set when deprecated is true. The default warning indicates this version is deprecated and recommends use of the newest served version of equal or greater stability, if one exists.
-
versions.schema (CustomResourceValidation)
schema describes the schema used for validation, pruning, and defaulting of this version of the custom resource.
CustomResourceValidation is a list of validation methods for CustomResources.
-
versions.schema.openAPIV3Schema (JSONSchemaProps)
openAPIV3Schema is the OpenAPI v3 schema to use for validation and pruning.
-
-
versions.subresources (CustomResourceSubresources)
subresources specify what subresources this version of the defined custom resource has.
CustomResourceSubresources defines the status and scale subresources for CustomResources.
-
versions.subresources.scale (CustomResourceSubresourceScale)
scale indicates the custom resource should serve a /scale subresource that returns an autoscaling/v1 Scale object.
CustomResourceSubresourceScale defines how to serve the scale subresource for CustomResources.
-
versions.subresources.scale.specReplicasPath (string), required
specReplicasPath defines the JSON path inside of a custom resource that corresponds to Scale spec.replicas. Only JSON paths without the array notation are allowed. Must be a JSON Path under .spec. If there is no value under the given path in the custom resource, the /scale subresource will return an error on GET.
-
versions.subresources.scale.statusReplicasPath (string), required
statusReplicasPath defines the JSON path inside of a custom resource that corresponds to Scale status.replicas. Only JSON paths without the array notation are allowed. Must be a JSON Path under .status. If there is no value under the given path in the custom resource, the status.replicas value in the /scale subresource will default to 0.
-
versions.subresources.scale.labelSelectorPath (string)
labelSelectorPath defines the JSON path inside of a custom resource that corresponds to Scale status.selector. Only JSON paths without the array notation are allowed. Must be a JSON Path under .status or .spec. Must be set to work with HorizontalPodAutoscaler. The field pointed at by this JSON path must be a string field (not a complex selector struct) which contains a serialized label selector in string form. More info: https://kubernetes.io/docs/tasks/access-kubernetes-api/custom-resources/custom-resource-definitions#scale-subresource If there is no value under the given path in the custom resource, the status.selector value in the /scale subresource will default to the empty string.
-
-
versions.subresources.status (CustomResourceSubresourceStatus)
status indicates the custom resource should serve a /status subresource. When enabled: 1. requests to the custom resource primary endpoint ignore changes to the status stanza of the object; 2. requests to the custom resource /status subresource ignore changes to anything other than the status stanza of the object.
CustomResourceSubresourceStatus defines how to serve the status subresource for CustomResources. Status is represented by the .status JSON path inside of a CustomResource. When set:
- exposes a /status subresource for the custom resource
- PUT requests to the /status subresource take a custom resource object, and ignore changes to anything except the status stanza
- PUT/POST/PATCH requests to the custom resource ignore changes to the status stanza
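A sketch of a subresources stanza (placed under an entry in spec.versions) combining both subresources, assuming the custom resource keeps replica counts at .spec.replicas and .status.replicas and a serialized selector at .status.labelSelector:

subresources:
  status: {}                                   # enable the /status subresource
  scale:                                       # enable the /scale subresource
    specReplicasPath: .spec.replicas
    statusReplicasPath: .status.replicas
    labelSelectorPath: .status.labelSelector   # needed to work with HorizontalPodAutoscaler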
-
-
-
conversion (CustomResourceConversion)
conversion defines conversion settings for the CRD.
CustomResourceConversion describes how to convert different versions of a CR.
-
conversion.strategy (string), required
strategy specifies how custom resources are converted between versions. Allowed values are:
- None: the converter only changes the apiVersion and does not touch any other field in the custom resource.
- Webhook: the API server will call an external webhook to do the conversion. Additional information is needed for this option. This requires spec.preserveUnknownFields to be false, and spec.conversion.webhook to be set.
-
conversion.webhook (WebhookConversion)
webhook describes how to call the conversion webhook. Required when strategy is set to Webhook.
WebhookConversion describes how to call a conversion webhook
-
conversion.webhook.conversionReviewVersions ([]string), required
conversionReviewVersions is an ordered list of preferred ConversionReview versions the Webhook expects. The API server will use the first version in the list which it supports. If none of the versions specified in this list are supported by the API server, conversion will fail for the custom resource. If a persisted Webhook configuration specifies allowed versions and does not include any versions known to the API Server, calls to the webhook will fail.
-
conversion.webhook.clientConfig (WebhookClientConfig)
clientConfig is the instructions for how to call the webhook if strategy is Webhook.
WebhookClientConfig contains the information to make a TLS connection with the webhook.
-
conversion.webhook.clientConfig.caBundle ([]byte)
caBundle is a PEM encoded CA bundle which will be used to validate the webhook's server certificate. If unspecified, system trust roots on the apiserver are used.
-
conversion.webhook.clientConfig.service (ServiceReference)
service is a reference to the service for this webhook. Either service or url must be specified.
If the webhook is running within the cluster, then you should use service.
ServiceReference holds a reference to Service.legacy.k8s.io
-
conversion.webhook.clientConfig.service.name (string), required
name is the name of the service. Required
-
conversion.webhook.clientConfig.service.namespace (string), required
namespace is the namespace of the service. Required
-
conversion.webhook.clientConfig.service.path (string)
path is an optional URL path at which the webhook will be contacted.
-
conversion.webhook.clientConfig.service.port (int32)
port is an optional service port at which the webhook will be contacted. port should be a valid port number (1-65535, inclusive). Defaults to 443 for backward compatibility.
-
-
conversion.webhook.clientConfig.url (string)
url gives the location of the webhook, in standard URL form (scheme://host:port/path). Exactly one of url or service must be specified.
The host should not refer to a service running in the cluster; use the service field instead. The host might be resolved via external DNS in some apiservers (e.g., kube-apiserver cannot resolve in-cluster DNS as that would be a layering violation). host may also be an IP address.
Please note that using localhost or 127.0.0.1 as a host is risky unless you take great care to run this webhook on all hosts which run an apiserver which might need to make calls to this webhook. Such installs are likely to be non-portable, i.e., not easy to turn up in a new cluster.
The scheme must be "https"; the URL must begin with "https://".
A path is optional, and if present may be any string permissible in a URL. You may use the path to pass an arbitrary string to the webhook, for example, a cluster identifier.
Attempting to use a user or basic auth e.g. "user:password@" is not allowed. Fragments ("#...") and query parameters ("?...") are not allowed, either.
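Putting the conversion fields together, a Webhook conversion stanza might look like this sketch; the service namespace, name, and path are hypothetical:

conversion:
  strategy: Webhook
  webhook:
    conversionReviewVersions: ["v1"]           # the API server uses the first version it supports
    clientConfig:
      service:
        namespace: example-system              # hypothetical namespace
        name: example-conversion-webhook       # hypothetical service name
        path: /convert
        port: 443
      # caBundle: <base64 PEM bundle>          # validates the webhook's serving certificate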
-
-
-
-
preserveUnknownFields (boolean)
preserveUnknownFields indicates that object fields which are not specified in the OpenAPI schema should be preserved when persisting to storage. apiVersion, kind, metadata and known fields inside metadata are always preserved. This field is deprecated in favor of setting x-kubernetes-preserve-unknown-fields to true in spec.versions[*].schema.openAPIV3Schema. See https://kubernetes.io/docs/tasks/access-kubernetes-api/custom-resources/custom-resource-definitions/#pruning-versus-preserving-unknown-fields for details.
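As a worked example of the spec described above, a minimal CustomResourceDefinition manifest might look like the following sketch; the example.com group and the CronTab names and schema are hypothetical:

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: crontabs.example.com   # must be <names.plural>.<group>
spec:
  group: example.com
  scope: Namespaced
  names:
    plural: crontabs
    singular: crontab
    kind: CronTab
    listKind: CronTabList
    shortNames:
    - ct
  versions:
  - name: v1
    served: true               # exposed via REST
    storage: true              # exactly one version must have storage=true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              cronSpec:
                type: string
              replicas:
                type: integer
  conversion:
    strategy: None             # no webhook needed while only one version is served

Once created, the API server would serve this hypothetical resource at /apis/example.com/v1/namespaces/{namespace}/crontabs.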
JSONSchemaProps
JSONSchemaProps is a JSON-Schema following Specification Draft 4 (http://json-schema.org/).
-
$ref (string)
-
$schema (string)
-
additionalItems (JSONSchemaPropsOrBool)
JSONSchemaPropsOrBool represents JSONSchemaProps or a boolean value. Defaults to true for the boolean property.
-
additionalProperties (JSONSchemaPropsOrBool)
JSONSchemaPropsOrBool represents JSONSchemaProps or a boolean value. Defaults to true for the boolean property.
-
allOf ([]JSONSchemaProps)
-
anyOf ([]JSONSchemaProps)
-
default (JSON)
default is a default value for undefined object fields. Defaulting is a beta feature under the CustomResourceDefaulting feature gate. Defaulting requires spec.preserveUnknownFields to be false.
JSON represents any valid JSON value. These types are supported: bool, int64, float64, string, []interface{}, map[string]interface{} and nil.
-
definitions (map[string]JSONSchemaProps)
-
dependencies (map[string]JSONSchemaPropsOrStringArray)
JSONSchemaPropsOrStringArray represents a JSONSchemaProps or a string array.
-
description (string)
-
enum ([]JSON)
JSON represents any valid JSON value. These types are supported: bool, int64, float64, string, []interface{}, map[string]interface{} and nil.
-
example (JSON)
JSON represents any valid JSON value. These types are supported: bool, int64, float64, string, []interface{}, map[string]interface{} and nil.
-
exclusiveMaximum (boolean)
-
exclusiveMinimum (boolean)
-
externalDocs (ExternalDocumentation)
ExternalDocumentation allows referencing an external resource for extended documentation.
-
externalDocs.description (string)
-
externalDocs.url (string)
-
-
format (string)
format is an OpenAPI v3 format string. Unknown formats are ignored. The following formats are validated:
- bsonobjectid: a bson object ID, i.e. a 24 characters hex string
- uri: an URI as parsed by Golang net/url.ParseRequestURI
- email: an email address as parsed by Golang net/mail.ParseAddress
- hostname: a valid representation for an Internet host name, as defined by RFC 1034, section 3.1 [RFC1034]
- ipv4: an IPv4 IP as parsed by Golang net.ParseIP
- ipv6: an IPv6 IP as parsed by Golang net.ParseIP
- cidr: a CIDR as parsed by Golang net.ParseCIDR
- mac: a MAC address as parsed by Golang net.ParseMAC
- uuid: an UUID that allows uppercase defined by the regex (?i)^[0-9a-f]{8}-?[0-9a-f]{4}-?[0-9a-f]{4}-?[0-9a-f]{4}-?[0-9a-f]{12}$
- uuid3: an UUID3 that allows uppercase defined by the regex (?i)^[0-9a-f]{8}-?[0-9a-f]{4}-?3[0-9a-f]{3}-?[0-9a-f]{4}-?[0-9a-f]{12}$
- uuid4: an UUID4 that allows uppercase defined by the regex (?i)^[0-9a-f]{8}-?[0-9a-f]{4}-?4[0-9a-f]{3}-?[89ab][0-9a-f]{3}-?[0-9a-f]{12}$
- uuid5: an UUID5 that allows uppercase defined by the regex (?i)^[0-9a-f]{8}-?[0-9a-f]{4}-?5[0-9a-f]{3}-?[89ab][0-9a-f]{3}-?[0-9a-f]{12}$
- isbn: an ISBN10 or ISBN13 number string like "0321751043" or "978-0321751041"
- isbn10: an ISBN10 number string like "0321751043"
- isbn13: an ISBN13 number string like "978-0321751041"
- creditcard: a credit card number defined by the regex ^(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}|6(?:011|5[0-9][0-9])[0-9]{12}|3[47][0-9]{13}|3(?:0[0-5]|[68][0-9])[0-9]{11}|(?:2131|1800|35\d{3})\d{11})$ with any non digit characters mixed in
- ssn: a U.S. social security number following the regex ^\d{3}[- ]?\d{2}[- ]?\d{4}$
- hexcolor: a hexadecimal color code like "#FFFFFF" following the regex ^#?([0-9a-fA-F]{3}|[0-9a-fA-F]{6})$
- rgbcolor: an RGB color code like "rgb(255,255,255)"
- byte: base64 encoded binary data
- password: any kind of string
- date: a date string like "2006-01-02" as defined by full-date in RFC3339
- duration: a duration string like "22 ns" as parsed by Golang time.ParseDuration or compatible with Scala duration format
- datetime: a date time string like "2014-12-15T19:30:20.000Z" as defined by date-time in RFC3339
-
id (string)
-
items (JSONSchemaPropsOrArray)
JSONSchemaPropsOrArray represents a value that can either be a JSONSchemaProps or an array of JSONSchemaProps. Mainly here for serialization purposes.
-
maxItems (int64)
-
maxLength (int64)
-
maxProperties (int64)
-
maximum (double)
-
minItems (int64)
-
minLength (int64)
-
minProperties (int64)
-
minimum (double)
-
multipleOf (double)
-
not (JSONSchemaProps)
-
nullable (boolean)
-
oneOf ([]JSONSchemaProps)
-
pattern (string)
-
patternProperties (map[string]JSONSchemaProps)
-
properties (map[string]JSONSchemaProps)
-
required ([]string)
-
title (string)
-
type (string)
-
uniqueItems (boolean)
-
x-kubernetes-embedded-resource (boolean)
x-kubernetes-embedded-resource defines that the value is an embedded Kubernetes runtime.Object, with TypeMeta and ObjectMeta. The type must be object. It is allowed to further restrict the embedded object. kind, apiVersion and metadata are validated automatically. x-kubernetes-preserve-unknown-fields is allowed to be true, but does not have to be if the object is fully specified (up to kind, apiVersion, metadata).
-
x-kubernetes-int-or-string (boolean)
x-kubernetes-int-or-string specifies that this value is either an integer or a string. If this is true, an empty type is allowed and type as child of anyOf is permitted if it follows one of the following patterns:
1) anyOf:
   - type: integer
   - type: string
2) allOf:
   - anyOf:
     - type: integer
     - type: string
   - ... zero or more
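A sketch of the first pattern applied to a hypothetical targetPort property:

targetPort:
  x-kubernetes-int-or-string: true
  anyOf:
  - type: integer
  - type: string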
-
x-kubernetes-list-map-keys ([]string)
x-kubernetes-list-map-keys annotates an array with the x-kubernetes-list-type map by specifying the keys used as the index of the map.
This tag MUST only be used on lists that have the "x-kubernetes-list-type" extension set to "map". Also, the values specified for this attribute must be a scalar typed field of the child structure (no nesting is supported).
The properties specified must either be required or have a default value, to ensure those properties are present for all list items.
-
x-kubernetes-list-type (string)
x-kubernetes-list-type annotates an array to further describe its topology. This extension must only be used on lists and may have 3 possible values:
- atomic: the list is treated as a single entity, like a scalar. Atomic lists will be entirely replaced when updated. This extension may be used on any type of list (struct, scalar, ...).
- set: Sets are lists that must not have multiple items with the same value. Each value must be a scalar, an object with x-kubernetes-map-type atomic or an array with x-kubernetes-list-type atomic.
- map: These lists are like maps in that their elements have a non-index key used to identify them. Order is preserved upon merge. The map tag must only be used on a list with elements of type object.
Defaults to atomic for arrays.
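A schema fragment combining x-kubernetes-list-type: map with x-kubernetes-list-map-keys from above, for a hypothetical endpoints list keyed by name:

endpoints:
  type: array
  x-kubernetes-list-type: map    # elements are identified by key, not by position
  x-kubernetes-list-map-keys:
  - name                         # must be a required scalar field of each item
  items:
    type: object
    required:
    - name
    properties:
      name:
        type: string
      address:
        type: string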
-
x-kubernetes-map-type (string)
x-kubernetes-map-type annotates an object to further describe its topology. This extension must only be used when type is object and may have 2 possible values:
- granular: These maps are actual maps (key-value pairs) and each field is independent from the others (they can each be manipulated by separate actors). This is the default behaviour for all maps.
- atomic: the map is treated as a single entity, like a scalar. Atomic maps will be entirely replaced when updated.
-
x-kubernetes-preserve-unknown-fields (boolean)
x-kubernetes-preserve-unknown-fields stops the API server decoding step from pruning fields which are not specified in the validation schema. This affects fields recursively, but switches back to normal pruning behaviour if nested properties or additionalProperties are specified in the schema. This can either be true or undefined. False is forbidden.
-
x-kubernetes-validations ([]ValidationRule)
Patch strategy: merge on key rule
Map: unique values on key rule will be kept during a merge
x-kubernetes-validations describes a list of validation rules written in the CEL expression language. This field is alpha-level. Using this field requires the CustomResourceValidationExpressions feature gate to be enabled.
ValidationRule describes a validation rule written in the CEL expression language.
-
x-kubernetes-validations.rule (string), required
Rule represents the expression which will be evaluated by CEL. ref: https://github.com/google/cel-spec The Rule is scoped to the location of the x-kubernetes-validations extension in the schema. The self variable in the CEL expression is bound to the scoped value. Example:
- Rule scoped to the root of a resource with a status subresource: {"rule": "self.status.actual <= self.spec.maxDesired"}
If the Rule is scoped to an object with properties, the accessible properties of the object are field selectable via self.field and field presence can be checked via has(self.field). Null valued fields are treated as absent fields in CEL expressions. If the Rule is scoped to an object with additionalProperties (i.e. a map) the values of the map are accessible via self[mapKey], map containment can be checked via mapKey in self and all entries of the map are accessible via CEL macros and functions such as self.all(...). If the Rule is scoped to an array, the elements of the array are accessible via self[i] and also by macros and functions. If the Rule is scoped to a scalar, self is bound to the scalar value. Examples:
- Rule scoped to a map of objects: {"rule": "self.components['Widget'].priority < 10"}
- Rule scoped to a list of integers: {"rule": "self.values.all(value, value >= 0 && value < 100)"}
- Rule scoped to a string value: {"rule": "self.startsWith('kube')"}
The apiVersion, kind, metadata.name and metadata.generateName are always accessible from the root of the object and from any x-kubernetes-embedded-resource annotated objects. No other metadata properties are accessible.
Unknown data preserved in custom resources via x-kubernetes-preserve-unknown-fields is not accessible in CEL expressions. This includes:
- Unknown field values that are preserved by object schemas with x-kubernetes-preserve-unknown-fields.
- Object properties where the property schema is of an "unknown type". An "unknown type" is recursively defined as:
  - A schema with no type and x-kubernetes-preserve-unknown-fields set to true
  - An array where the items schema is of an "unknown type"
  - An object where the additionalProperties schema is of an "unknown type"
Only property names of the form [a-zA-Z_.-/][a-zA-Z0-9_.-/]* are accessible. Accessible property names are escaped according to the following rules when accessed in the expression:
- '__' escapes to '__underscores__'
- '.' escapes to '__dot__'
- '-' escapes to '__dash__'
- '/' escapes to '__slash__'
- Property names that exactly match a CEL RESERVED keyword escape to '__{keyword}__'. The keywords are: "true", "false", "null", "in", "as", "break", "const", "continue", "else", "for", "function", "if", "import", "let", "loop", "package", "namespace", "return".
Examples:
- Rule accessing a property named "namespace": {"rule": "self.__namespace__ > 0"}
- Rule accessing a property named "x-prop": {"rule": "self.x__dash__prop > 0"}
- Rule accessing a property named "redact__d": {"rule": "self.redact__underscores__d > 0"}
Equality on arrays with x-kubernetes-list-type of 'set' or 'map' ignores element order, i.e. [1, 2] == [2, 1]. Concatenation on arrays with x-kubernetes-list-type uses the semantics of the list type:
- 'set': X + Y performs a union where the array positions of all elements in X are preserved and non-intersecting elements in Y are appended, retaining their partial order.
- 'map': X + Y performs a merge where the array positions of all keys in X are preserved but the values are overwritten by values in Y when the key sets of X and Y intersect. Elements in Y with non-intersecting keys are appended, retaining their partial order.
-
x-kubernetes-validations.message (string)
Message represents the message displayed when validation fails. The message is required if the Rule contains line breaks. The message must not contain line breaks. If unset, the message is "failed rule: {Rule}". e.g. "must be a URL with the host matching spec.host"
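To make the rule scoping above concrete, a schema fragment using a validation rule might look like this sketch; the minReplicas/maxReplicas fields are hypothetical, and the CustomResourceValidationExpressions feature gate must be enabled as noted above:

openAPIV3Schema:
  type: object
  properties:
    spec:
      type: object
      x-kubernetes-validations:   # self is bound to the spec object here
      - rule: "self.minReplicas <= self.maxReplicas"
        message: "minReplicas must not exceed maxReplicas"
      properties:
        minReplicas:
          type: integer
        maxReplicas:
          type: integer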
-
CustomResourceDefinitionStatus
CustomResourceDefinitionStatus indicates the state of the CustomResourceDefinition
-
acceptedNames (CustomResourceDefinitionNames)
acceptedNames are the names that are actually being used to serve discovery. They may be different from the names in spec.
CustomResourceDefinitionNames indicates the names to serve this CustomResourceDefinition
-
acceptedNames.kind (string), required
kind is the serialized kind of the resource. It is normally CamelCase and singular. Custom resource instances will use this value as the kind attribute in API calls.
-
acceptedNames.plural (string), required
plural is the plural name of the resource to serve. The custom resources are served under /apis/&lt;group&gt;/&lt;version&gt;/.../&lt;plural&gt;. Must match the name of the CustomResourceDefinition (in the form &lt;names.plural&gt;.&lt;group&gt;). Must be all lowercase.
-
acceptedNames.categories ([]string)
categories is a list of grouped resources this custom resource belongs to (e.g. 'all'). This is published in API discovery documents, and used by clients to support invocations like kubectl get all.
-
acceptedNames.listKind (string)
listKind is the serialized kind of the list for this resource. Defaults to "kindList".
-
acceptedNames.shortNames ([]string)
shortNames are short names for the resource, exposed in API discovery documents, and used by clients to support invocations like kubectl get &lt;shortname&gt;. It must be all lowercase.
-
acceptedNames.singular (string)
singular is the singular name of the resource. It must be all lowercase. Defaults to lowercased kind.
-
-
conditions ([]CustomResourceDefinitionCondition)
Map: unique values on key type will be kept during a merge
conditions indicate state for particular aspects of a CustomResourceDefinition
CustomResourceDefinitionCondition contains details for the current condition of this CustomResourceDefinition.
-
conditions.status (string), required
status is the status of the condition. Can be True, False, Unknown.
-
conditions.type (string), required
type is the type of the condition. Types include Established, NamesAccepted and Terminating.
-
conditions.lastTransitionTime (Time)
lastTransitionTime is the last time the condition transitioned from one status to another.
Time is a wrapper around time.Time which supports correct marshaling to YAML and JSON. Wrappers are provided for many of the factory methods that the time package offers.
-
conditions.message (string)
message is a human-readable message indicating details about last transition.
-
conditions.reason (string)
reason is a unique, one-word, CamelCase reason for the condition's last transition.
-
-
storedVersions ([]string)
storedVersions lists all versions of CustomResources that were ever persisted. Tracking these versions allows a migration path for stored versions in etcd. The field is mutable so a migration controller can finish a migration to another version (ensuring no old objects are left in storage), and then remove the rest of the versions from this list. Versions may not be removed from spec.versions while they exist in this list.
CustomResourceDefinitionList
CustomResourceDefinitionList is a list of CustomResourceDefinition objects.
-
items ([]CustomResourceDefinition), required
items list individual CustomResourceDefinition objects
-
apiVersion (string)
APIVersion defines the versioned schema of this representation of an object. Servers should convert recognized schemas to the latest internal value, and may reject unrecognized values. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources
-
kind (string)
Kind is a string value representing the REST resource this object represents. Servers may infer this from the endpoint the client submits requests to. Cannot be updated. In CamelCase. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds
-
metadata (ListMeta)
Standard object's metadata More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
Operations
get
read the specified CustomResourceDefinition
HTTP Request
GET /apis/apiextensions.k8s.io/v1/customresourcedefinitions/{name}
Parameters
-
name (in path): string, required
name of the CustomResourceDefinition
-
pretty (in query): string
Response
200 (CustomResourceDefinition): OK
401: Unauthorized
get
read status of the specified CustomResourceDefinition
HTTP Request
GET /apis/apiextensions.k8s.io/v1/customresourcedefinitions/{name}/status
Parameters
-
name (in path): string, required
name of the CustomResourceDefinition
-
pretty (in query): string
Response
200 (CustomResourceDefinition): OK
401: Unauthorized
list
list or watch objects of kind CustomResourceDefinition
HTTP Request
GET /apis/apiextensions.k8s.io/v1/customresourcedefinitions
Parameters
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (CustomResourceDefinitionList): OK
401: Unauthorized
create
create a CustomResourceDefinition
HTTP Request
POST /apis/apiextensions.k8s.io/v1/customresourcedefinitions
Parameters
-
body: CustomResourceDefinition, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (CustomResourceDefinition): OK
201 (CustomResourceDefinition): Created
202 (CustomResourceDefinition): Accepted
401: Unauthorized
update
replace the specified CustomResourceDefinition
HTTP Request
PUT /apis/apiextensions.k8s.io/v1/customresourcedefinitions/{name}
Parameters
-
name (in path): string, required
name of the CustomResourceDefinition
-
body: CustomResourceDefinition, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (CustomResourceDefinition): OK
201 (CustomResourceDefinition): Created
401: Unauthorized
update
replace status of the specified CustomResourceDefinition
HTTP Request
PUT /apis/apiextensions.k8s.io/v1/customresourcedefinitions/{name}/status
Parameters
-
name (in path): string, required
name of the CustomResourceDefinition
-
body: CustomResourceDefinition, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (CustomResourceDefinition): OK
201 (CustomResourceDefinition): Created
401: Unauthorized
patch
partially update the specified CustomResourceDefinition
HTTP Request
PATCH /apis/apiextensions.k8s.io/v1/customresourcedefinitions/{name}
Parameters
-
name (in path): string, required
name of the CustomResourceDefinition
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (CustomResourceDefinition): OK
201 (CustomResourceDefinition): Created
401: Unauthorized
patch
partially update status of the specified CustomResourceDefinition
HTTP Request
PATCH /apis/apiextensions.k8s.io/v1/customresourcedefinitions/{name}/status
Parameters
-
name (in path): string, required
name of the CustomResourceDefinition
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (CustomResourceDefinition): OK
201 (CustomResourceDefinition): Created
401: Unauthorized
delete
delete a CustomResourceDefinition
HTTP Request
DELETE /apis/apiextensions.k8s.io/v1/customresourcedefinitions/{name}
Parameters
-
name (in path): string, required
name of the CustomResourceDefinition
-
body: DeleteOptions
-
dryRun (in query): string
-
gracePeriodSeconds (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
Response
200 (Status): OK
202 (Status): Accepted
401: Unauthorized
deletecollection
delete collection of CustomResourceDefinition
HTTP Request
DELETE /apis/apiextensions.k8s.io/v1/customresourcedefinitions
Parameters
-
body: DeleteOptions
-
continue (in query): string
-
dryRun (in query): string
-
fieldSelector (in query): string
-
gracePeriodSeconds (in query): integer
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
Response
200 (Status): OK
401: Unauthorized
6.5.7.2 - MutatingWebhookConfiguration
apiVersion: admissionregistration.k8s.io/v1
import "k8s.io/api/admissionregistration/v1"
MutatingWebhookConfiguration
MutatingWebhookConfiguration describes the configuration of an admission webhook that accepts or rejects, and may change, the object.
-
apiVersion: admissionregistration.k8s.io/v1
-
kind: MutatingWebhookConfiguration
-
metadata (ObjectMeta)
Standard object metadata; More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata.
-
webhooks ([]MutatingWebhook)
Patch strategy: merge on key name
Webhooks is a list of webhooks and the affected resources and operations.
MutatingWebhook describes an admission webhook and the resources and operations it applies to.
-
webhooks.admissionReviewVersions ([]string), required
AdmissionReviewVersions is an ordered list of preferred AdmissionReview versions the Webhook expects. The API server will try to use the first version in the list which it supports. If none of the versions specified in this list are supported by the API server, validation will fail for this object. If a persisted webhook configuration specifies allowed versions and does not include any versions known to the API Server, calls to the webhook will fail and be subject to the failure policy.
-
webhooks.clientConfig (WebhookClientConfig), required
ClientConfig defines how to communicate with the hook. Required
WebhookClientConfig contains the information to make a TLS connection with the webhook
-
webhooks.clientConfig.caBundle ([]byte)
caBundle is a PEM encoded CA bundle which will be used to validate the webhook's server certificate. If unspecified, system trust roots on the apiserver are used.
-
webhooks.clientConfig.service (ServiceReference)
service is a reference to the service for this webhook. Either service or url must be specified.
If the webhook is running within the cluster, then you should use service.
ServiceReference holds a reference to Service.legacy.k8s.io
-
webhooks.clientConfig.service.name (string), required
name is the name of the service. Required
-
webhooks.clientConfig.service.namespace (string), required
namespace is the namespace of the service. Required
-
webhooks.clientConfig.service.path (string)
path is an optional URL path which will be sent in any request to this service.
-
webhooks.clientConfig.service.port (int32)
If specified, the port on the service that is hosting the webhook. Defaults to 443 for backward compatibility. port should be a valid port number (1-65535, inclusive).
-
-
webhooks.clientConfig.url (string)
url gives the location of the webhook, in standard URL form (scheme://host:port/path). Exactly one of url or service must be specified.
The host should not refer to a service running in the cluster; use the service field instead. The host might be resolved via external DNS in some apiservers (e.g., kube-apiserver cannot resolve in-cluster DNS as that would be a layering violation). host may also be an IP address.
Please note that using localhost or 127.0.0.1 as a host is risky unless you take great care to run this webhook on all hosts which run an apiserver which might need to make calls to this webhook. Such installs are likely to be non-portable, i.e., not easy to turn up in a new cluster.
The scheme must be "https"; the URL must begin with "https://".
A path is optional, and if present may be any string permissible in a URL. You may use the path to pass an arbitrary string to the webhook, for example, a cluster identifier.
Attempting to use a user or basic auth e.g. "user:password@" is not allowed. Fragments ("#...") and query parameters ("?...") are not allowed, either.
-
-
webhooks.name (string), required
The name of the admission webhook. Name should be fully qualified, e.g., imagepolicy.kubernetes.io, where "imagepolicy" is the name of the webhook, and kubernetes.io is the name of the organization. Required.
-
webhooks.sideEffects (string), required
SideEffects states whether this webhook has side effects. Acceptable values are: None, NoneOnDryRun (webhooks created via v1beta1 may also specify Some or Unknown). Webhooks with side effects MUST implement a reconciliation system, since a request may be rejected by a future step in the admission chain and the side effects therefore need to be undone. Requests with the dryRun attribute will be auto-rejected if they match a webhook with sideEffects == Unknown or Some.
-
webhooks.failurePolicy (string)
FailurePolicy defines how unrecognized errors from the admission endpoint are handled - allowed values are Ignore or Fail. Defaults to Fail.
-
webhooks.matchPolicy (string)
matchPolicy defines how the "rules" list is used to match incoming requests. Allowed values are "Exact" or "Equivalent".
-
Exact: match a request only if it exactly matches a specified rule. For example, if deployments can be modified via apps/v1, apps/v1beta1, and extensions/v1beta1, but "rules" only included apiGroups:["apps"], apiVersions:["v1"], resources: ["deployments"], a request to apps/v1beta1 or extensions/v1beta1 would not be sent to the webhook.
-
Equivalent: match a request if it modifies a resource listed in rules, even via another API group or version. For example, if deployments can be modified via apps/v1, apps/v1beta1, and extensions/v1beta1, and "rules" only included apiGroups:["apps"], apiVersions:["v1"], resources: ["deployments"], a request to apps/v1beta1 or extensions/v1beta1 would be converted to apps/v1 and sent to the webhook.
Defaults to "Equivalent".
-
-
webhooks.namespaceSelector (LabelSelector)
NamespaceSelector decides whether to run the webhook on an object based on whether the namespace for that object matches the selector. If the object itself is a namespace, the matching is performed on object.metadata.labels. If the object is another cluster scoped resource, it never skips the webhook.
For example, to run the webhook on any objects whose namespace is not associated with "runlevel" of "0" or "1", you would set the selector as follows: "namespaceSelector": { "matchExpressions": [ { "key": "runlevel", "operator": "NotIn", "values": [ "0", "1" ] } ] }
If instead you want to only run the webhook on any objects whose namespace is associated with the "environment" of "prod" or "staging", you would set the selector as follows: "namespaceSelector": { "matchExpressions": [ { "key": "environment", "operator": "In", "values": [ "prod", "staging" ] } ] }
See https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/ for more examples of label selectors. A YAML rendering of the first selector appears in the sample manifest after this field list.
Defaults to the empty LabelSelector, which matches everything.
-
webhooks.objectSelector (LabelSelector)
ObjectSelector decides whether to run the webhook based on whether the object has matching labels. objectSelector is evaluated against both the oldObject and newObject that would be sent to the webhook, and is considered to match if either object matches the selector. A null object (oldObject in the case of create, or newObject in the case of delete) or an object that cannot have labels (like a DeploymentRollback or a PodProxyOptions object) is not considered to match. Use the object selector only if the webhook is opt-in, because end users may skip the admission webhook by setting the labels. Defaults to the empty LabelSelector, which matches everything.
-
webhooks.reinvocationPolicy (string)
reinvocationPolicy indicates whether this webhook should be called multiple times as part of a single admission evaluation. Allowed values are "Never" and "IfNeeded".
Never: the webhook will not be called more than once in a single admission evaluation.
IfNeeded: the webhook will be called at least one additional time as part of the admission evaluation if the object being admitted is modified by other admission plugins after the initial webhook call. Webhooks that specify this option must be idempotent, able to process objects they previously admitted. Note: the number of additional invocations is not guaranteed to be exactly one; if additional invocations result in further modifications to the object, webhooks are not guaranteed to be invoked again; webhooks that use this option may be reordered to minimize the number of additional invocations; and to validate an object after all mutations are guaranteed complete, use a validating admission webhook instead.
Defaults to "Never".
-
webhooks.rules ([]RuleWithOperations)
Rules describes what operations on what resources/subresources the webhook cares about. The webhook cares about an operation if it matches any Rule. However, in order to prevent ValidatingAdmissionWebhooks and MutatingAdmissionWebhooks from putting the cluster in a state which cannot be recovered from without completely disabling the plugin, ValidatingAdmissionWebhooks and MutatingAdmissionWebhooks are never called on admission requests for ValidatingWebhookConfiguration and MutatingWebhookConfiguration objects.
RuleWithOperations is a tuple of Operations and Resources. It is recommended to make sure that all the tuple expansions are valid.
-
webhooks.rules.apiGroups ([]string)
APIGroups is the API groups the resources belong to. '*' is all groups. If '*' is present, the length of the slice must be one. Required.
-
webhooks.rules.apiVersions ([]string)
APIVersions is the API versions the resources belong to. '*' is all versions. If '*' is present, the length of the slice must be one. Required.
-
webhooks.rules.operations ([]string)
Operations is the operations the admission hook cares about - CREATE, UPDATE, DELETE, CONNECT or * for all of those operations and any future admission operations that are added. If '*' is present, the length of the slice must be one. Required.
-
webhooks.rules.resources ([]string)
Resources is a list of resources this rule applies to.
For example: 'pods' means pods. 'pods/log' means the log subresource of pods. '*' means all resources, but not subresources. 'pods/*' means all subresources of pods. '*/scale' means all scale subresources. '*/*' means all resources and their subresources.
If a wildcard is present, the validation rule will ensure resources do not overlap with each other.
Depending on the enclosing object, subresources might not be allowed. Required.
-
webhooks.rules.scope (string)
scope specifies the scope of this rule. Valid values are "Cluster", "Namespaced", and "*". "Cluster" means that only cluster-scoped resources will match this rule. Namespace API objects are cluster-scoped. "Namespaced" means that only namespaced resources will match this rule. "*" means that there are no scope restrictions. Subresources match the scope of their parent resource. Default is "*".
-
-
webhooks.timeoutSeconds (int32)
TimeoutSeconds specifies the timeout for this webhook. After the timeout passes, the webhook call will be ignored or the API call will fail based on the failure policy. The timeout value must be between 1 and 30 seconds. Defaults to 10 seconds.
-
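To tie the fields above together, here is a minimal sketch of a complete MutatingWebhookConfiguration manifest. The webhook name (pod-defaulter.example.com), the Service (pod-defaulter in the webhook-system namespace), and the /mutate path are hypothetical placeholders, not values defined by the API:

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: pod-defaulter.example.com
webhooks:
- name: pod-defaulter.example.com     # must be fully qualified
  admissionReviewVersions: ["v1"]
  sideEffects: None
  failurePolicy: Fail                 # the default: reject requests if the webhook errors
  matchPolicy: Equivalent             # the default: also match equivalent API versions
  reinvocationPolicy: IfNeeded        # the webhook must therefore be idempotent
  timeoutSeconds: 10
  clientConfig:
    service:                          # exactly one of service or url may be set
      namespace: webhook-system
      name: pod-defaulter
      path: /mutate
      port: 443
    # caBundle: <PEM-encoded CA bundle used to verify the webhook's serving certificate>
  rules:
  - apiGroups: [""]
    apiVersions: ["v1"]
    operations: ["CREATE"]
    resources: ["pods"]
    scope: "Namespaced"
  namespaceSelector:                  # skip namespaces labeled runlevel=0 or runlevel=1
    matchExpressions:
    - key: runlevel
      operator: NotIn
      values: ["0", "1"]

Because clientConfig.service is set here, url must be omitted; the API server connects to the named Service over HTTPS on the given port and path.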
MutatingWebhookConfigurationList
MutatingWebhookConfigurationList is a list of MutatingWebhookConfiguration.
-
apiVersion: admissionregistration.k8s.io/v1
-
kind: MutatingWebhookConfigurationList
-
metadata (ListMeta)
Standard list metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds
-
items ([]MutatingWebhookConfiguration), required
List of MutatingWebhookConfiguration.
Operations
get
read the specified MutatingWebhookConfiguration
HTTP Request
GET /apis/admissionregistration.k8s.io/v1/mutatingwebhookconfigurations/{name}
Parameters
-
name (in path): string, required
name of the MutatingWebhookConfiguration
-
pretty (in query): string
Response
200 (MutatingWebhookConfiguration): OK
401: Unauthorized
list
list or watch objects of kind MutatingWebhookConfiguration
HTTP Request
GET /apis/admissionregistration.k8s.io/v1/mutatingwebhookconfigurations
Parameters
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (MutatingWebhookConfigurationList): OK
401: Unauthorized
create
create a MutatingWebhookConfiguration
HTTP Request
POST /apis/admissionregistration.k8s.io/v1/mutatingwebhookconfigurations
Parameters
-
body: MutatingWebhookConfiguration, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (MutatingWebhookConfiguration): OK
201 (MutatingWebhookConfiguration): Created
202 (MutatingWebhookConfiguration): Accepted
401: Unauthorized
update
replace the specified MutatingWebhookConfiguration
HTTP Request
PUT /apis/admissionregistration.k8s.io/v1/mutatingwebhookconfigurations/{name}
Parameters
-
name (in path): string, required
name of the MutatingWebhookConfiguration
-
body: MutatingWebhookConfiguration, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (MutatingWebhookConfiguration): OK
201 (MutatingWebhookConfiguration): Created
401: Unauthorized
patch
partially update the specified MutatingWebhookConfiguration
HTTP Request
PATCH /apis/admissionregistration.k8s.io/v1/mutatingwebhookconfigurations/{name}
Parameters
-
name (in path): string, required
name of the MutatingWebhookConfiguration
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (MutatingWebhookConfiguration): OK
201 (MutatingWebhookConfiguration): Created
401: Unauthorized
delete
delete a MutatingWebhookConfiguration
HTTP Request
DELETE /apis/admissionregistration.k8s.io/v1/mutatingwebhookconfigurations/{name}
Parameters
-
name (in path): string, required
name of the MutatingWebhookConfiguration
-
body: DeleteOptions
-
dryRun (in query): string
-
gracePeriodSeconds (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
Response
200 (Status): OK
202 (Status): Accepted
401: Unauthorized
deletecollection
delete collection of MutatingWebhookConfiguration
HTTP Request
DELETE /apis/admissionregistration.k8s.io/v1/mutatingwebhookconfigurations
Parameters
-
body: DeleteOptions
-
continue (in query): string
-
dryRun (in query): string
-
fieldSelector (in query): string
-
gracePeriodSeconds (in query): integer
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
Response
200 (Status): OK
401: Unauthorized
6.5.7.3 - ValidatingWebhookConfiguration
apiVersion: admissionregistration.k8s.io/v1
import "k8s.io/api/admissionregistration/v1"
ValidatingWebhookConfiguration
ValidatingWebhookConfiguration describes the configuration of an admission webhook that accepts or rejects an object without changing it.
-
apiVersion: admissionregistration.k8s.io/v1
-
kind: ValidatingWebhookConfiguration
-
metadata (ObjectMeta)
Standard object metadata; More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata.
-
webhooks ([]ValidatingWebhook)
Patch strategy: merge on key name
Webhooks is a list of webhooks and the affected resources and operations.
ValidatingWebhook describes an admission webhook and the resources and operations it applies to.
-
webhooks.admissionReviewVersions ([]string), required
AdmissionReviewVersions is an ordered list of preferred
AdmissionReview
versions the Webhook expects. API server will try to use first version in the list which it supports. If none of the versions specified in this list supported by API server, validation will fail for this object. If a persisted webhook configuration specifies allowed versions and does not include any versions known to the API Server, calls to the webhook will fail and be subject to the failure policy. -
webhooks.clientConfig (WebhookClientConfig), required
ClientConfig defines how to communicate with the hook. Required
WebhookClientConfig contains the information to make a TLS connection with the webhook
-
webhooks.clientConfig.caBundle ([]byte)
caBundle is a PEM encoded CA bundle which will be used to validate the webhook's server certificate. If unspecified, system trust roots on the apiserver are used.
-
webhooks.clientConfig.service (ServiceReference)
service is a reference to the service for this webhook. Either service or url must be specified. If the webhook is running within the cluster, then you should use service.
ServiceReference holds a reference to Service.legacy.k8s.io
-
webhooks.clientConfig.service.name (string), required
name is the name of the service. Required.
-
webhooks.clientConfig.service.namespace (string), required
namespace is the namespace of the service. Required.
-
webhooks.clientConfig.service.path (string)
path is an optional URL path which will be sent in any request to this service.
-
webhooks.clientConfig.service.port (int32)
If specified, the port on the service that is hosting the webhook. Defaults to 443 for backward compatibility. port should be a valid port number (1-65535, inclusive).
-
-
webhooks.clientConfig.url (string)
url gives the location of the webhook, in standard URL form (scheme://host:port/path). Exactly one of url or service must be specified.
The host should not refer to a service running in the cluster; use the service field instead. The host might be resolved via external DNS in some apiservers (e.g., kube-apiserver cannot resolve in-cluster DNS as that would be a layering violation). host may also be an IP address.
Please note that using localhost or 127.0.0.1 as a host is risky unless you take great care to run this webhook on all hosts which run an apiserver which might need to make calls to this webhook. Such installs are likely to be non-portable, i.e., not easy to turn up in a new cluster.
The scheme must be "https"; the URL must begin with "https://".
A path is optional, and if present may be any string permissible in a URL. You may use the path to pass an arbitrary string to the webhook, for example, a cluster identifier.
Attempting to use a user or basic auth, e.g. "user:password@", is not allowed. Fragments ("#...") and query parameters ("?...") are not allowed, either.
-
-
webhooks.name (string), required
The name of the admission webhook. Name should be fully qualified, e.g., imagepolicy.kubernetes.io, where "imagepolicy" is the name of the webhook, and kubernetes.io is the name of the organization. Required.
-
webhooks.sideEffects (string), required
SideEffects states whether this webhook has side effects. Acceptable values are: None, NoneOnDryRun (webhooks created via v1beta1 may also specify Some or Unknown). Webhooks with side effects MUST implement a reconciliation system, since a request may be rejected by a future step in the admission chain and the side effects therefore need to be undone. Requests with the dryRun attribute will be auto-rejected if they match a webhook with sideEffects == Unknown or Some.
-
webhooks.failurePolicy (string)
FailurePolicy defines how unrecognized errors from the admission endpoint are handled - allowed values are Ignore or Fail. Defaults to Fail.
-
webhooks.matchPolicy (string)
matchPolicy defines how the "rules" list is used to match incoming requests. Allowed values are "Exact" or "Equivalent".
-
Exact: match a request only if it exactly matches a specified rule. For example, if deployments can be modified via apps/v1, apps/v1beta1, and extensions/v1beta1, but "rules" only included apiGroups:["apps"], apiVersions:["v1"], resources: ["deployments"], a request to apps/v1beta1 or extensions/v1beta1 would not be sent to the webhook.
-
Equivalent: match a request if it modifies a resource listed in rules, even via another API group or version. For example, if deployments can be modified via apps/v1, apps/v1beta1, and extensions/v1beta1, and "rules" only included apiGroups:["apps"], apiVersions:["v1"], resources: ["deployments"], a request to apps/v1beta1 or extensions/v1beta1 would be converted to apps/v1 and sent to the webhook.
Defaults to "Equivalent".
-
-
webhooks.namespaceSelector (LabelSelector)
NamespaceSelector decides whether to run the webhook on an object based on whether the namespace for that object matches the selector. If the object itself is a namespace, the matching is performed on object.metadata.labels. If the object is another cluster scoped resource, it never skips the webhook.
For example, to run the webhook on any objects whose namespace is not associated with "runlevel" of "0" or "1", you would set the selector as follows: "namespaceSelector": { "matchExpressions": [ { "key": "runlevel", "operator": "NotIn", "values": [ "0", "1" ] } ] }
If instead you want to only run the webhook on any objects whose namespace is associated with the "environment" of "prod" or "staging", you would set the selector as follows: "namespaceSelector": { "matchExpressions": [ { "key": "environment", "operator": "In", "values": [ "prod", "staging" ] } ] }
See https://kubernetes.io/docs/concepts/overview/working-with-objects/labels for more examples of label selectors.
Defaults to the empty LabelSelector, which matches everything.
-
webhooks.objectSelector (LabelSelector)
ObjectSelector decides whether to run the webhook based on whether the object has matching labels. objectSelector is evaluated against both the oldObject and newObject that would be sent to the webhook, and is considered to match if either object matches the selector. A null object (oldObject in the case of create, or newObject in the case of delete) or an object that cannot have labels (like a DeploymentRollback or a PodProxyOptions object) is not considered to match. Use the object selector only if the webhook is opt-in, because end users may skip the admission webhook by setting the labels. Defaults to the empty LabelSelector, which matches everything.
-
webhooks.rules ([]RuleWithOperations)
Rules describes what operations on what resources/subresources the webhook cares about. The webhook cares about an operation if it matches any Rule. However, in order to prevent ValidatingAdmissionWebhooks and MutatingAdmissionWebhooks from putting the cluster in a state which cannot be recovered from without completely disabling the plugin, ValidatingAdmissionWebhooks and MutatingAdmissionWebhooks are never called on admission requests for ValidatingWebhookConfiguration and MutatingWebhookConfiguration objects.
RuleWithOperations is a tuple of Operations and Resources. It is recommended to make sure that all the tuple expansions are valid.
-
webhooks.rules.apiGroups ([]string)
APIGroups is the API groups the resources belong to. '*' is all groups. If '*' is present, the length of the slice must be one. Required.
-
webhooks.rules.apiVersions ([]string)
APIVersions is the API versions the resources belong to. '*' is all versions. If '*' is present, the length of the slice must be one. Required.
-
webhooks.rules.operations ([]string)
Operations is the operations the admission hook cares about - CREATE, UPDATE, DELETE, CONNECT or * for all of those operations and any future admission operations that are added. If '*' is present, the length of the slice must be one. Required.
-
webhooks.rules.resources ([]string)
Resources is a list of resources this rule applies to.
For example: 'pods' means pods. 'pods/log' means the log subresource of pods. '*' means all resources, but not subresources. 'pods/*' means all subresources of pods. '*/scale' means all scale subresources. '*/*' means all resources and their subresources.
If a wildcard is present, the validation rule will ensure resources do not overlap with each other.
Depending on the enclosing object, subresources might not be allowed. Required.
-
webhooks.rules.scope (string)
scope specifies the scope of this rule. Valid values are "Cluster", "Namespaced", and "*". "Cluster" means that only cluster-scoped resources will match this rule. Namespace API objects are cluster-scoped. "Namespaced" means that only namespaced resources will match this rule. "*" means that there are no scope restrictions. Subresources match the scope of their parent resource. Default is "*".
-
-
webhooks.timeoutSeconds (int32)
TimeoutSeconds specifies the timeout for this webhook. After the timeout passes, the webhook call will be ignored or the API call will fail based on the failure policy. The timeout value must be between 1 and 30 seconds. Defaults to 10 seconds.
-
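As with the mutating variant, a short sketch may help. This one assumes an out-of-cluster webhook reachable at the hypothetical URL https://webhook.example.com:8443/validate; recall that the scheme must be "https" and that url and service are mutually exclusive:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: deployment-policy.example.com
webhooks:
- name: deployment-policy.example.com
  admissionReviewVersions: ["v1"]
  sideEffects: None
  failurePolicy: Ignore               # let requests through if the webhook is unreachable
  timeoutSeconds: 5
  clientConfig:
    url: https://webhook.example.com:8443/validate
    # caBundle: <PEM-encoded CA bundle>; if unspecified, the apiserver's system trust roots are used
  rules:
  - apiGroups: ["apps"]
    apiVersions: ["v1"]
    operations: ["CREATE", "UPDATE"]
    resources: ["deployments", "deployments/scale"]   # a resource plus one of its subresources
    scope: "Namespaced"

Using resources: ["*/*"] instead would match every resource and subresource in the listed groups, at the cost of invoking the webhook far more often.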
ValidatingWebhookConfigurationList
ValidatingWebhookConfigurationList is a list of ValidatingWebhookConfiguration.
-
apiVersion: admissionregistration.k8s.io/v1
-
kind: ValidatingWebhookConfigurationList
-
metadata (ListMeta)
Standard list metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds
-
items ([]ValidatingWebhookConfiguration), required
List of ValidatingWebhookConfiguration.
Operations
get
read the specified ValidatingWebhookConfiguration
HTTP Request
GET /apis/admissionregistration.k8s.io/v1/validatingwebhookconfigurations/{name}
Parameters
-
name (in path): string, required
name of the ValidatingWebhookConfiguration
-
pretty (in query): string
Response
200 (ValidatingWebhookConfiguration): OK
401: Unauthorized
list
list or watch objects of kind ValidatingWebhookConfiguration
HTTP Request
GET /apis/admissionregistration.k8s.io/v1/validatingwebhookconfigurations
Parameters
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (ValidatingWebhookConfigurationList): OK
401: Unauthorized
create
create a ValidatingWebhookConfiguration
HTTP Request
POST /apis/admissionregistration.k8s.io/v1/validatingwebhookconfigurations
Parameters
-
body: ValidatingWebhookConfiguration, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (ValidatingWebhookConfiguration): OK
201 (ValidatingWebhookConfiguration): Created
202 (ValidatingWebhookConfiguration): Accepted
401: Unauthorized
update
replace the specified ValidatingWebhookConfiguration
HTTP Request
PUT /apis/admissionregistration.k8s.io/v1/validatingwebhookconfigurations/{name}
Parameters
-
name (in path): string, required
name of the ValidatingWebhookConfiguration
-
body: ValidatingWebhookConfiguration, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (ValidatingWebhookConfiguration): OK
201 (ValidatingWebhookConfiguration): Created
401: Unauthorized
patch
partially update the specified ValidatingWebhookConfiguration
HTTP Request
PATCH /apis/admissionregistration.k8s.io/v1/validatingwebhookconfigurations/{name}
Parameters
-
name (in path): string, required
name of the ValidatingWebhookConfiguration
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (ValidatingWebhookConfiguration): OK
201 (ValidatingWebhookConfiguration): Created
401: Unauthorized
delete
delete a ValidatingWebhookConfiguration
HTTP Request
DELETE /apis/admissionregistration.k8s.io/v1/validatingwebhookconfigurations/{name}
Parameters
-
name (in path): string, required
name of the ValidatingWebhookConfiguration
-
body: DeleteOptions
-
dryRun (in query): string
-
gracePeriodSeconds (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
Response
200 (Status): OK
202 (Status): Accepted
401: Unauthorized
deletecollection
delete collection of ValidatingWebhookConfiguration
HTTP Request
DELETE /apis/admissionregistration.k8s.io/v1/validatingwebhookconfigurations
Parameters
-
body: DeleteOptions
-
continue (in query): string
-
dryRun (in query): string
-
fieldSelector (in query): string
-
gracePeriodSeconds (in query): integer
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
Response
200 (Status): OK
401: Unauthorized
6.5.8 - Cluster Resources
6.5.8.1 - Node
apiVersion: v1
import "k8s.io/api/core/v1"
Node
Node is a worker node in Kubernetes. Each node will have a unique identifier in the cache (i.e. in etcd).
-
apiVersion: v1
-
kind: Node
-
metadata (ObjectMeta)
Standard object's metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
spec (NodeSpec)
Spec defines the behavior of a node. https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#spec-and-status
-
status (NodeStatus)
Most recently observed status of the node. Populated by the system. Read-only. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#spec-and-status
NodeSpec
NodeSpec describes the attributes that a node is created with.
-
configSource (NodeConfigSource)
Deprecated. If specified, the source of the node's configuration. The DynamicKubeletConfig feature gate must be enabled for the Kubelet to use this field. This field is deprecated as of 1.22: https://git.k8s.io/enhancements/keps/sig-node/281-dynamic-kubelet-configuration
NodeConfigSource specifies a source of node configuration. Exactly one subfield (excluding metadata) must be non-nil. This API is deprecated since 1.22
-
configSource.configMap (ConfigMapNodeConfigSource)
ConfigMap is a reference to a Node's ConfigMap
ConfigMapNodeConfigSource contains the information to reference a ConfigMap as a config source for the Node. This API is deprecated since 1.22: https://git.k8s.io/enhancements/keps/sig-node/281-dynamic-kubelet-configuration
-
configSource.configMap.kubeletConfigKey (string), required
KubeletConfigKey declares which key of the referenced ConfigMap corresponds to the KubeletConfiguration structure. This field is required in all cases.
-
configSource.configMap.name (string), required
Name is the metadata.name of the referenced ConfigMap. This field is required in all cases.
-
configSource.configMap.namespace (string), required
Namespace is the metadata.namespace of the referenced ConfigMap. This field is required in all cases.
-
configSource.configMap.resourceVersion (string)
ResourceVersion is the metadata.ResourceVersion of the referenced ConfigMap. This field is forbidden in Node.Spec, and required in Node.Status.
-
configSource.configMap.uid (string)
UID is the metadata.UID of the referenced ConfigMap. This field is forbidden in Node.Spec, and required in Node.Status.
-
-
-
externalID (string)
Deprecated. Not all kubelets will set this field. Remove field after 1.13. See: https://issues.k8s.io/61966
-
podCIDR (string)
PodCIDR represents the pod IP range assigned to the node.
-
podCIDRs ([]string)
podCIDRs represents the IP ranges assigned to the node for usage by Pods on that node. If this field is specified, the 0th entry must match the podCIDR field. It may contain at most 1 value for each of IPv4 and IPv6.
-
providerID (string)
ID of the node assigned by the cloud provider in the format: <ProviderName>://<ProviderSpecificNodeID>
-
taints ([]Taint)
If specified, the node's taints.
The node this Taint is attached to has the "effect" on any pod that does not tolerate the Taint.
-
taints.effect (string), required
Required. The effect of the taint on pods that do not tolerate the taint. Valid effects are NoSchedule, PreferNoSchedule and NoExecute.
Possible enum values:
"NoExecute"
Evict any already-running pods that do not tolerate the taint. Currently enforced by NodeController.
"NoSchedule"
Do not allow new pods to schedule onto the node unless they tolerate the taint, but allow all pods submitted to Kubelet without going through the scheduler to start, and allow all already-running pods to continue running. Enforced by the scheduler.
"PreferNoSchedule"
Like TaintEffectNoSchedule, but the scheduler tries not to schedule new pods onto the node, rather than prohibiting new pods from scheduling onto the node entirely. Enforced by the scheduler.
-
taints.key (string), required
Required. The taint key to be applied to a node.
-
taints.timeAdded (Time)
TimeAdded represents the time at which the taint was added. It is only written for NoExecute taints.
Time is a wrapper around time.Time which supports correct marshaling to YAML and JSON. Wrappers are provided for many of the factory methods that the time package offers.
-
taints.value (string)
The taint value corresponding to the taint key.
-
-
unschedulable (boolean)
Unschedulable controls node schedulability of new pods. By default, node is schedulable. More info: https://kubernetes.io/docs/concepts/nodes/node/#manual-node-administration
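Nodes are normally registered by the kubelet or a cloud controller rather than written by hand, but a manifest form of the spec fields above may still clarify their shape; every value below (node name, CIDR, provider ID, taint) is illustrative only:

apiVersion: v1
kind: Node
metadata:
  name: worker-1
spec:
  podCIDR: 10.244.1.0/24
  podCIDRs:                    # the 0th entry must match podCIDR; at most one value per IP family
  - 10.244.1.0/24
  providerID: example://us-east-1a/i-0123456789abcdef   # <ProviderName>://<ProviderSpecificNodeID>
  unschedulable: false
  taints:
  - key: dedicated             # pods need a matching toleration to be scheduled here
    value: database
    effect: NoSchedule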
NodeStatus
NodeStatus is information about the current status of a node.
-
addresses ([]NodeAddress)
Patch strategy: merge on key type
List of addresses reachable to the node. Queried from cloud provider, if available. More info: https://kubernetes.io/docs/concepts/nodes/node/#addresses Note: This field is declared as mergeable, but the merge key is not sufficiently unique, which can cause data corruption when it is merged. Callers should instead use a full-replacement patch. See http://pr.k8s.io/79391 for an example.
NodeAddress contains information for the node's address.
-
addresses.address (string), required
The node address.
-
addresses.type (string), required
Node address type, one of Hostname, ExternalIP or InternalIP.
Possible enum values:
"ExternalDNS"
identifies a DNS name which resolves to an IP address which has the characteristics of a NodeExternalIP. The IP it resolves to may or may not be a listed NodeExternalIP address.
"ExternalIP"
identifies an IP address which is, in some way, intended to be more usable from outside the cluster than an internal IP, though no specific semantics are defined. It may be a globally routable IP, though it is not required to be. External IPs may be assigned directly to an interface on the node, like a NodeInternalIP, or alternatively, packets sent to the external IP may be NAT'ed to an internal node IP rather than being delivered directly (making the IP less efficient for node-to-node traffic than a NodeInternalIP).
"Hostname"
identifies a name of the node. Although every node can be assumed to have a NodeAddress of this type, its exact syntax and semantics are not defined, and are not consistent between different clusters.
"InternalDNS"
identifies a DNS name which resolves to an IP address which has the characteristics of a NodeInternalIP. The IP it resolves to may or may not be a listed NodeInternalIP address.
"InternalIP"
identifies an IP address which is assigned to one of the node's network interfaces. Every node should have at least one address of this type. An internal IP is normally expected to be reachable from every other node, but may not be visible to hosts outside the cluster. By default it is assumed that kube-apiserver can reach node internal IPs, though it is possible to configure clusters where this is not the case. NodeInternalIP is the default type of node IP, and does not necessarily imply that the IP is ONLY reachable internally. If a node has multiple internal IPs, no specific semantics are assigned to the additional IPs.
-
-
allocatable (map[string]Quantity)
Allocatable represents the resources of a node that are available for scheduling. Defaults to Capacity.
-
capacity (map[string]Quantity)
Capacity represents the total resources of a node. More info: https://kubernetes.io/docs/concepts/storage/persistent-volumes#capacity
-
conditions ([]NodeCondition)
Patch strategy: merge on key type
Conditions is an array of current observed node conditions. More info: https://kubernetes.io/docs/concepts/nodes/node/#condition
NodeCondition contains condition information for a node.
-
conditions.status (string), required
Status of the condition, one of True, False, Unknown.
-
conditions.type (string), required
Type of node condition.
Possible enum values:
"DiskPressure"
means the kubelet is under pressure due to insufficient available disk.
"MemoryPressure"
means the kubelet is under pressure due to insufficient available memory.
"NetworkUnavailable"
means that the network for the node is not correctly configured.
"PIDPressure"
means the kubelet is under pressure due to insufficient available PIDs.
"Ready"
means the kubelet is healthy and ready to accept pods.
-
conditions.lastHeartbeatTime (Time)
Last time we got an update on a given condition.
Time is a wrapper around time.Time which supports correct marshaling to YAML and JSON. Wrappers are provided for many of the factory methods that the time package offers.
-
conditions.lastTransitionTime (Time)
Last time the condition transitioned from one status to another.
Time is a wrapper around time.Time which supports correct marshaling to YAML and JSON. Wrappers are provided for many of the factory methods that the time package offers.
-
conditions.message (string)
Human readable message indicating details about last transition.
-
conditions.reason (string)
(brief) reason for the condition's last transition.
-
-
config (NodeConfigStatus)
Status of the config assigned to the node via the dynamic Kubelet config feature.
NodeConfigStatus describes the status of the config assigned by Node.Spec.ConfigSource.
-
config.active (NodeConfigSource)
Active reports the checkpointed config the node is actively using. Active will represent either the current version of the Assigned config, or the current LastKnownGood config, depending on whether attempting to use the Assigned config results in an error.
NodeConfigSource specifies a source of node configuration. Exactly one subfield (excluding metadata) must be non-nil. This API is deprecated since 1.22
-
config.active.configMap (ConfigMapNodeConfigSource)
ConfigMap is a reference to a Node's ConfigMap
ConfigMapNodeConfigSource contains the information to reference a ConfigMap as a config source for the Node. This API is deprecated since 1.22: https://git.k8s.io/enhancements/keps/sig-node/281-dynamic-kubelet-configuration
-
config.active.configMap.kubeletConfigKey (string), required
KubeletConfigKey declares which key of the referenced ConfigMap corresponds to the KubeletConfiguration structure. This field is required in all cases.
-
config.active.configMap.name (string), required
Name is the metadata.name of the referenced ConfigMap. This field is required in all cases.
-
config.active.configMap.namespace (string), required
Namespace is the metadata.namespace of the referenced ConfigMap. This field is required in all cases.
-
config.active.configMap.resourceVersion (string)
ResourceVersion is the metadata.ResourceVersion of the referenced ConfigMap. This field is forbidden in Node.Spec, and required in Node.Status.
-
config.active.configMap.uid (string)
UID is the metadata.UID of the referenced ConfigMap. This field is forbidden in Node.Spec, and required in Node.Status.
-
-
-
config.assigned (NodeConfigSource)
Assigned reports the checkpointed config the node will try to use. When Node.Spec.ConfigSource is updated, the node checkpoints the associated config payload to local disk, along with a record indicating intended config. The node refers to this record to choose its config checkpoint, and reports this record in Assigned. Assigned only updates in the status after the record has been checkpointed to disk. When the Kubelet is restarted, it tries to make the Assigned config the Active config by loading and validating the checkpointed payload identified by Assigned.
NodeConfigSource specifies a source of node configuration. Exactly one subfield (excluding metadata) must be non-nil. This API is deprecated since 1.22
-
config.assigned.configMap (ConfigMapNodeConfigSource)
ConfigMap is a reference to a Node's ConfigMap
ConfigMapNodeConfigSource contains the information to reference a ConfigMap as a config source for the Node. This API is deprecated since 1.22: https://git.k8s.io/enhancements/keps/sig-node/281-dynamic-kubelet-configuration
-
config.assigned.configMap.kubeletConfigKey (string), required
KubeletConfigKey declares which key of the referenced ConfigMap corresponds to the KubeletConfiguration structure. This field is required in all cases.
-
config.assigned.configMap.name (string), required
Name is the metadata.name of the referenced ConfigMap. This field is required in all cases.
-
config.assigned.configMap.namespace (string), required
Namespace is the metadata.namespace of the referenced ConfigMap. This field is required in all cases.
-
config.assigned.configMap.resourceVersion (string)
ResourceVersion is the metadata.ResourceVersion of the referenced ConfigMap. This field is forbidden in Node.Spec, and required in Node.Status.
-
config.assigned.configMap.uid (string)
UID is the metadata.UID of the referenced ConfigMap. This field is forbidden in Node.Spec, and required in Node.Status.
-
-
-
config.error (string)
Error describes any problems reconciling the Spec.ConfigSource to the Active config. Errors may occur, for example, attempting to checkpoint Spec.ConfigSource to the local Assigned record, attempting to checkpoint the payload associated with Spec.ConfigSource, attempting to load or validate the Assigned config, etc. Errors may occur at different points while syncing config. Earlier errors (e.g. download or checkpointing errors) will not result in a rollback to LastKnownGood, and may resolve across Kubelet retries. Later errors (e.g. loading or validating a checkpointed config) will result in a rollback to LastKnownGood. In the latter case, it is usually possible to resolve the error by fixing the config assigned in Spec.ConfigSource. You can find additional information for debugging by searching the error message in the Kubelet log. Error is a human-readable description of the error state; machines can check whether or not Error is empty, but should not rely on the stability of the Error text across Kubelet versions.
-
config.lastKnownGood (NodeConfigSource)
LastKnownGood reports the checkpointed config the node will fall back to when it encounters an error attempting to use the Assigned config. The Assigned config becomes the LastKnownGood config when the node determines that the Assigned config is stable and correct. This is currently implemented as a 10-minute soak period starting when the local record of Assigned config is updated. If the Assigned config is Active at the end of this period, it becomes the LastKnownGood. Note that if Spec.ConfigSource is reset to nil (use local defaults), the LastKnownGood is also immediately reset to nil, because the local default config is always assumed good. You should not make assumptions about the node's method of determining config stability and correctness, as this may change or become configurable in the future.
NodeConfigSource specifies a source of node configuration. Exactly one subfield (excluding metadata) must be non-nil. This API is deprecated since 1.22
-
config.lastKnownGood.configMap (ConfigMapNodeConfigSource)
ConfigMap is a reference to a Node's ConfigMap
ConfigMapNodeConfigSource contains the information to reference a ConfigMap as a config source for the Node. This API is deprecated since 1.22: https://git.k8s.io/enhancements/keps/sig-node/281-dynamic-kubelet-configuration
-
config.lastKnownGood.configMap.kubeletConfigKey (string), required
KubeletConfigKey declares which key of the referenced ConfigMap corresponds to the KubeletConfiguration structure. This field is required in all cases.
-
config.lastKnownGood.configMap.name (string), required
Name is the metadata.name of the referenced ConfigMap. This field is required in all cases.
-
config.lastKnownGood.configMap.namespace (string), required
Namespace is the metadata.namespace of the referenced ConfigMap. This field is required in all cases.
-
config.lastKnownGood.configMap.resourceVersion (string)
ResourceVersion is the metadata.ResourceVersion of the referenced ConfigMap. This field is forbidden in Node.Spec, and required in Node.Status.
-
config.lastKnownGood.configMap.uid (string)
UID is the metadata.UID of the referenced ConfigMap. This field is forbidden in Node.Spec, and required in Node.Status.
-
-
-
-
daemonEndpoints (NodeDaemonEndpoints)
Endpoints of daemons running on the Node.
NodeDaemonEndpoints lists ports opened by daemons running on the Node.
-
images ([]ContainerImage)
List of container images on this node
-
images.names ([]string)
Names by which this image is known, e.g. ["k8s.gcr.io/hyperkube:v1.0.7", "dockerhub.io/google_containers/hyperkube:v1.0.7"]
-
images.sizeBytes (int64)
The size of the image in bytes.
-
-
nodeInfo (NodeSystemInfo)
Set of ids/uuids to uniquely identify the node. More info: https://kubernetes.io/docs/concepts/nodes/node/#info
NodeSystemInfo is a set of ids/uuids to uniquely identify the node.
-
nodeInfo.architecture (string), required
The Architecture reported by the node
-
nodeInfo.bootID (string), required
Boot ID reported by the node.
-
nodeInfo.containerRuntimeVersion (string), required
ContainerRuntime Version reported by the node through runtime remote API (e.g. docker://1.5.0).
-
nodeInfo.kernelVersion (string), required
Kernel Version reported by the node from 'uname -r' (e.g. 3.16.0-0.bpo.4-amd64).
-
nodeInfo.kubeProxyVersion (string), required
KubeProxy Version reported by the node.
-
nodeInfo.kubeletVersion (string), required
Kubelet Version reported by the node.
-
nodeInfo.machineID (string), required
MachineID reported by the node. For unique machine identification in the cluster this field is preferred. Learn more from man(5) machine-id: http://man7.org/linux/man-pages/man5/machine-id.5.html
-
nodeInfo.operatingSystem (string), required
The Operating System reported by the node
-
nodeInfo.osImage (string), required
OS Image reported by the node from /etc/os-release (e.g. Debian GNU/Linux 7 (wheezy)).
-
nodeInfo.systemUUID (string), required
SystemUUID reported by the node. For unique machine identification MachineID is preferred. This field is specific to Red Hat hosts https://access.redhat.com/documentation/en-us/red_hat_subscription_management/1/html/rhsm/uuid
-
-
phase (string)
NodePhase is the recently observed lifecycle phase of the node. More info: https://kubernetes.io/docs/concepts/nodes/node/#phase This field is never populated, and is now deprecated.
Possible enum values:
"Pending"
means the node has been created/added by the system, but not configured.
"Running"
means the node has been configured and has Kubernetes components running.
"Terminated"
means the node has been removed from the cluster.
-
volumesAttached ([]AttachedVolume)
List of volumes that are attached to the node.
AttachedVolume describes a volume attached to a node
-
volumesAttached.devicePath (string), required
DevicePath represents the device path where the volume should be available
-
volumesAttached.name (string), required
Name of the attached volume
-
-
volumesInUse ([]string)
List of attachable volumes in use (mounted) by the node.
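NodeStatus is populated by the kubelet and controllers rather than by users, but an abbreviated, illustrative status excerpt (all values hypothetical) shows how several of the fields above fit together:

status:
  addresses:
  - type: InternalIP           # every node should have at least one InternalIP
    address: 10.0.0.4
  - type: Hostname
    address: worker-1
  capacity:                    # total resources on the node
    cpu: "4"
    memory: 16Gi
    pods: "110"
  allocatable:                 # the portion of capacity available for scheduling
    cpu: 3500m
    memory: 14Gi
    pods: "110"
  conditions:
  - type: Ready
    status: "True"
    reason: KubeletReady
    message: kubelet is posting ready status
  nodeInfo:
    architecture: amd64
    operatingSystem: linux
    kubeletVersion: v1.23.0
    containerRuntimeVersion: containerd://1.5.9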
NodeList
NodeList is the whole list of all Nodes which have been registered with the master.
-
apiVersion: v1
-
kind: NodeList
-
metadata (ListMeta)
Standard list metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds
-
items ([]Node), required
List of nodes
Operations
get
read the specified Node
HTTP Request
GET /api/v1/nodes/{name}
Parameters
-
name (in path): string, required
name of the Node
-
pretty (in query): string
Response
200 (Node): OK
401: Unauthorized
get
read status of the specified Node
HTTP Request
GET /api/v1/nodes/{name}/status
Parameters
-
name (in path): string, required
name of the Node
-
pretty (in query): string
Response
200 (Node): OK
401: Unauthorized
list
list or watch objects of kind Node
HTTP Request
GET /api/v1/nodes
Parameters
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (NodeList): OK
401: Unauthorized
create
create a Node
HTTP Request
POST /api/v1/nodes
Parameters
-
body: Node, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (Node): OK
201 (Node): Created
202 (Node): Accepted
401: Unauthorized
update
replace the specified Node
HTTP Request
PUT /api/v1/nodes/{name}
Parameters
-
name (in path): string, required
name of the Node
-
body: Node, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (Node): OK
201 (Node): Created
401: Unauthorized
update
replace status of the specified Node
HTTP Request
PUT /api/v1/nodes/{name}/status
Parameters
-
name (in path): string, required
name of the Node
-
body: Node, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (Node): OK
201 (Node): Created
401: Unauthorized
patch
partially update the specified Node
HTTP Request
PATCH /api/v1/nodes/{name}
Parameters
-
name (in path): string, required
name of the Node
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (Node): OK
201 (Node): Created
401: Unauthorized
patch
partially update status of the specified Node
HTTP Request
PATCH /api/v1/nodes/{name}/status
Parameters
-
name (in path): string, required
name of the Node
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (Node): OK
201 (Node): Created
401: Unauthorized
delete
delete a Node
HTTP Request
DELETE /api/v1/nodes/{name}
Parameters
-
name (in path): string, required
name of the Node
-
body: DeleteOptions
-
dryRun (in query): string
-
gracePeriodSeconds (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
Response
200 (Status): OK
202 (Status): Accepted
401: Unauthorized
deletecollection
delete collection of Node
HTTP Request
DELETE /api/v1/nodes
Parameters
-
body: DeleteOptions
-
continue (in query): string
-
dryRun (in query): string
-
fieldSelector (in query): string
-
gracePeriodSeconds (in query): integer
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
Response
200 (Status): OK
401: Unauthorized
6.5.8.2 - Namespace
apiVersion: v1
import "k8s.io/api/core/v1"
Namespace
Namespace provides a scope for Names. Use of multiple namespaces is optional.
-
apiVersion: v1
-
kind: Namespace
-
metadata (ObjectMeta)
Standard object's metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
spec (NamespaceSpec)
Spec defines the behavior of the Namespace. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#spec-and-status
-
status (NamespaceStatus)
Status describes the current status of a Namespace. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#spec-and-status
NamespaceSpec
NamespaceSpec describes the attributes on a Namespace.
-
finalizers ([]string)
Finalizers is an opaque list of values that must be empty to permanently remove the object from storage. More info: https://kubernetes.io/docs/tasks/administer-cluster/namespaces/
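A sketch of a Namespace manifest with a hypothetical name and label (labels such as environment are what the webhook namespaceSelector examples earlier in this reference match against); spec.finalizers is shown for completeness, but it is ordinarily managed by the system:

apiVersion: v1
kind: Namespace
metadata:
  name: team-a
  labels:
    environment: prod
spec:
  finalizers:
  - kubernetes        # the namespace cannot be removed from storage until this list is empty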
NamespaceStatus
NamespaceStatus is information about the current status of a Namespace.
-
conditions ([]NamespaceCondition)
Patch strategy: merge on key type
Represents the latest available observations of a namespace's current state.
NamespaceCondition contains details about state of namespace.
-
conditions.status (string), required
Status of the condition, one of True, False, Unknown.
-
conditions.type (string), required
Type of namespace controller condition.
Possible enum values:
"NamespaceContentRemaining"
contains information about resources remaining in a namespace.
"NamespaceDeletionContentFailure"
contains information about namespace deleter errors during deletion of resources.
"NamespaceDeletionDiscoveryFailure"
contains information about namespace deleter errors during resource discovery.
"NamespaceDeletionGroupVersionParsingFailure"
contains information about namespace deleter errors parsing GV for legacy types.
"NamespaceFinalizersRemaining"
contains information about which finalizers are on resources remaining in a namespace.
-
conditions.lastTransitionTime (Time)
Time is a wrapper around time.Time which supports correct marshaling to YAML and JSON. Wrappers are provided for many of the factory methods that the time package offers.
-
conditions.message (string)
-
conditions.reason (string)
-
-
phase (string)
Phase is the current lifecycle phase of the namespace. More info: https://kubernetes.io/docs/tasks/administer-cluster/namespaces/
Possible enum values:
"Active"
means the namespace is available for use in the system.
"Terminating"
means the namespace is undergoing graceful termination.
NamespaceList
NamespaceList is a list of Namespaces.
-
apiVersion: v1
-
kind: NamespaceList
-
metadata (ListMeta)
Standard list metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds
-
items ([]Namespace), required
Items is the list of Namespace objects in the list. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/
Operations
get
read the specified Namespace
HTTP Request
GET /api/v1/namespaces/{name}
Parameters
-
name (in path): string, required
name of the Namespace
-
pretty (in query): string
Response
200 (Namespace): OK
401: Unauthorized
get
read status of the specified Namespace
HTTP Request
GET /api/v1/namespaces/{name}/status
Parameters
-
name (in path): string, required
name of the Namespace
-
pretty (in query): string
Response
200 (Namespace): OK
401: Unauthorized
list
list or watch objects of kind Namespace
HTTP Request
GET /api/v1/namespaces
Parameters
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (NamespaceList): OK
401: Unauthorized
create
create a Namespace
HTTP Request
POST /api/v1/namespaces
Parameters
-
body: Namespace, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (Namespace): OK
201 (Namespace): Created
202 (Namespace): Accepted
401: Unauthorized
update
replace the specified Namespace
HTTP Request
PUT /api/v1/namespaces/{name}
Parameters
-
name (in path): string, required
name of the Namespace
-
body: Namespace, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (Namespace): OK
201 (Namespace): Created
401: Unauthorized
update
replace finalize of the specified Namespace
HTTP Request
PUT /api/v1/namespaces/{name}/finalize
Parameters
-
name (in path): string, required
name of the Namespace
-
body: Namespace, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (Namespace): OK
201 (Namespace): Created
401: Unauthorized
update
replace status of the specified Namespace
HTTP Request
PUT /api/v1/namespaces/{name}/status
Parameters
-
name (in path): string, required
name of the Namespace
-
body: Namespace, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (Namespace): OK
201 (Namespace): Created
401: Unauthorized
patch
partially update the specified Namespace
HTTP Request
PATCH /api/v1/namespaces/{name}
Parameters
-
name (in path): string, required
name of the Namespace
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (Namespace): OK
201 (Namespace): Created
401: Unauthorized
patch
partially update status of the specified Namespace
HTTP Request
PATCH /api/v1/namespaces/{name}/status
Parameters
-
name (in path): string, required
name of the Namespace
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (Namespace): OK
201 (Namespace): Created
401: Unauthorized
delete
delete a Namespace
HTTP Request
DELETE /api/v1/namespaces/{name}
Parameters
-
name (in path): string, required
name of the Namespace
-
body: DeleteOptions
-
dryRun (in query): string
-
gracePeriodSeconds (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
Response
200 (Status): OK
202 (Status): Accepted
401: Unauthorized
6.5.8.3 - Event
apiVersion: events.k8s.io/v1
import "k8s.io/api/events/v1"
Event
Event is a report of an event somewhere in the cluster. It generally denotes some state change in the system. Events have a limited retention time and triggers and messages may evolve with time. Event consumers should not rely on the timing of an event with a given Reason reflecting a consistent underlying trigger, or the continued existence of events with that Reason. Events should be treated as informative, best-effort, supplemental data.
-
apiVersion: events.k8s.io/v1
-
kind: Event
-
metadata (ObjectMeta)
Standard object's metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
eventTime (MicroTime), required
eventTime is the time when this Event was first observed. It is required.
MicroTime is a version of Time with microsecond-level precision.
-
action (string)
action describes what action was taken, or failed, with respect to the regarding object. It is machine-readable. This field cannot be empty for new Events and it can have at most 128 characters.
-
deprecatedCount (int32)
deprecatedCount is the deprecated field assuring backward compatibility with core.v1 Event type.
-
deprecatedFirstTimestamp (Time)
deprecatedFirstTimestamp is the deprecated field assuring backward compatibility with core.v1 Event type.
Time is a wrapper around time.Time which supports correct marshaling to YAML and JSON. Wrappers are provided for many of the factory methods that the time package offers.
-
deprecatedLastTimestamp (Time)
deprecatedLastTimestamp is the deprecated field assuring backward compatibility with core.v1 Event type.
Time is a wrapper around time.Time which supports correct marshaling to YAML and JSON. Wrappers are provided for many of the factory methods that the time package offers.
-
deprecatedSource (EventSource)
deprecatedSource is the deprecated field assuring backward compatibility with core.v1 Event type.
EventSource contains information for an event.
-
deprecatedSource.component (string)
Component from which the event is generated.
-
deprecatedSource.host (string)
Node name on which the event is generated.
-
-
note (string)
note is a human-readable description of the status of this operation. Maximal length of the note is 1kB, but libraries should be prepared to handle values up to 64kB.
-
reason (string)
reason is why the action was taken. It is human-readable. This field cannot be empty for new Events and it can have at most 128 characters.
-
regarding (ObjectReference)
regarding contains the object this Event is about. In most cases it's the object that the reporting controller implements; e.g., the ReplicaSetController implements ReplicaSets, and this event is emitted because the controller acts on some changes in a ReplicaSet object.
-
related (ObjectReference)
related is the optional secondary object for more complex actions, e.g. when the regarding object triggers a creation or deletion of the related object.
-
reportingController (string)
reportingController is the name of the controller that emitted this Event, e.g. kubernetes.io/kubelet. This field cannot be empty for new Events.
-
reportingInstance (string)
reportingInstance is the ID of the controller instance, e.g. kubelet-xyzf. This field cannot be empty for new Events and it can have at most 128 characters.
-
series (EventSeries)
series is data about the Event series this event represents or nil if it's a singleton Event.
EventSeries contains information on a series of events, i.e. something that was or is happening continuously for some time. How often to update the EventSeries is up to the event reporters. The default event reporter in "k8s.io/client-go/tools/events/event_broadcaster.go" shows how this struct is updated on heartbeats and can guide customized reporter implementations.
-
series.count (int32), required
count is the number of occurrences in this series up to the last heartbeat time.
-
series.lastObservedTime (MicroTime), required
lastObservedTime is the time when the last Event from the series was seen before the last heartbeat.
MicroTime is a version of Time with microsecond-level precision.
-
-
type (string)
type is the type of this event (Normal, Warning); new types could be added in the future. It is machine-readable. This field cannot be empty for new Events.
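Events are usually written by controllers through an event recorder rather than authored directly, but a manifest sketch shows how the required and machine-readable fields fit together; all names and values here are hypothetical:

apiVersion: events.k8s.io/v1
kind: Event
metadata:
  name: nginx-6c8d9.17a8be6f2e9c1b23   # conventionally <regarding object name>.<unique suffix>
  namespace: default
eventTime: "2022-06-01T12:00:00.000000Z"   # MicroTime, required
type: Normal
action: ScalingReplicaSet
reason: ScaledUp
note: "Scaled up replica set nginx-6c8d9 from 1 to 3"
regarding:
  apiVersion: apps/v1
  kind: ReplicaSet
  name: nginx-6c8d9
  namespace: default
reportingController: example.com/sample-controller
reportingInstance: sample-controller-abc12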
EventList
EventList is a list of Event objects.
-
apiVersion: events.k8s.io/v1
-
kind: EventList
-
metadata (ListMeta)
Standard list metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
items ([]Event), required
items is a list of schema objects.
Operations
get
read the specified Event
HTTP Request
GET /apis/events.k8s.io/v1/namespaces/{namespace}/events/{name}
Parameters
-
name (in path): string, required
name of the Event
-
namespace (in path): string, required
-
pretty (in query): string
Response
200 (Event): OK
401: Unauthorized
list
list or watch objects of kind Event
HTTP Request
GET /apis/events.k8s.io/v1/namespaces/{namespace}/events
Parameters
-
namespace (in path): string, required
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (EventList): OK
401: Unauthorized
list
list or watch objects of kind Event
HTTP Request
GET /apis/events.k8s.io/v1/events
Parameters
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (EventList): OK
401: Unauthorized
create
create an Event
HTTP Request
POST /apis/events.k8s.io/v1/namespaces/{namespace}/events
Parameters
-
namespace (in path): string, required
-
body: Event, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (Event): OK
201 (Event): Created
202 (Event): Accepted
401: Unauthorized
update
replace the specified Event
HTTP Request
PUT /apis/events.k8s.io/v1/namespaces/{namespace}/events/{name}
Parameters
-
name (in path): string, required
name of the Event
-
namespace (in path): string, required
-
body: Event, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (Event): OK
201 (Event): Created
401: Unauthorized
patch
partially update the specified Event
HTTP Request
PATCH /apis/events.k8s.io/v1/namespaces/{namespace}/events/{name}
Parameters
-
name (in path): string, required
name of the Event
-
namespace (in path): string, required
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (Event): OK
201 (Event): Created
401: Unauthorized
delete
delete an Event
HTTP Request
DELETE /apis/events.k8s.io/v1/namespaces/{namespace}/events/{name}
Parameters
-
name (in path): string, required
name of the Event
-
namespace (in path): string, required
-
body: DeleteOptions
-
dryRun (in query): string
-
gracePeriodSeconds (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
Response
200 (Status): OK
202 (Status): Accepted
401: Unauthorized
deletecollection
delete collection of Event
HTTP Request
DELETE /apis/events.k8s.io/v1/namespaces/{namespace}/events
Parameters
-
namespace (in path): string, required
-
body: DeleteOptions
-
continue (in query): string
-
dryRun (in query): string
-
fieldSelector (in query): string
-
gracePeriodSeconds (in query): integer
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
Response
200 (Status): OK
401: Unauthorized
6.5.8.4 - APIService
apiVersion: apiregistration.k8s.io/v1
import "k8s.io/kube-aggregator/pkg/apis/apiregistration/v1"
APIService
APIService represents a server for a particular GroupVersion. Name must be "version.group".
-
apiVersion: apiregistration.k8s.io/v1
-
kind: APIService
-
metadata (ObjectMeta)
Standard object's metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
spec (APIServiceSpec)
Spec contains information for locating and communicating with a server
-
status (APIServiceStatus)
Status contains derived information about an API server
APIServiceSpec
APIServiceSpec contains information for locating and communicating with a server. Only https is supported, though you are able to disable certificate verification.
-
groupPriorityMinimum (int32), required
GroupPriorityMinimum is the priority this group should have at least. Higher priority means that the group is preferred by clients over lower priority ones. Note that other versions of this group might specify even higher GroupPriorityMinimum values such that the whole group gets a higher priority. The primary sort is based on GroupPriorityMinimum, ordered highest number to lowest (20 before 10). The secondary sort is based on the alphabetical comparison of the name of the object (v1.bar before v1.foo). We'd recommend something like: *.k8s.io (except extensions) at 18000, and PaaSes (OpenShift, Deis) are recommended to be in the 2000s.
-
versionPriority (int32), required
VersionPriority controls the ordering of this API version inside of its group. Must be greater than zero. The primary sort is based on VersionPriority, ordered highest to lowest (20 before 10). Since it's inside of a group, the number can be small, probably in the 10s. In case of equal version priorities, the version string will be used to compute the order inside a group. If the version string is "kube-like", it will sort above non "kube-like" version strings, which are ordered lexicographically. "Kube-like" versions start with a "v", then are followed by a number (the major version), then optionally the string "alpha" or "beta" and another number (the minor version). These are sorted first by GA > beta > alpha (where GA is a version with no suffix such as beta or alpha), and then by comparing major version, then minor version. An example sorted list of versions: v10, v2, v1, v11beta2, v10beta3, v3beta1, v12alpha1, v11alpha2, foo1, foo10.
-
caBundle ([]byte)
Atomic: will be replaced during a merge
CABundle is a PEM encoded CA bundle which will be used to validate an API server's serving certificate. If unspecified, system trust roots on the apiserver are used.
-
group (string)
Group is the API group name this server hosts
-
insecureSkipTLSVerify (boolean)
InsecureSkipTLSVerify disables TLS certificate verification when communicating with this server. This is strongly discouraged. You should use the CABundle instead.
-
service (ServiceReference)
Service is a reference to the service for this API server. It must communicate on port 443. If the Service is nil, the API group/version is handled locally on this server. The call will simply delegate to the normal handler chain to be fulfilled.
ServiceReference holds a reference to Service.legacy.k8s.io
-
service.name (string)
Name is the name of the service
-
service.namespace (string)
Namespace is the namespace of the service
-
service.port (int32)
If specified, the port on the service that is hosting the webhook. Defaults to 443 for backward compatibility. port should be a valid port number (1-65535, inclusive).
-
-
version (string)
Version is the API version this server hosts. For example, "v1"
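For a sketch, an APIService that registers the hypothetical group/version v1beta1.metrics.example.com and delegates it to a Service; the group, Service name, and namespace here are placeholders:
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1beta1.metrics.example.com   # name must be "version.group"
spec:
  group: metrics.example.com          # placeholder API group
  version: v1beta1
  groupPriorityMinimum: 2000          # primary sort, highest number first
  versionPriority: 10                 # ordering of this version inside the group
  service:                            # omit to handle the group/version locally
    name: example-metrics-server      # placeholder Service name
    namespace: example-system         # placeholder namespace
    port: 443
  # caBundle: <base64 PEM bundle>     # validates the server's serving certificate
  # insecureSkipTLSVerify: true       # strongly discouraged; prefer caBundle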
APIServiceStatus
APIServiceStatus contains derived information about an API server
-
conditions ([]APIServiceCondition)
Patch strategy: merge on key type
Map: unique values on key type will be kept during a merge
Current service state of apiService.
APIServiceCondition describes the state of an APIService at a particular point
-
conditions.status (string), required
Status is the status of the condition. Can be True, False, Unknown.
-
conditions.type (string), required
Type is the type of the condition.
-
conditions.lastTransitionTime (Time)
Last time the condition transitioned from one status to another.
Time is a wrapper around time.Time which supports correct marshaling to YAML and JSON. Wrappers are provided for many of the factory methods that the time package offers.
-
conditions.message (string)
Human-readable message indicating details about last transition.
-
conditions.reason (string)
Unique, one-word, CamelCase reason for the condition's last transition.
-
APIServiceList
APIServiceList is a list of APIService objects.
-
apiVersion: apiregistration.k8s.io/v1
-
kind: APIServiceList
-
metadata (ListMeta)
Standard list metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
items ([]APIService), required
Items is the list of APIService
Operations
get
read the specified APIService
HTTP Request
GET /apis/apiregistration.k8s.io/v1/apiservices/{name}
Parameters
-
name (in path): string, required
name of the APIService
-
pretty (in query): string
Response
200 (APIService): OK
401: Unauthorized
get
read status of the specified APIService
HTTP Request
GET /apis/apiregistration.k8s.io/v1/apiservices/{name}/status
Parameters
-
name (in path): string, required
name of the APIService
-
pretty (in query): string
Response
200 (APIService): OK
401: Unauthorized
list
list or watch objects of kind APIService
HTTP Request
GET /apis/apiregistration.k8s.io/v1/apiservices
Parameters
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (APIServiceList): OK
401: Unauthorized
create
create an APIService
HTTP Request
POST /apis/apiregistration.k8s.io/v1/apiservices
Parameters
-
body: APIService, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (APIService): OK
201 (APIService): Created
202 (APIService): Accepted
401: Unauthorized
update
replace the specified APIService
HTTP Request
PUT /apis/apiregistration.k8s.io/v1/apiservices/{name}
Parameters
-
name (in path): string, required
name of the APIService
-
body: APIService, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (APIService): OK
201 (APIService): Created
401: Unauthorized
update
replace status of the specified APIService
HTTP Request
PUT /apis/apiregistration.k8s.io/v1/apiservices/{name}/status
Parameters
-
name (in path): string, required
name of the APIService
-
body: APIService, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (APIService): OK
201 (APIService): Created
401: Unauthorized
patch
partially update the specified APIService
HTTP Request
PATCH /apis/apiregistration.k8s.io/v1/apiservices/{name}
Parameters
-
name (in path): string, required
name of the APIService
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (APIService): OK
201 (APIService): Created
401: Unauthorized
patch
partially update status of the specified APIService
HTTP Request
PATCH /apis/apiregistration.k8s.io/v1/apiservices/{name}/status
Parameters
-
name (in path): string, required
name of the APIService
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (APIService): OK
201 (APIService): Created
401: Unauthorized
delete
delete an APIService
HTTP Request
DELETE /apis/apiregistration.k8s.io/v1/apiservices/{name}
Parameters
-
name (in path): string, required
name of the APIService
-
body: DeleteOptions
-
dryRun (in query): string
-
gracePeriodSeconds (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
Response
200 (Status): OK
202 (Status): Accepted
401: Unauthorized
deletecollection
delete collection of APIService
HTTP Request
DELETE /apis/apiregistration.k8s.io/v1/apiservices
Parameters
-
body: DeleteOptions
-
continue (in query): string
-
dryRun (in query): string
-
fieldSelector (in query): string
-
gracePeriodSeconds (in query): integer
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
Response
200 (Status): OK
401: Unauthorized
6.5.8.5 - Lease
apiVersion: coordination.k8s.io/v1
import "k8s.io/api/coordination/v1"
Lease
Lease defines a lease concept.
-
apiVersion: coordination.k8s.io/v1
-
kind: Lease
-
metadata (ObjectMeta)
More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
spec (LeaseSpec)
Specification of the Lease. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#spec-and-status
LeaseSpec
LeaseSpec is a specification of a Lease.
-
acquireTime (MicroTime)
acquireTime is a time when the current lease was acquired.
MicroTime is a version of Time with microsecond-level precision.
-
holderIdentity (string)
holderIdentity contains the identity of the holder of a current lease.
-
leaseDurationSeconds (int32)
leaseDurationSeconds is a duration that candidates for a lease need to wait to force acquire it. This is measured against the time of the last observed renewTime.
-
leaseTransitions (int32)
leaseTransitions is the number of transitions of a lease between holders.
-
renewTime (MicroTime)
renewTime is a time when the current holder of a lease has last updated the lease.
MicroTime is a version of Time with microsecond-level precision.
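A minimal Lease manifest sketching the spec fields above; the name, holder identity, and timestamps are placeholders:
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: demo-lock                # placeholder name
  namespace: default
spec:
  holderIdentity: demo-holder-a               # placeholder current holder
  leaseDurationSeconds: 15                    # candidates wait this long past renewTime
  acquireTime: "2022-01-01T00:00:00.000000Z"  # MicroTime
  renewTime: "2022-01-01T00:00:30.000000Z"    # updated by the holder on each renewal
  leaseTransitions: 1                         # holder changes so far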
LeaseList
LeaseList is a list of Lease objects.
-
apiVersion: coordination.k8s.io/v1
-
kind: LeaseList
-
metadata (ListMeta)
Standard list metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
items ([]Lease), required
Items is a list of schema objects.
Operations
get
read the specified Lease
HTTP Request
GET /apis/coordination.k8s.io/v1/namespaces/{namespace}/leases/{name}
Parameters
-
name (in path): string, required
name of the Lease
-
namespace (in path): string, required
-
pretty (in query): string
Response
200 (Lease): OK
401: Unauthorized
list
list or watch objects of kind Lease
HTTP Request
GET /apis/coordination.k8s.io/v1/namespaces/{namespace}/leases
Parameters
-
namespace (in path): string, required
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (LeaseList): OK
401: Unauthorized
list
list or watch objects of kind Lease
HTTP Request
GET /apis/coordination.k8s.io/v1/leases
Parameters
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (LeaseList): OK
401: Unauthorized
create
create a Lease
HTTP Request
POST /apis/coordination.k8s.io/v1/namespaces/{namespace}/leases
Parameters
-
namespace (in path): string, required
-
body: Lease, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (Lease): OK
201 (Lease): Created
202 (Lease): Accepted
401: Unauthorized
update
replace the specified Lease
HTTP Request
PUT /apis/coordination.k8s.io/v1/namespaces/{namespace}/leases/{name}
Parameters
-
name (in path): string, required
name of the Lease
-
namespace (in path): string, required
-
body: Lease, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (Lease): OK
201 (Lease): Created
401: Unauthorized
patch
partially update the specified Lease
HTTP Request
PATCH /apis/coordination.k8s.io/v1/namespaces/{namespace}/leases/{name}
Parameters
-
name (in path): string, required
name of the Lease
-
namespace (in path): string, required
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (Lease): OK
201 (Lease): Created
401: Unauthorized
delete
delete a Lease
HTTP Request
DELETE /apis/coordination.k8s.io/v1/namespaces/{namespace}/leases/{name}
Parameters
-
name (in path): string, required
name of the Lease
-
namespace (in path): string, required
-
body: DeleteOptions
-
dryRun (in query): string
-
gracePeriodSeconds (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
Response
200 (Status): OK
202 (Status): Accepted
401: Unauthorized
deletecollection
delete collection of Lease
HTTP Request
DELETE /apis/coordination.k8s.io/v1/namespaces/{namespace}/leases
Parameters
-
namespace (in path): string, required
-
body: DeleteOptions
-
continue (in query): string
-
dryRun (in query): string
-
fieldSelector (in query): string
-
gracePeriodSeconds (in query): integer
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
Response
200 (Status): OK
401: Unauthorized
6.5.8.6 - RuntimeClass
apiVersion: node.k8s.io/v1
import "k8s.io/api/node/v1"
RuntimeClass
RuntimeClass defines a class of container runtime supported in the cluster. The RuntimeClass is used to determine which container runtime is used to run all containers in a pod. RuntimeClasses are manually defined by a user or cluster provisioner, and referenced in the PodSpec. The Kubelet is responsible for resolving the RuntimeClassName reference before running the pod. For more details, see https://kubernetes.io/docs/concepts/containers/runtime-class/
-
apiVersion: node.k8s.io/v1
-
kind: RuntimeClass
-
metadata (ObjectMeta)
More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
handler (string), required
Handler specifies the underlying runtime and configuration that the CRI implementation will use to handle pods of this class. The possible values are specific to the node & CRI configuration. It is assumed that all handlers are available on every node, and handlers of the same name are equivalent on every node. For example, a handler called "runc" might specify that the runc OCI runtime (using native Linux containers) will be used to run the containers in a pod. The Handler must be lowercase, conform to the DNS Label (RFC 1123) requirements, and is immutable.
-
overhead (Overhead)
Overhead represents the resource overhead associated with running a pod for a given RuntimeClass. For more details, see https://kubernetes.io/docs/concepts/scheduling-eviction/pod-overhead/ This field is in beta starting v1.18 and is only honored by servers that enable the PodOverhead feature.
Overhead structure represents the resource overhead associated with running a pod.
-
overhead.podFixed (map[string]Quantity)
PodFixed represents the fixed resource overhead associated with running a pod.
-
-
scheduling (Scheduling)
Scheduling holds the scheduling constraints to ensure that pods running with this RuntimeClass are scheduled to nodes that support it. If scheduling is nil, this RuntimeClass is assumed to be supported by all nodes.
Scheduling specifies the scheduling constraints for nodes supporting a RuntimeClass.
-
scheduling.nodeSelector (map[string]string)
nodeSelector lists labels that must be present on nodes that support this RuntimeClass. Pods using this RuntimeClass can only be scheduled to a node matched by this selector. The RuntimeClass nodeSelector is merged with a pod's existing nodeSelector. Any conflicts will cause the pod to be rejected in admission.
-
scheduling.tolerations ([]Toleration)
Atomic: will be replaced during a merge
tolerations are appended (excluding duplicates) to pods running with this RuntimeClass during admission, effectively unioning the set of nodes tolerated by the pod and the RuntimeClass.
The pod this Toleration is attached to tolerates any taint that matches the triple <key,value,effect> using the matching operator.
-
scheduling.tolerations.key (string)
Key is the taint key that the toleration applies to. Empty means match all taint keys. If the key is empty, operator must be Exists; this combination means to match all values and all keys.
-
scheduling.tolerations.operator (string)
Operator represents a key's relationship to the value. Valid operators are Exists and Equal. Defaults to Equal. Exists is equivalent to wildcard for value, so that a pod can tolerate all taints of a particular category.
Possible enum values:
"Equal"
"Exists"
-
scheduling.tolerations.value (string)
Value is the taint value the toleration matches to. If the operator is Exists, the value should be empty, otherwise just a regular string.
-
scheduling.tolerations.effect (string)
Effect indicates the taint effect to match. Empty means match all taint effects. When specified, allowed values are NoSchedule, PreferNoSchedule and NoExecute.
Possible enum values:
"NoExecute"
Evict any already-running pods that do not tolerate the taint. Currently enforced by NodeController.
"NoSchedule"
Do not allow new pods to schedule onto the node unless they tolerate the taint, but allow all pods submitted to Kubelet without going through the scheduler to start, and allow all already-running pods to continue running. Enforced by the scheduler.
"PreferNoSchedule"
Like TaintEffectNoSchedule, but the scheduler tries not to schedule new pods onto the node, rather than prohibiting new pods from scheduling onto the node entirely. Enforced by the scheduler.
-
scheduling.tolerations.tolerationSeconds (int64)
TolerationSeconds represents the period of time the toleration (which must be of effect NoExecute, otherwise this field is ignored) tolerates the taint. By default, it is not set, which means tolerate the taint forever (do not evict). Zero and negative values will be treated as 0 (evict immediately) by the system.
-
-
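For illustration, a RuntimeClass sketch that assumes the nodes' CRI is configured with a handler named runsc; the handler, node label, and taint key are assumptions for this example, not required values:
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: example-sandboxed             # referenced from a PodSpec's runtimeClassName
handler: runsc                        # assumed CRI handler; lowercase, DNS Label format
overhead:
  podFixed:                           # fixed per-pod resource overhead for this runtime
    cpu: 100m
    memory: 64Mi
scheduling:
  nodeSelector:                       # pods may only schedule to nodes with this label
    example.com/sandboxed: "true"     # placeholder label
  tolerations:                        # appended to pods using this RuntimeClass
  - key: example.com/sandboxed        # placeholder taint key
    operator: Exists
    effect: NoSchedule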
RuntimeClassList
RuntimeClassList is a list of RuntimeClass objects.
-
apiVersion: node.k8s.io/v1
-
kind: RuntimeClassList
-
metadata (ListMeta)
Standard list metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
items ([]RuntimeClass), required
Items is a list of schema objects.
Operations
get
read the specified RuntimeClass
HTTP Request
GET /apis/node.k8s.io/v1/runtimeclasses/{name}
Parameters
-
name (in path): string, required
name of the RuntimeClass
-
pretty (in query): string
Response
200 (RuntimeClass): OK
401: Unauthorized
list
list or watch objects of kind RuntimeClass
HTTP Request
GET /apis/node.k8s.io/v1/runtimeclasses
Parameters
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (RuntimeClassList): OK
401: Unauthorized
create
create a RuntimeClass
HTTP Request
POST /apis/node.k8s.io/v1/runtimeclasses
Parameters
-
body: RuntimeClass, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (RuntimeClass): OK
201 (RuntimeClass): Created
202 (RuntimeClass): Accepted
401: Unauthorized
update
replace the specified RuntimeClass
HTTP Request
PUT /apis/node.k8s.io/v1/runtimeclasses/{name}
Parameters
-
name (in path): string, required
name of the RuntimeClass
-
body: RuntimeClass, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (RuntimeClass): OK
201 (RuntimeClass): Created
401: Unauthorized
patch
partially update the specified RuntimeClass
HTTP Request
PATCH /apis/node.k8s.io/v1/runtimeclasses/{name}
Parameters
-
name (in path): string, required
name of the RuntimeClass
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (RuntimeClass): OK
201 (RuntimeClass): Created
401: Unauthorized
delete
delete a RuntimeClass
HTTP Request
DELETE /apis/node.k8s.io/v1/runtimeclasses/{name}
Parameters
-
name (in path): string, required
name of the RuntimeClass
-
body: DeleteOptions
-
dryRun (in query): string
-
gracePeriodSeconds (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
Response
200 (Status): OK
202 (Status): Accepted
401: Unauthorized
deletecollection
delete collection of RuntimeClass
HTTP Request
DELETE /apis/node.k8s.io/v1/runtimeclasses
Parameters
-
body: DeleteOptions
-
continue (in query): string
-
dryRun (in query): string
-
fieldSelector (in query): string
-
gracePeriodSeconds (in query): integer
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
Response
200 (Status): OK
401: Unauthorized
6.5.8.7 - FlowSchema v1beta2
apiVersion: flowcontrol.apiserver.k8s.io/v1beta2
import "k8s.io/api/flowcontrol/v1beta2"
FlowSchema
FlowSchema defines the schema of a group of flows. Note that a flow is made up of a set of inbound API requests with similar attributes and is identified by a pair of strings: the name of the FlowSchema and a "flow distinguisher".
-
apiVersion: flowcontrol.apiserver.k8s.io/v1beta2
-
kind: FlowSchema
-
metadata (ObjectMeta)
metadata is the standard object's metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
spec (FlowSchemaSpec)
spec is the specification of the desired behavior of a FlowSchema. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#spec-and-status
-
status (FlowSchemaStatus)
status is the current status of a FlowSchema. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#spec-and-status
FlowSchemaSpec
FlowSchemaSpec describes how the FlowSchema's specification looks.
-
priorityLevelConfiguration (PriorityLevelConfigurationReference), required
priorityLevelConfiguration should reference a PriorityLevelConfiguration in the cluster. If the reference cannot be resolved, the FlowSchema will be ignored and marked as invalid in its status. Required.
PriorityLevelConfigurationReference contains information that points to the "request-priority" being used.
-
priorityLevelConfiguration.name (string), required
name is the name of the priority level configuration being referenced. Required.
-
-
distinguisherMethod (FlowDistinguisherMethod)
distinguisherMethod defines how to compute the flow distinguisher for requests that match this schema. nil specifies that the distinguisher is disabled and thus will always be the empty string.
FlowDistinguisherMethod specifies the method of a flow distinguisher.
-
distinguisherMethod.type (string), required
type is the type of flow distinguisher method. The supported types are "ByUser" and "ByNamespace". Required.
-
-
matchingPrecedence (int32)
matchingPrecedence is used to choose among the FlowSchemas that match a given request. The chosen FlowSchema is among those with the numerically lowest (which we take to be logically highest) MatchingPrecedence. Each MatchingPrecedence value must be in the range [1,10000]. Note that if the precedence is not specified, it defaults to 1000.
-
rules ([]PolicyRulesWithSubjects)
Atomic: will be replaced during a merge
rules describes which requests will match this flow schema. This FlowSchema matches a request if and only if at least one member of rules matches the request. If it is an empty slice, there will be no requests matching the FlowSchema.
PolicyRulesWithSubjects prescribes a test that applies to a request to an apiserver. The test considers the subject making the request, the verb being requested, and the resource to be acted upon. This PolicyRulesWithSubjects matches a request if and only if both (a) at least one member of subjects matches the request and (b) at least one member of resourceRules or nonResourceRules matches the request.
-
rules.subjects ([]Subject), required
Atomic: will be replaced during a merge
subjects is the list of normal user, serviceaccount, or group that this rule cares about. There must be at least one member in this slice. A slice that includes both the system:authenticated and system:unauthenticated user groups matches every request. Required.
Subject matches the originator of a request, as identified by the request authentication system. There are three ways of matching an originator; by user, group, or service account.
-
rules.subjects.kind (string), required
kind indicates which one of the other fields is non-empty. Required.
-
rules.subjects.group (GroupSubject)
group matches based on user group name.
GroupSubject holds detailed information for group-kind subject.
-
rules.subjects.group.name (string), required
name is the user group that matches, or "*" to match all user groups. See https://github.com/kubernetes/apiserver/blob/master/pkg/authentication/user/user.go for some well-known group names. Required.
-
-
rules.subjects.serviceAccount (ServiceAccountSubject)
serviceAccount matches ServiceAccounts.
ServiceAccountSubject holds detailed information for service-account-kind subject.
-
rules.subjects.serviceAccount.name (string), required
name is the name of matching ServiceAccount objects, or "*" to match regardless of name. Required.
-
rules.subjects.serviceAccount.namespace (string), required
namespace is the namespace of matching ServiceAccount objects. Required.
-
-
rules.subjects.user (UserSubject)
user matches based on username.
UserSubject holds detailed information for user-kind subject.
-
rules.subjects.user.name (string), required
name is the username that matches, or "*" to match all usernames. Required.
-
-
-
rules.nonResourceRules ([]NonResourcePolicyRule)
Atomic: will be replaced during a merge
nonResourceRules is a list of NonResourcePolicyRules that identify matching requests according to their verb and the target non-resource URL.
NonResourcePolicyRule is a predicate that matches non-resource requests according to their verb and the target non-resource URL. A NonResourcePolicyRule matches a request if and only if both (a) at least one member of verbs matches the request and (b) at least one member of nonResourceURLs matches the request.
-
rules.nonResourceRules.nonResourceURLs ([]string), required
Set: unique values will be kept during a merge
nonResourceURLs is a set of url prefixes that a user should have access to and may not be empty. For example:
- "/healthz" is legal
- "/hea*" is illegal
- "/hea" is legal but matches nothing
- "/hea/*" also matches nothing
- "/healthz/*" matches all per-component health checks.
"*" matches all non-resource urls. If it is present, it must be the only entry. Required.
-
rules.nonResourceRules.verbs ([]string), required
Set: unique values will be kept during a merge
verbs is a list of matching verbs and may not be empty. "*" matches all verbs. If it is present, it must be the only entry. Required.
-
-
rules.resourceRules ([]ResourcePolicyRule)
Atomic: will be replaced during a merge
resourceRules is a slice of ResourcePolicyRules that identify matching requests according to their verb and the target resource. At least one of resourceRules and nonResourceRules has to be non-empty.
ResourcePolicyRule is a predicate that matches some resource requests, testing the request's verb and the target resource. A ResourcePolicyRule matches a resource request if and only if: (a) at least one member of verbs matches the request, (b) at least one member of apiGroups matches the request, (c) at least one member of resources matches the request, and (d) either (d1) the request does not specify a namespace (i.e., Namespace=="") and clusterScope is true or (d2) the request specifies a namespace and at least one member of namespaces matches the request's namespace.
-
rules.resourceRules.apiGroups ([]string), required
Set: unique values will be kept during a merge
apiGroups is a list of matching API groups and may not be empty. "*" matches all API groups and, if present, must be the only entry. Required.
-
rules.resourceRules.resources ([]string), required
Set: unique values will be kept during a merge
resources is a list of matching resources (i.e., lowercase and plural) with, if desired, subresource. For example, [ "services", "nodes/status" ]. This list may not be empty. "*" matches all resources and, if present, must be the only entry. Required.
-
rules.resourceRules.verbs ([]string), required
Set: unique values will be kept during a merge
verbs is a list of matching verbs and may not be empty. "*" matches all verbs and, if present, must be the only entry. Required.
-
rules.resourceRules.clusterScope (boolean)
clusterScope indicates whether to match requests that do not specify a namespace (which happens either because the resource is not namespaced or the request targets all namespaces). If this field is omitted or false then the namespaces field must contain a non-empty list.
-
rules.resourceRules.namespaces ([]string)
Set: unique values will be kept during a merge
namespaces is a list of target namespaces that restricts matches. A request that specifies a target namespace matches only if either (a) this list contains that target namespace or (b) this list contains "*". Note that "*" matches any specified namespace but does not match a request that does not specify a namespace (see the clusterScope field for that). This list may be empty, but only if clusterScope is true.
-
-
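Putting the spec fields together, a FlowSchema sketch; the schema name, the referenced priority level, and the ServiceAccount are placeholders:
apiVersion: flowcontrol.apiserver.k8s.io/v1beta2
kind: FlowSchema
metadata:
  name: example-flow-schema      # placeholder name
spec:
  priorityLevelConfiguration:
    name: example-priority       # must resolve to an existing PriorityLevelConfiguration
  matchingPrecedence: 1000       # lower number is logically higher precedence
  distinguisherMethod:
    type: ByUser                 # ByUser or ByNamespace
  rules:
  - subjects:
    - kind: ServiceAccount
      serviceAccount:
        name: example-controller      # placeholder ServiceAccount
        namespace: example-system     # placeholder namespace
    resourceRules:
    - verbs: ["list", "watch"]
      apiGroups: [""]
      resources: ["pods"]
      clusterScope: true
      namespaces: ["*"]          # "*" matches any specified namespace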
FlowSchemaStatus
FlowSchemaStatus represents the current state of a FlowSchema.
-
conditions ([]FlowSchemaCondition)
Map: unique values on key type will be kept during a merge
conditions is a list of the current states of FlowSchema.
FlowSchemaCondition describes conditions for a FlowSchema.
-
conditions.lastTransitionTime (Time)
lastTransitionTime is the last time the condition transitioned from one status to another.
Time is a wrapper around time.Time which supports correct marshaling to YAML and JSON. Wrappers are provided for many of the factory methods that the time package offers.
-
conditions.message (string)
message is a human-readable message indicating details about last transition.
-
conditions.reason (string)
reason is a unique, one-word, CamelCase reason for the condition's last transition.
-
conditions.status (string)
status is the status of the condition. Can be True, False, Unknown. Required.
-
conditions.type (string)
type is the type of the condition. Required.
-
FlowSchemaList
FlowSchemaList is a list of FlowSchema objects.
-
apiVersion: flowcontrol.apiserver.k8s.io/v1beta2
-
kind: FlowSchemaList
-
metadata (ListMeta)
metadata is the standard list metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
items ([]FlowSchema), required
items is a list of FlowSchemas.
Operations
get
read the specified FlowSchema
HTTP Request
GET /apis/flowcontrol.apiserver.k8s.io/v1beta2/flowschemas/{name}
Parameters
-
name (in path): string, required
name of the FlowSchema
-
pretty (in query): string
Response
200 (FlowSchema): OK
401: Unauthorized
get
read status of the specified FlowSchema
HTTP Request
GET /apis/flowcontrol.apiserver.k8s.io/v1beta2/flowschemas/{name}/status
Parameters
-
name (in path): string, required
name of the FlowSchema
-
pretty (in query): string
Response
200 (FlowSchema): OK
401: Unauthorized
list
list or watch objects of kind FlowSchema
HTTP Request
GET /apis/flowcontrol.apiserver.k8s.io/v1beta2/flowschemas
Parameters
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (FlowSchemaList): OK
401: Unauthorized
create
create a FlowSchema
HTTP Request
POST /apis/flowcontrol.apiserver.k8s.io/v1beta2/flowschemas
Parameters
-
body: FlowSchema, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (FlowSchema): OK
201 (FlowSchema): Created
202 (FlowSchema): Accepted
401: Unauthorized
update
replace the specified FlowSchema
HTTP Request
PUT /apis/flowcontrol.apiserver.k8s.io/v1beta2/flowschemas/{name}
Parameters
-
name (in path): string, required
name of the FlowSchema
-
body: FlowSchema, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (FlowSchema): OK
201 (FlowSchema): Created
401: Unauthorized
update
replace status of the specified FlowSchema
HTTP Request
PUT /apis/flowcontrol.apiserver.k8s.io/v1beta2/flowschemas/{name}/status
Parameters
-
name (in path): string, required
name of the FlowSchema
-
body: FlowSchema, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (FlowSchema): OK
201 (FlowSchema): Created
401: Unauthorized
patch
partially update the specified FlowSchema
HTTP Request
PATCH /apis/flowcontrol.apiserver.k8s.io/v1beta2/flowschemas/{name}
Parameters
-
name (in path): string, required
name of the FlowSchema
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (FlowSchema): OK
201 (FlowSchema): Created
401: Unauthorized
patch
partially update status of the specified FlowSchema
HTTP Request
PATCH /apis/flowcontrol.apiserver.k8s.io/v1beta2/flowschemas/{name}/status
Parameters
-
name (in path): string, required
name of the FlowSchema
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (FlowSchema): OK
201 (FlowSchema): Created
401: Unauthorized
delete
delete a FlowSchema
HTTP Request
DELETE /apis/flowcontrol.apiserver.k8s.io/v1beta2/flowschemas/{name}
Parameters
-
name (in path): string, required
name of the FlowSchema
-
body: DeleteOptions
-
dryRun (in query): string
-
gracePeriodSeconds (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
Response
200 (Status): OK
202 (Status): Accepted
401: Unauthorized
deletecollection
delete collection of FlowSchema
HTTP Request
DELETE /apis/flowcontrol.apiserver.k8s.io/v1beta2/flowschemas
Parameters
-
body: DeleteOptions
-
continue (in query): string
-
dryRun (in query): string
-
fieldSelector (in query): string
-
gracePeriodSeconds (in query): integer
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
Response
200 (Status): OK
401: Unauthorized
6.5.8.8 - PriorityLevelConfiguration v1beta2
apiVersion: flowcontrol.apiserver.k8s.io/v1beta2
import "k8s.io/api/flowcontrol/v1beta2"
PriorityLevelConfiguration
PriorityLevelConfiguration represents the configuration of a priority level.
-
apiVersion: flowcontrol.apiserver.k8s.io/v1beta2
-
kind: PriorityLevelConfiguration
-
metadata (ObjectMeta)
metadata is the standard object's metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
spec (PriorityLevelConfigurationSpec)
spec is the specification of the desired behavior of a "request-priority". More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#spec-and-status
-
status (PriorityLevelConfigurationStatus)
status is the current status of a "request-priority". More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#spec-and-status
PriorityLevelConfigurationSpec
PriorityLevelConfigurationSpec specifies the configuration of a priority level.
-
type (string), required
type indicates whether this priority level is subject to limitation on request execution. A value of "Exempt" means that requests of this priority level are not subject to a limit (and thus are never queued) and do not detract from the capacity made available to other priority levels. A value of "Limited" means that (a) requests of this priority level are subject to limits and (b) some of the server's limited capacity is made available exclusively to this priority level. Required.
-
limited (LimitedPriorityLevelConfiguration)
limited specifies how requests are handled for a Limited priority level. This field must be non-empty if and only if type is "Limited".
LimitedPriorityLevelConfiguration specifies how to handle requests that are subject to limits. It addresses two issues:
- How are requests for this priority level limited?
- What should be done with requests that exceed the limit?
-
limited.assuredConcurrencyShares (int32)
assuredConcurrencyShares (ACS) configures the execution limit, which is a limit on the number of requests of this priority level that may be executing at a given time. ACS must be a positive number. The server's concurrency limit (SCL) is divided among the concurrency-controlled priority levels in proportion to their assured concurrency shares. This produces the assured concurrency value (ACV), the number of requests that may be executing at a time, for each such priority level:
ACV(l) = ceil( SCL * ACS(l) / ( sum[priority levels k] ACS(k) ) )
Bigger ACS values mean more reserved concurrent requests (at the expense of every other priority level). This field has a default value of 30.
-
limited.limitResponse (LimitResponse)
limitResponse indicates what to do with requests that can not be executed right now.
LimitResponse defines how to handle requests that can not be executed right now.
-
limited.limitResponse.type (string), required
type is "Queue" or "Reject". "Queue" means that requests that can not be executed upon arrival are held in a queue until they can be executed or a queuing limit is reached. "Reject" means that requests that can not be executed upon arrival are rejected. Required.
-
limited.limitResponse.queuing (QueuingConfiguration)
queuing holds the configuration parameters for queuing. This field may be non-empty only if type is "Queue".
QueuingConfiguration holds the configuration parameters for queuing.
-
limited.limitResponse.queuing.handSize (int32)
handSize is a small positive number that configures the shuffle sharding of requests into queues. When enqueuing a request at this priority level the request's flow identifier (a string pair) is hashed and the hash value is used to shuffle the list of queues and deal a hand of the size specified here. The request is put into one of the shortest queues in that hand. handSize must be no larger than queues, and should be significantly smaller (so that a few heavy flows do not saturate most of the queues). See the user-facing documentation for more extensive guidance on setting this field. This field has a default value of 8.
-
limited.limitResponse.queuing.queueLengthLimit (int32)
queueLengthLimit is the maximum number of requests allowed to be waiting in a given queue of this priority level at a time; excess requests are rejected. This value must be positive. If not specified, it will be defaulted to 50.
-
limited.limitResponse.queuing.queues (int32)
queues is the number of queues for this priority level. The queues exist independently at each apiserver. The value must be positive. Setting it to 1 effectively precludes shuffle sharding and thus makes the distinguisher method of associated flow schemas irrelevant. This field has a default value of 64.
-
-
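A sketch of a Limited priority level tying the fields above together; the numbers are illustrative, and the comment works through the ACV formula above assuming a server concurrency limit of 600 and 100 total shares across limited levels:
apiVersion: flowcontrol.apiserver.k8s.io/v1beta2
kind: PriorityLevelConfiguration
metadata:
  name: example-priority         # placeholder name
spec:
  type: Limited                  # "Exempt" levels are never queued or limited
  limited:
    assuredConcurrencyShares: 20 # assuming SCL=600 and 100 shares in total:
                                 # ACV = ceil(600 * 20 / 100) = 120 concurrent requests
    limitResponse:
      type: Queue                # Queue or Reject
      queuing:
        queues: 64               # per-apiserver queues for this level
        handSize: 8              # shuffle-sharding hand size; must not exceed queues
        queueLengthLimit: 50     # maximum waiting requests per queue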
PriorityLevelConfigurationStatus
PriorityLevelConfigurationStatus represents the current state of a "request-priority".
-
conditions ([]PriorityLevelConfigurationCondition)
Map: unique values on key type will be kept during a merge
conditions is the current state of "request-priority".
PriorityLevelConfigurationCondition defines the condition of priority level.
-
conditions.lastTransitionTime (Time)
lastTransitionTime is the last time the condition transitioned from one status to another.
Time is a wrapper around time.Time which supports correct marshaling to YAML and JSON. Wrappers are provided for many of the factory methods that the time package offers.
-
conditions.message (string)
message is a human-readable message indicating details about last transition.
-
conditions.reason (string)
reason is a unique, one-word, CamelCase reason for the condition's last transition.
-
conditions.status (string)
status is the status of the condition. Can be True, False, Unknown. Required.
-
conditions.type (string)
type is the type of the condition. Required.
-
PriorityLevelConfigurationList
PriorityLevelConfigurationList is a list of PriorityLevelConfiguration objects.
-
apiVersion: flowcontrol.apiserver.k8s.io/v1beta2
-
kind: PriorityLevelConfigurationList
-
metadata (ListMeta)
metadata is the standard object's metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
items ([]PriorityLevelConfiguration), required
items is a list of request-priorities.
Operations
get
read the specified PriorityLevelConfiguration
HTTP Request
GET /apis/flowcontrol.apiserver.k8s.io/v1beta2/prioritylevelconfigurations/{name}
Parameters
-
name (in path): string, required
name of the PriorityLevelConfiguration
-
pretty (in query): string
Response
200 (PriorityLevelConfiguration): OK
401: Unauthorized
get
read status of the specified PriorityLevelConfiguration
HTTP Request
GET /apis/flowcontrol.apiserver.k8s.io/v1beta2/prioritylevelconfigurations/{name}/status
Parameters
-
name (in path): string, required
name of the PriorityLevelConfiguration
-
pretty (in query): string
Response
200 (PriorityLevelConfiguration): OK
401: Unauthorized
list
list or watch objects of kind PriorityLevelConfiguration
HTTP Request
GET /apis/flowcontrol.apiserver.k8s.io/v1beta2/prioritylevelconfigurations
Parameters
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (PriorityLevelConfigurationList): OK
401: Unauthorized
create
create a PriorityLevelConfiguration
HTTP Request
POST /apis/flowcontrol.apiserver.k8s.io/v1beta2/prioritylevelconfigurations
Parameters
-
body: PriorityLevelConfiguration, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (PriorityLevelConfiguration): OK
201 (PriorityLevelConfiguration): Created
202 (PriorityLevelConfiguration): Accepted
401: Unauthorized
update
replace the specified PriorityLevelConfiguration
HTTP Request
PUT /apis/flowcontrol.apiserver.k8s.io/v1beta2/prioritylevelconfigurations/{name}
Parameters
-
name (in path): string, required
name of the PriorityLevelConfiguration
-
body: PriorityLevelConfiguration, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (PriorityLevelConfiguration): OK
201 (PriorityLevelConfiguration): Created
401: Unauthorized
update
replace status of the specified PriorityLevelConfiguration
HTTP Request
PUT /apis/flowcontrol.apiserver.k8s.io/v1beta2/prioritylevelconfigurations/{name}/status
Parameters
-
name (in path): string, required
name of the PriorityLevelConfiguration
-
body: PriorityLevelConfiguration, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (PriorityLevelConfiguration): OK
201 (PriorityLevelConfiguration): Created
401: Unauthorized
patch
partially update the specified PriorityLevelConfiguration
HTTP Request
PATCH /apis/flowcontrol.apiserver.k8s.io/v1beta2/prioritylevelconfigurations/{name}
Parameters
-
name (in path): string, required
name of the PriorityLevelConfiguration
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (PriorityLevelConfiguration): OK
201 (PriorityLevelConfiguration): Created
401: Unauthorized
patch
partially update status of the specified PriorityLevelConfiguration
HTTP Request
PATCH /apis/flowcontrol.apiserver.k8s.io/v1beta2/prioritylevelconfigurations/{name}/status
Parameters
-
name (in path): string, required
name of the PriorityLevelConfiguration
-
body: Patch, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
force (in query): boolean
-
pretty (in query): string
Response
200 (PriorityLevelConfiguration): OK
201 (PriorityLevelConfiguration): Created
401: Unauthorized
delete
delete a PriorityLevelConfiguration
HTTP Request
DELETE /apis/flowcontrol.apiserver.k8s.io/v1beta2/prioritylevelconfigurations/{name}
Parameters
-
name (in path): string, required
name of the PriorityLevelConfiguration
-
body: DeleteOptions
-
dryRun (in query): string
-
gracePeriodSeconds (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
Response
200 (Status): OK
202 (Status): Accepted
401: Unauthorized
deletecollection
delete collection of PriorityLevelConfiguration
HTTP Request
DELETE /apis/flowcontrol.apiserver.k8s.io/v1beta2/prioritylevelconfigurations
Parameters
-
body: DeleteOptions
-
continue (in query): string
-
dryRun (in query): string
-
fieldSelector (in query): string
-
gracePeriodSeconds (in query): integer
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
propagationPolicy (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
Response
200 (Status): OK
401: Unauthorized
6.5.8.9 - Binding
apiVersion: v1
import "k8s.io/api/core/v1"
Binding
Binding ties one object to another; for example, a pod is bound to a node by a scheduler. Deprecated in 1.7, please use the bindings subresource of pods instead.
-
apiVersion: v1
-
kind: Binding
-
metadata (ObjectMeta)
Standard object's metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
target (ObjectReference), required
The target object that you want to bind to the standard object.
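For illustration, the Binding body a scheduler might POST to a pod's binding subresource; the pod and node names are placeholders:
apiVersion: v1
kind: Binding
metadata:
  name: demo-pod                 # must match the Pod being bound
  namespace: default
target:                          # the object to bind the Pod to
  apiVersion: v1
  kind: Node
  name: demo-node                # placeholder node name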
Operations
create
create a Binding
HTTP Request
POST /api/v1/namespaces/{namespace}/bindings
Parameters
-
namespace (in path): string, required
-
body: Binding, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (Binding): OK
201 (Binding): Created
202 (Binding): Accepted
401: Unauthorized
create
create binding of a Pod
HTTP Request
POST /api/v1/namespaces/{namespace}/pods/{name}/binding
Parameters
-
name (in path): string, required
name of the Binding
-
namespace (in path): string, required
-
body: Binding, required
-
dryRun (in query): string
-
fieldManager (in query): string
-
fieldValidation (in query): string
-
pretty (in query): string
Response
200 (Binding): OK
201 (Binding): Created
202 (Binding): Accepted
401: Unauthorized
6.5.8.10 - ComponentStatus
apiVersion: v1
import "k8s.io/api/core/v1"
ComponentStatus
ComponentStatus (and ComponentStatusList) holds the cluster validation info. Deprecated: This API is deprecated in v1.19+
-
apiVersion: v1
-
kind: ComponentStatus
-
metadata (ObjectMeta)
Standard object's metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
-
conditions ([]ComponentCondition)
Patch strategy: merge on key
type
List of component conditions observed
Information about the condition of a component.
-
conditions.status (string), required
Status of the condition for a component. Valid values for "Healthy": "True", "False", or "Unknown".
-
conditions.type (string), required
Type of condition for a component. Valid value: "Healthy"
-
conditions.error (string)
Condition error code for a component. For example, a health check error code.
-
conditions.message (string)
Message about the condition for a component. For example, information about a health check.
-
ComponentStatusList
Status of all the conditions for the component as a list of ComponentStatus objects. Deprecated: This API is deprecated in v1.19+
-
apiVersion: v1
-
kind: ComponentStatusList
-
metadata (ListMeta)
Standard list metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds
-
items ([]ComponentStatus), required
List of ComponentStatus objects.
Operations
get
read the specified ComponentStatus
HTTP Request
GET /api/v1/componentstatuses/{name}
Parameters
-
name (in path): string, required
name of the ComponentStatus
-
pretty (in query): string
Response
200 (ComponentStatus): OK
401: Unauthorized
list
list objects of kind ComponentStatus
HTTP Request
GET /api/v1/componentstatuses
Parameters
-
allowWatchBookmarks (in query): boolean
-
continue (in query): string
-
fieldSelector (in query): string
-
labelSelector (in query): string
-
limit (in query): integer
-
pretty (in query): string
-
resourceVersion (in query): string
-
resourceVersionMatch (in query): string
-
timeoutSeconds (in query): integer
-
watch (in query): boolean
Response
200 (ComponentStatusList): OK
401: Unauthorized
6.5.9 - Common Definitions
6.5.9.1 - DeleteOptions
import "k8s.io/apimachinery/pkg/apis/meta/v1"
DeleteOptions may be provided when deleting an API object.
-
apiVersion (string)
APIVersion defines the versioned schema of this representation of an object. Servers should convert recognized schemas to the latest internal value, and may reject unrecognized values. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources
-
dryRun ([]string)
When present, indicates that modifications should not be persisted. An invalid or unrecognized dryRun directive will result in an error response and no further processing of the request. Valid values are:
- All: all dry run stages will be processed
-
gracePeriodSeconds (int64)
The duration in seconds before the object should be deleted. The value must be a non-negative integer; zero means delete immediately. If this value is nil, the default grace period for the specified type will be used.
-
kind (string)
Kind is a string value representing the REST resource this object represents. Servers may infer this from the endpoint the client submits requests to. Cannot be updated. In CamelCase. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds
-
orphanDependents (boolean)
Deprecated: please use PropagationPolicy instead; this field was deprecated in 1.7. Should the dependent objects be orphaned. If true/false, the "orphan" finalizer will be added to/removed from the object's finalizers list. Either this field or PropagationPolicy may be set, but not both.
-
preconditions (Preconditions)
Must be fulfilled before a deletion is carried out. If not possible, a 409 Conflict status will be returned.
Preconditions must be fulfilled before an operation (update, delete, etc.) is carried out.
-
preconditions.resourceVersion (string)
Specifies the target ResourceVersion
-
preconditions.uid (string)
Specifies the target UID.
-
-
propagationPolicy (string)
Whether and how garbage collection will be performed. Either this field or OrphanDependents may be set, but not both. The default policy is decided by the existing finalizer set in the metadata.finalizers and the resource-specific default policy. Acceptable values are: 'Orphan' - orphan the dependents; 'Background' - allow the garbage collector to delete the dependents in the background; 'Foreground' - a cascading policy that deletes all dependents in the foreground.
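Tying the fields above together, a minimal sketch of a DeleteOptions body as it might accompany a delete request; the UID precondition is a hypothetical placeholder.
apiVersion: v1
kind: DeleteOptions
gracePeriodSeconds: 30         # allow up to 30 seconds of graceful termination
propagationPolicy: Foreground  # delete dependents before removing the owner
preconditions:
  uid: "example-uid"           # hypothetical; the delete fails with 409 Conflict if the UID differs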
6.5.9.2 - LabelSelector
import "k8s.io/apimachinery/pkg/apis/meta/v1"
A label selector is a label query over a set of resources. The result of matchLabels and matchExpressions are ANDed. An empty label selector matches all objects. A null label selector matches no objects.
-
matchExpressions ([]LabelSelectorRequirement)
matchExpressions is a list of label selector requirements. The requirements are ANDed.
A label selector requirement is a selector that contains values, a key, and an operator that relates the key and values.
-
matchExpressions.key (string), required
Patch strategy: merge on key "key"
key is the label key that the selector applies to.
-
matchExpressions.operator (string), required
operator represents a key's relationship to a set of values. Valid operators are In, NotIn, Exists and DoesNotExist.
-
matchExpressions.values ([]string)
values is an array of string values. If the operator is In or NotIn, the values array must be non-empty. If the operator is Exists or DoesNotExist, the values array must be empty. This array is replaced during a strategic merge patch.
-
-
matchLabels (map[string]string)
matchLabels is a map of {key,value} pairs. A single {key,value} in the matchLabels map is equivalent to an element of matchExpressions, whose key field is "key", the operator is "In", and the values array contains only "value". The requirements are ANDed.
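To make that equivalence concrete, the two selector fragments below select the same objects; the shorthand matchLabels entry expands to a matchExpressions requirement with the In operator.
# Shorthand form:
selector:
  matchLabels:
    tier: frontend
# Equivalent expanded form:
selector:
  matchExpressions:
  - key: tier
    operator: In
    values: ["frontend"]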
6.5.9.3 - ListMeta
import "k8s.io/apimachinery/pkg/apis/meta/v1"
ListMeta describes metadata that synthetic resources must have, including lists and various status objects. A resource may have only one of {ObjectMeta, ListMeta}.
-
continue (string)
continue may be set if the user set a limit on the number of items returned, and indicates that the server has more data available. The value is opaque and may be used to issue another request to the endpoint that served this list to retrieve the next set of available objects. Continuing a consistent list may not be possible if the server configuration has changed or more than a few minutes have passed. The resourceVersion field returned when using this continue value will be identical to the value in the first response, unless you have received this token from an error message.
-
remainingItemCount (int64)
remainingItemCount is the number of subsequent items in the list which are not included in this list response. If the list request contained label or field selectors, then the number of remaining items is unknown and the field will be left unset and omitted during serialization. If the list is complete (either because it is not chunking or because this is the last chunk), then there are no more remaining items and this field will be left unset and omitted during serialization. Servers older than v1.15 do not set this field. The intended use of the remainingItemCount is estimating the size of a collection. Clients should not rely on the remainingItemCount to be set or to be exact.
-
resourceVersion (string)
String that identifies the server's internal version of this object that can be used by clients to determine when objects have changed. Value must be treated as opaque by clients and passed unmodified back to the server. Populated by the system. Read-only. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#concurrency-control-and-consistency
-
selfLink (string)
selfLink is a URL representing this object. Populated by the system. Read-only.
DEPRECATED Kubernetes will stop propagating this field in 1.20 release and the field is planned to be removed in 1.21 release.
6.5.9.4 - LocalObjectReference
import "k8s.io/api/core/v1"
LocalObjectReference contains enough information to let you locate the referenced object inside the same namespace.
-
name (string)
Name of the referent. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names
6.5.9.5 - NodeSelectorRequirement
import "k8s.io/api/core/v1"
A node selector requirement is a selector that contains values, a key, and an operator that relates the key and values.
-
key (string), required
The label key that the selector applies to.
-
operator (string), required
Represents a key's relationship to a set of values. Valid operators are In, NotIn, Exists, DoesNotExist, Gt, and Lt.
Possible enum values:
"DoesNotExist"
"Exists"
"Gt"
"In"
"Lt"
"NotIn"
-
values ([]string)
An array of string values. If the operator is In or NotIn, the values array must be non-empty. If the operator is Exists or DoesNotExist, the values array must be empty. If the operator is Gt or Lt, the values array must have a single element, which will be interpreted as an integer. This array is replaced during a strategic merge patch.
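A sketch of a requirement using the Gt operator inside a Pod's node affinity; the label key is hypothetical.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: example.com/cpu-generation  # hypothetical node label
          operator: Gt
          values: ["5"]                    # exactly one element, interpreted as an integer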
6.5.9.6 - ObjectFieldSelector
import "k8s.io/api/core/v1"
ObjectFieldSelector selects an APIVersioned field of an object.
-
fieldPath (string), required
Path of the field to select in the specified API version.
-
apiVersion (string)
Version of the schema the FieldPath is written in terms of, defaults to "v1".
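ObjectFieldSelector is the type behind fieldRef in the downward API; a minimal sketch exposing the Pod's namespace to a container.
env:
- name: MY_POD_NAMESPACE
  valueFrom:
    fieldRef:
      apiVersion: v1                 # the default when omitted
      fieldPath: metadata.namespace  # the field selected from this Pod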
6.5.9.7 - ObjectMeta
import "k8s.io/apimachinery/pkg/apis/meta/v1"
ObjectMeta is metadata that all persisted resources must have, which includes all objects users must create.
-
name (string)
Name must be unique within a namespace. Is required when creating resources, although some resources may allow a client to request the generation of an appropriate name automatically. Name is primarily intended for creation idempotence and configuration definition. Cannot be updated. More info: http://kubernetes.io/docs/user-guide/identifiers#names
-
generateName (string)
GenerateName is an optional prefix, used by the server, to generate a unique name ONLY IF the Name field has not been provided. If this field is used, the name returned to the client will be different than the name passed. This value will also be combined with a unique suffix. The provided value has the same validation rules as the Name field, and may be truncated by the length of the suffix required to make the value unique on the server.
If this field is specified and the generated name exists, the server will NOT return a 409 - instead, it will either return 201 Created or 500 with Reason ServerTimeout indicating a unique name could not be found in the time allotted, and the client should retry (optionally after the time indicated in the Retry-After header).
Applied only if Name is not specified. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#idempotency
-
namespace (string)
Namespace defines the space within which each name must be unique. An empty namespace is equivalent to the "default" namespace, but "default" is the canonical representation. Not all objects are required to be scoped to a namespace - the value of this field for those objects will be empty.
Must be a DNS_LABEL. Cannot be updated. More info: http://kubernetes.io/docs/user-guide/namespaces
-
labels (map[string]string)
Map of string keys and values that can be used to organize and categorize (scope and select) objects. May match selectors of replication controllers and services. More info: http://kubernetes.io/docs/user-guide/labels
-
annotations (map[string]string)
Annotations is an unstructured key value map stored with a resource that may be set by external tools to store and retrieve arbitrary metadata. They are not queryable and should be preserved when modifying objects. More info: http://kubernetes.io/docs/user-guide/annotations
System
-
finalizers ([]string)
Must be empty before the object is deleted from the registry. Each entry is an identifier for the responsible component that will remove the entry from the list. If the deletionTimestamp of the object is non-nil, entries in this list can only be removed, not added. Finalizers may be processed and removed in any order. Order is NOT enforced because it introduces significant risk of stuck finalizers. finalizers is a shared field; any actor with permission can reorder it. If the finalizer list is processed in order, then this can lead to a situation in which the component responsible for the first finalizer in the list is waiting for a signal (field value, external system, or other) produced by a component responsible for a finalizer later in the list, resulting in a deadlock. Without enforced ordering, finalizers are free to order amongst themselves and are not vulnerable to ordering changes in the list.
-
managedFields ([]ManagedFieldsEntry)
ManagedFields maps workflow-id and version to the set of fields that are managed by that workflow. This is mostly for internal housekeeping, and users typically shouldn't need to set or understand this field. A workflow can be the user's name, a controller's name, or the name of a specific apply path like "ci-cd". The set of fields is always in the version that the workflow used when modifying the object.
ManagedFieldsEntry is a workflow-id, a FieldSet and the group version of the resource that the fieldset applies to.
-
managedFields.apiVersion (string)
APIVersion defines the version of this resource that this field set applies to. The format is "group/version" just like the top-level APIVersion field. It is necessary to track the version of a field set because it cannot be automatically converted.
-
managedFields.fieldsType (string)
FieldsType is the discriminator for the different fields format and version. There is currently only one possible value: "FieldsV1"
-
managedFields.fieldsV1 (FieldsV1)
FieldsV1 holds the first JSON version format as described in the "FieldsV1" type.
FieldsV1 stores a set of fields in a data structure like a Trie, in JSON format.
Each key is either a '.' representing the field itself, and will always map to an empty set, or a string representing a sub-field or item. The string will follow one of these four formats: 'f:<name>', where <name> is the name of a field in a struct, or key in a map; 'v:<value>', where <value> is the exact json formatted value of a list item; 'i:<index>', where <index> is the position of an item in a list; 'k:<keys>', where <keys> is a map of a list item's key fields to their unique values. If a key maps to an empty Fields value, the field that key represents is part of the set. The exact format is defined in sigs.k8s.io/structured-merge-diff.
-
managedFields.manager (string)
Manager is an identifier of the workflow managing these fields.
-
managedFields.operation (string)
Operation is the type of operation which led to this ManagedFieldsEntry being created. The only valid values for this field are 'Apply' and 'Update'.
-
managedFields.subresource (string)
Subresource is the name of the subresource used to update that object, or empty string if the object was updated through the main resource. The value of this field is used to distinguish between managers, even if they share the same name. For example, a status update will be distinct from a regular update using the same manager name. Note that the APIVersion field is not related to the Subresource field and it always corresponds to the version of the main resource.
-
managedFields.time (Time)
Time is the timestamp of when these fields were set. It should always be empty if Operation is 'Apply'.
Time is a wrapper around time.Time which supports correct marshaling to YAML and JSON. Wrappers are provided for many of the factory methods that the time package offers.
-
-
ownerReferences ([]OwnerReference)
Patch strategy: merge on key "uid"
List of objects depended on by this object. If ALL objects in the list have been deleted, this object will be garbage collected. If this object is managed by a controller, then an entry in this list will point to this controller, with the controller field set to true. There cannot be more than one managing controller.
OwnerReference contains enough information to let you identify an owning object. An owning object must be in the same namespace as the dependent, or be cluster-scoped, so there is no namespace field.
-
ownerReferences.apiVersion (string), required
API version of the referent.
-
ownerReferences.kind (string), required
Kind of the referent. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds
-
ownerReferences.name (string), required
Name of the referent. More info: http://kubernetes.io/docs/user-guide/identifiers#names
-
ownerReferences.uid (string), required
UID of the referent. More info: http://kubernetes.io/docs/user-guide/identifiers#uids
-
ownerReferences.blockOwnerDeletion (boolean)
If true, AND if the owner has the "foregroundDeletion" finalizer, then the owner cannot be deleted from the key-value store until this reference is removed. Defaults to false. To set this field, a user needs "delete" permission of the owner, otherwise 422 (Unprocessable Entity) will be returned.
-
ownerReferences.controller (boolean)
If true, this reference points to the managing controller.
-
Read-only
-
creationTimestamp (Time)
CreationTimestamp is a timestamp representing the server time when this object was created. It is not guaranteed to be set in happens-before order across separate operations. Clients may not set this value. It is represented in RFC3339 form and is in UTC.
Populated by the system. Read-only. Null for lists. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
Time is a wrapper around time.Time which supports correct marshaling to YAML and JSON. Wrappers are provided for many of the factory methods that the time package offers.
-
deletionGracePeriodSeconds (int64)
Number of seconds allowed for this object to gracefully terminate before it will be removed from the system. Only set when deletionTimestamp is also set. May only be shortened. Read-only.
-
deletionTimestamp (Time)
DeletionTimestamp is RFC 3339 date and time at which this resource will be deleted. This field is set by the server when a graceful deletion is requested by the user, and is not directly settable by a client. The resource is expected to be deleted (no longer visible from resource lists, and not reachable by name) after the time in this field, once the finalizers list is empty. As long as the finalizers list contains items, deletion is blocked. Once the deletionTimestamp is set, this value may not be unset or be set further into the future, although it may be shortened or the resource may be deleted prior to this time. For example, a user may request that a pod is deleted in 30 seconds. The Kubelet will react by sending a graceful termination signal to the containers in the pod. After that 30 seconds, the Kubelet will send a hard termination signal (SIGKILL) to the container and after cleanup, remove the pod from the API. In the presence of network partitions, this object may still exist after this timestamp, until an administrator or automated process can determine the resource is fully terminated. If not set, graceful deletion of the object has not been requested.
Populated by the system when a graceful deletion is requested. Read-only. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
Time is a wrapper around time.Time which supports correct marshaling to YAML and JSON. Wrappers are provided for many of the factory methods that the time package offers.
-
generation (int64)
A sequence number representing a specific generation of the desired state. Populated by the system. Read-only.
-
resourceVersion (string)
An opaque value that represents the internal version of this object that can be used by clients to determine when objects have changed. May be used for optimistic concurrency, change detection, and the watch operation on a resource or set of resources. Clients must treat these values as opaque and passed unmodified back to the server. They may only be valid for a particular resource or set of resources.
Populated by the system. Read-only. Value must be treated as opaque by clients and passed unmodified back to the server. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#concurrency-control-and-consistency
-
selfLink (string)
SelfLink is a URL representing this object. Populated by the system. Read-only.
DEPRECATED Kubernetes will stop propagating this field in 1.20 release and the field is planned to be removed in 1.21 release.
-
uid (string)
UID is the unique in time and space value for this object. It is typically generated by the server on successful creation of a resource and is not allowed to change on PUT operations.
Populated by the system. Read-only. More info: http://kubernetes.io/docs/user-guide/identifiers#uids
Ignored
-
clusterName (string)
The name of the cluster which the object belongs to. This is used to distinguish resources with same name and namespace in different clusters. This field is not set anywhere right now and apiserver is going to ignore it if set in create or update request.
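Putting several of the user-settable fields above together, a sketch of object metadata as a client might submit it (system-populated, read-only fields omitted); the label, annotation, and finalizer values are hypothetical.
metadata:
  generateName: job-        # the server appends a unique suffix to form the name
  namespace: default
  labels:
    app: demo               # hypothetical label
  annotations:
    example.com/note: "set by an external tool"  # hypothetical annotation
  finalizers:
  - example.com/cleanup     # hypothetical finalizer identifier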
6.5.9.8 - ObjectReference
import "k8s.io/api/core/v1"
ObjectReference contains enough information to let you inspect or modify the referred object.
-
apiVersion (string)
API version of the referent.
-
fieldPath (string)
If referring to a piece of an object instead of an entire object, this string should contain a valid JSON/Go field access statement, such as desiredState.manifest.containers[2]. For example, if the object reference is to a container within a pod, this would take on a value like: "spec.containers{name}" (where "name" refers to the name of the container that triggered the event) or if no container name is specified "spec.containers[2]" (container with index 2 in this pod). This syntax is chosen only to have some well-defined way of referencing a part of an object.
-
kind (string)
Kind of the referent. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds
-
name (string)
Name of the referent. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names
-
namespace (string)
Namespace of the referent. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/
-
resourceVersion (string)
Specific resourceVersion to which this reference is made, if any. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#concurrency-control-and-consistency
-
uid (string)
UID of the referent. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#uids
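A sketch of a fully qualified ObjectReference, such as the involvedObject of an Event; the name and UID are placeholders.
involvedObject:
  apiVersion: v1
  kind: Pod
  namespace: default
  name: example-pod                # placeholder
  uid: "00000000-0000-0000-0000-000000000000"  # placeholder
  fieldPath: spec.containers{app}  # the container named "app" within the Pod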
6.5.9.9 - Patch
import "k8s.io/apimachinery/pkg/apis/meta/v1"
Patch is provided to give a concrete name and type to the Kubernetes PATCH request body.
6.5.9.10 - Quantity
import "k8s.io/apimachinery/pkg/api/resource"
Quantity is a fixed-point representation of a number. It provides convenient marshaling/unmarshaling in JSON and YAML, in addition to String() and AsInt64() accessors.
The serialization format is:
<quantity>        ::= <signedNumber><suffix>
  (Note that <suffix> may be empty, from the "" case in <decimalSI>.)
<digit>           ::= 0 | 1 | ... | 9
<digits>          ::= <digit> | <digit><digits>
<number>          ::= <digits> | <digits>.<digits> | <digits>. | .<digits>
<sign>            ::= "+" | "-"
<signedNumber>    ::= <number> | <sign><number>
<suffix>          ::= <binarySI> | <decimalExponent> | <decimalSI>
<binarySI>        ::= Ki | Mi | Gi | Ti | Pi | Ei
  (International System of units; see: http://physics.nist.gov/cuu/Units/binary.html)
<decimalSI>       ::= m | "" | k | M | G | T | P | E
  (Note that 1024 = 1Ki but 1000 = 1k; I didn't choose the capitalization.)
<decimalExponent> ::= "e" <signedNumber> | "E" <signedNumber>
No matter which of the three exponent forms is used, no quantity may represent a number greater than 2^63-1 in magnitude, nor may it have more than 3 decimal places. Numbers larger or more precise will be capped or rounded up. (E.g.: 0.1m will be rounded up to 1m.) This may be extended in the future if we require larger or smaller quantities.
When a Quantity is parsed from a string, it will remember the type of suffix it had, and will use the same type again when it is serialized.
Before serializing, Quantity will be put in "canonical form". This means that Exponent/suffix will be adjusted up or down (with a corresponding increase or decrease in Mantissa) such that: a. No precision is lost b. No fractional digits will be emitted c. The exponent (or suffix) is as large as possible. The sign will be omitted unless the number is negative.
Examples: 1.5 will be serialized as "1500m" 1.5Gi will be serialized as "1536Mi"
Note that the quantity will NEVER be internally represented by a floating point number. That is the whole point of this exercise.
Non-canonical values will still parse as long as they are well formed, but will be re-emitted in their canonical form. (So always use canonical form, or don't diff.)
This format is intended to make it difficult to use these numbers without writing some sort of special handling code in the hopes that that will cause implementors to also use a fixed point implementation.
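A few quantities as they commonly appear in container resource specifications; per the canonical-form rules above, 1.5Gi serializes as 1536Mi and 0.5 (CPU) as 500m.
resources:
  requests:
    cpu: 500m        # canonical form of 0.5
    memory: 1536Mi   # canonical form of 1.5Gi
  limits:
    cpu: "1"
    memory: 2Gi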
6.5.9.11 - ResourceFieldSelector
import "k8s.io/api/core/v1"
ResourceFieldSelector represents container resources (cpu, memory) and their output format
-
resource (string), required
Required: resource to select
-
containerName (string)
Container name: required for volumes, optional for env vars
-
divisor (Quantity)
Specifies the output format of the exposed resources, defaults to "1"
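A sketch of resourceFieldRef exposing a container's memory limit as an environment variable, using divisor to scale the output to mebibytes; the container name is hypothetical.
env:
- name: MEMORY_LIMIT_MI
  valueFrom:
    resourceFieldRef:
      containerName: app      # optional for env vars; hypothetical name
      resource: limits.memory
      divisor: 1Mi            # exposed value = limits.memory divided by 1Mi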
6.5.9.12 - Status
import "k8s.io/apimachinery/pkg/apis/meta/v1"
Status is a return value for calls that don't return other objects.
-
apiVersion (string)
APIVersion defines the versioned schema of this representation of an object. Servers should convert recognized schemas to the latest internal value, and may reject unrecognized values. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources
-
code (int32)
Suggested HTTP return code for this status, 0 if not set.
-
details (StatusDetails)
Extended data associated with the reason. Each reason may define its own extended details. This field is optional and the data returned is not guaranteed to conform to any schema except that defined by the reason type.
StatusDetails is a set of additional properties that MAY be set by the server to provide additional information about a response. The Reason field of a Status object defines what attributes will be set. Clients must ignore fields that do not match the defined type of each attribute, and should assume that any attribute may be empty, invalid, or underdefined.
-
details.causes ([]StatusCause)
The Causes array includes more details associated with the StatusReason failure. Not all StatusReasons may provide detailed causes.
StatusCause provides more information about an api.Status failure, including cases when multiple errors are encountered.
-
details.causes.field (string)
The field of the resource that has caused this error, as named by its JSON serialization. May include dot and postfix notation for nested attributes. Arrays are zero-indexed. Fields may appear more than once in an array of causes due to fields having multiple errors. Optional.
Examples: "name" - the field "name" on the current resource "items[0].name" - the field "name" on the first array entry in "items"
-
details.causes.message (string)
A human-readable description of the cause of the error. This field may be presented as-is to a reader.
-
details.causes.reason (string)
A machine-readable description of the cause of the error. If this value is empty there is no information available.
-
-
details.group (string)
The group attribute of the resource associated with the status StatusReason.
-
details.kind (string)
The kind attribute of the resource associated with the status StatusReason. On some operations may differ from the requested resource Kind. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds
-
details.name (string)
The name attribute of the resource associated with the status StatusReason (when there is a single name which can be described).
-
details.retryAfterSeconds (int32)
If specified, the time in seconds before the operation should be retried. Some errors may indicate the client must take an alternate action - for those errors this field may indicate how long to wait before taking the alternate action.
-
details.uid (string)
UID of the resource. (when there is a single resource which can be described). More info: http://kubernetes.io/docs/user-guide/identifiers#uids
-
-
kind (string)
Kind is a string value representing the REST resource this object represents. Servers may infer this from the endpoint the client submits requests to. Cannot be updated. In CamelCase. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds
-
message (string)
A human-readable description of the status of this operation.
-
metadata (ListMeta)
Standard list metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds
-
reason (string)
A machine-readable description of why this operation is in the "Failure" status. If this value is empty there is no information available. A Reason clarifies an HTTP status code but does not override it.
-
status (string)
Status of the operation. One of: "Success" or "Failure". More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#spec-and-status
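For illustration, a Status object as it might be returned for a failed read of a missing Pod; the message and resource name are illustrative.
apiVersion: v1
kind: Status
status: Failure
message: pods "example" not found   # illustrative, human-readable
reason: NotFound                    # machine-readable; clarifies the code below
details:
  kind: pods
  name: example
code: 404                           # suggested HTTP return code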
6.5.9.13 - TypedLocalObjectReference
import "k8s.io/api/core/v1"
TypedLocalObjectReference contains enough information to let you locate the typed referenced object inside the same namespace.
-
kind (string), required
Kind is the type of resource being referenced
-
name (string), required
Name is the name of resource being referenced
-
apiGroup (string)
APIGroup is the group for the resource being referenced. If APIGroup is not specified, the specified Kind must be in the core API group. For any other third-party types, APIGroup is required.
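One place TypedLocalObjectReference appears is the dataSource of a PersistentVolumeClaim; a sketch, with a hypothetical snapshot name.
spec:
  dataSource:
    apiGroup: snapshot.storage.k8s.io  # required because the Kind is outside the core API group
    kind: VolumeSnapshot
    name: my-snapshot                  # hypothetical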
6.5.10 - Common Parameters
allowWatchBookmarks
allowWatchBookmarks requests watch events with type "BOOKMARK". Servers that do not implement bookmarks may ignore this flag and bookmarks are sent at the server's discretion. Clients should not assume bookmarks are returned at any specific interval, nor may they assume the server will send any BOOKMARK event during a session. If this is not a watch, this field is ignored.
continue
The continue option should be set when retrieving more results from the server. Since this value is server defined, clients may only use the continue value from a previous query result with identical query parameters (except for the value of continue) and the server may reject a continue value it does not recognize. If the specified continue value is no longer valid whether due to expiration (generally five to fifteen minutes) or a configuration change on the server, the server will respond with a 410 ResourceExpired error together with a continue token. If the client needs a consistent list, it must restart their list without the continue field. Otherwise, the client may send another list request with the token received with the 410 error, the server will respond with a list starting from the next key, but from the latest snapshot, which is inconsistent from the previous list results - objects that are created, modified, or deleted after the first list request will be included in the response, as long as their keys are after the "next key".
This field is not supported when watch is true. Clients may start a watch from the last resourceVersion value returned by the server and not miss any modifications.
dryRun
When present, indicates that modifications should not be persisted. An invalid or unrecognized dryRun directive will result in an error response and no further processing of the request. Valid values are: "All" (all dry run stages will be processed).
fieldManager
fieldManager is a name associated with the actor or entity that is making these changes. The value must be 128 characters or fewer, and only contain printable characters, as defined by https://golang.org/pkg/unicode/#IsPrint.
fieldSelector
A selector to restrict the list of returned objects by their fields. Defaults to everything.
fieldValidation
fieldValidation determines how the server should respond to unknown/duplicate fields in the object in the request. Introduced as alpha in 1.23, older servers or servers with the ServerSideFieldValidation feature disabled will discard valid values specified in this param and not perform any server side field validation. Valid values are:
- Ignore: ignores unknown/duplicate fields.
- Warn: responds with a warning for each unknown/duplicate field, but successfully serves the request.
- Strict: fails the request on unknown/duplicate fields.
force
Force is going to "force" Apply requests. It means the user will re-acquire conflicting fields owned by other people. The force flag must be unset for non-apply patch requests.
gracePeriodSeconds
The duration in seconds before the object should be deleted. Value must be a non-negative integer. The value zero indicates delete immediately. If this value is nil, the default grace period for the specified type will be used.
labelSelector
A selector to restrict the list of returned objects by their labels. Defaults to everything.
limit
limit is a maximum number of responses to return for a list call. If more items exist, the server will set the continue field on the list metadata to a value that can be used with the same initial query to retrieve the next set of results (see the sketch after this parameter list). Setting a limit may return fewer than the requested amount of items (up to zero items) in the event all requested objects are filtered out, and clients should only use the presence of the continue field to determine whether more results are available. Servers may choose not to support the limit argument and will return all of the available results. If limit is specified and the continue field is empty, clients may assume that no more results are available. This field is not supported if watch is true.
The server guarantees that the objects returned when using continue will be identical to issuing a single list call without a limit - that is, no objects created, modified, or deleted after the first request is issued will be included in any subsequent continued requests. This is sometimes referred to as a consistent snapshot, and ensures that a client that is using limit to receive smaller chunks of a very large result can ensure they see all possible objects. If objects are updated during a chunked list the version of the object that was present at the time the first list result was calculated is returned.
namespace
object name and auth scope, such as for teams and projects
pretty
If 'true', then the output is pretty printed.
propagationPolicy
Whether and how garbage collection will be performed. Either this field or OrphanDependents may be set, but not both. The default policy is decided by the existing finalizer set in the metadata.finalizers and the resource-specific default policy. Acceptable values are: 'Orphan' - orphan the dependents; 'Background' - allow the garbage collector to delete the dependents in the background; 'Foreground' - a cascading policy that deletes all dependents in the foreground.
resourceVersion
resourceVersion sets a constraint on what resource versions a request may be served from. See https://kubernetes.io/docs/reference/using-api/api-concepts/#resource-versions for details.
Defaults to unset
resourceVersionMatch
resourceVersionMatch determines how resourceVersion is applied to list calls. It is highly recommended that resourceVersionMatch be set for list calls where resourceVersion is set. See https://kubernetes.io/docs/reference/using-api/api-concepts/#resource-versions for details.
Defaults to unset
timeoutSeconds
Timeout for the list/watch call. This limits the duration of the call, regardless of any activity or inactivity.
watch
Watch for changes to the described resources and return them as a stream of add, update, and remove notifications. Specify resourceVersion.
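To make the limit/continue interaction described above concrete, a sketch of the list metadata a server might return for the first chunk of a large list; the token and counts are illustrative, and the token must be passed back verbatim.
# First request: GET /api/v1/pods?limit=500 (illustrative)
metadata:
  resourceVersion: "10245"             # illustrative; identical across chunks
  continue: "ENCODED-CONTINUE-TOKEN"   # opaque; send as ?continue=... to fetch the next chunk
  remainingItemCount: 123              # an estimate; may be unset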
6.6 - Kubernetes Issues and Security
6.6.1 - Kubernetes Issue Tracker
To report a security issue, please follow the Kubernetes security disclosure process.
Work on Kubernetes code and public issues is tracked using GitHub Issues.
Security-related announcements are sent to the kubernetes-security-announce@googlegroups.com mailing list.
6.6.2 - Kubernetes Security and Disclosure Information
This page describes Kubernetes security and disclosure information.
Security Announcements
Join the kubernetes-security-announce group for emails about security and major API announcements.
Report a Vulnerability
We're extremely grateful to the security researchers and users who report vulnerabilities to the Kubernetes Open Source Community. All reports are thoroughly investigated by a set of community volunteers.
To make a report, submit your vulnerability to the Kubernetes bug bounty program. This allows triage and handling of the vulnerability with standardized response times.
You can also email the private security@kubernetes.io list with the security details and the details expected for all Kubernetes bug reports.
You may encrypt your email to this list using the GPG keys of the Security Response Committee members. Encryption using GPG is NOT required to make a disclosure.
When Should I Report a Vulnerability?
- You think you discovered a potential security vulnerability in Kubernetes
- You are unsure how a vulnerability affects Kubernetes
- You think you discovered a vulnerability in another project that Kubernetes depends on
- For projects with their own vulnerability reporting and disclosure process, please report it directly there
When Should I NOT Report a Vulnerability?
- You need help tuning Kubernetes components for security
- You need help applying security related updates
- Your issue is not security related
Security Vulnerability Response
Each report is acknowledged and analyzed by Security Response Committee members within 3 working days. This will set off the Security Release Process.
Any vulnerability information shared with the Security Response Committee stays within the Kubernetes project and will not be disseminated to other projects unless it is necessary to get the issue fixed.
As the security issue moves from triage, to identified fix, to release planning we will keep the reporter updated.
Public Disclosure Timing
A public disclosure date is negotiated by the Kubernetes Security Response Committee and the bug submitter. We prefer to fully disclose the bug as soon as possible once a user mitigation is available. It is reasonable to delay disclosure when the bug or the fix is not yet fully understood, the solution is not well-tested, or for vendor coordination. The timeframe for disclosure is from immediate (especially if it's already publicly known) to a few weeks. For a vulnerability with a straightforward mitigation, we expect report date to disclosure date to be on the order of 7 days. The Kubernetes Security Response Committee holds the final say when setting a disclosure date.
6.7 - Node Reference Information
6.7.1 - Articles on dockershim Removal and on Using CRI-compatible Runtimes
This is a list of articles and other pages that are either about the Kubernetes project's deprecation and removal of dockershim, or about using CRI-compatible container runtimes in connection with that removal.
Kubernetes project
-
Kubernetes blog: Dockershim Removal FAQ (originally published 2022/02/17)
-
Kubernetes blog: Kubernetes is Moving on From Dockershim: Commitments and Next Steps (published 2022/01/07)
-
Kubernetes blog: Dockershim removal is coming. Are you ready? (published 2021/11/12)
-
Kubernetes documentation: Migrating from dockershim
-
Kubernetes documentation: Container runtimes
-
Kubernetes enhancement proposal: KEP-2221: Removing dockershim from kubelet
-
Kubernetes enhancement proposal issue: Removing dockershim from kubelet (k/enhancements#2221)
You can provide feedback via the GitHub issue Dockershim removal feedback & issues.
External sources
-
Amazon Web Services EKS documentation: Dockershim deprecation
-
CNCF conference video: Lessons Learned Migrating Kubernetes from Docker to containerd Runtime (Ana Caylin, at KubeCon Europe 2019)
-
Docker.com blog: What developers need to know about Docker, Docker Engine, and Kubernetes v1.20 (published 2020/12/04)
-
"Google Open Source" channel on YouTube: Learn Kubernetes with Google - Migrating from Dockershim to Containerd
-
Microsoft Apps on Azure blog: Dockershim deprecation and AKS (published 2022/01/21)
-
Mirantis blog: The Future of Dockershim is cri-dockerd (published 2021/04/21)
-
Mirantis: Mirantis/cri-dockerd Git repository (on GitHub)
-
Tripwire: How Dockershim’s Forthcoming Deprecation Affects Your Kubernetes
6.8 - Ports and Protocols
When running Kubernetes in an environment with strict network boundaries, such as an on-premises datacenter with physical network firewalls or virtual networks in a public cloud, it is useful to be aware of the ports and protocols used by Kubernetes components.
Control plane
Protocol | Direction | Port Range | Purpose | Used By |
---|---|---|---|---|
TCP | Inbound | 6443 | Kubernetes API server | All |
TCP | Inbound | 2379-2380 | etcd server client API | kube-apiserver, etcd |
TCP | Inbound | 10250 | Kubelet API | Self, Control plane |
TCP | Inbound | 10259 | kube-scheduler | Self |
TCP | Inbound | 10257 | kube-controller-manager | Self |
Although the etcd ports are included in the control plane section, you can also host your own etcd cluster externally or on custom ports.
Worker node(s)
Protocol | Direction | Port Range | Purpose | Used By |
---|---|---|---|---|
TCP | Inbound | 10250 | Kubelet API | Self, Control plane |
TCP | Inbound | 30000-32767 | NodePort Services† | All |
† Default port range for NodePort Services.
All default port numbers can be overridden. When custom ports are used, those ports need to be open instead of the defaults mentioned here.
One common example is the API server port, which is sometimes switched to 443. Alternatively, the default port is kept as-is and the API server is put behind a load balancer that listens on 443 and routes the requests to the API server on the default port, as sketched below.
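As a sketch of that 443 example, a kubeadm configuration can override the API server's bind port at init time (assuming the kubeadm.k8s.io/v1beta3 config schema; the address is illustrative).
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
localAPIEndpoint:
  advertiseAddress: 10.0.0.10  # illustrative control-plane address
  bindPort: 443                # instead of the default 6443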
6.9 - Setup tools
6.9.1 - Kubeadm
Kubeadm is a tool built to provide kubeadm init and kubeadm join as best-practice "fast paths" for creating Kubernetes clusters.
kubeadm performs the actions necessary to get a minimum viable cluster up and running. By design, it cares only about bootstrapping, not about provisioning machines. Likewise, installing various nice-to-have addons, like the Kubernetes Dashboard, monitoring solutions, and cloud-specific addons, is not in scope.
Instead, we expect higher-level and more tailored tooling to be built on top of kubeadm, and ideally, using kubeadm as the basis of all deployments will make it easier to create conformant clusters.
How to install
To install kubeadm, see the installation guide.
What's next
- kubeadm init to bootstrap a Kubernetes control-plane node
- kubeadm join to bootstrap a Kubernetes worker node and join it to the cluster
- kubeadm upgrade to upgrade a Kubernetes cluster to a newer version
- kubeadm config if you initialized your cluster using kubeadm v1.7.x or lower, to configure your cluster for kubeadm upgrade
- kubeadm token to manage tokens for kubeadm join
- kubeadm reset to revert any changes made to this host by kubeadm init or kubeadm join
- kubeadm certs to manage Kubernetes certificates
- kubeadm kubeconfig to manage kubeconfig files
- kubeadm version to print the kubeadm version
- kubeadm alpha to preview a set of features made available for gathering feedback from the community
6.9.1.1 - Kubeadm Generated
6.9.1.1.1 -
kubeadm: easily bootstrap a secure Kubernetes cluster
Synopsis
┌──────────────────────────────────────────────────────────┐
│ KUBEADM │
│ Easily bootstrap a secure Kubernetes cluster │
│ │
│ Please give us feedback at: │
│ https://github.com/kubernetes/kubeadm/issues │
└──────────────────────────────────────────────────────────┘
Example usage:
Create a two-machine cluster with one control-plane node
(which controls the cluster), and one worker node
(where your workloads, like Pods and Deployments run).
┌──────────────────────────────────────────────────────────┐
│ On the first machine: │
├──────────────────────────────────────────────────────────┤
│ control-plane# kubeadm init │
└──────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────┐
│ On the second machine: │
├──────────────────────────────────────────────────────────┤
│ worker# kubeadm join <arguments-returned-from-init> │
└──────────────────────────────────────────────────────────┘
You can then repeat the second step on as many other machines as you like.
Options
Flag | Description |
---|---|
-h, --help | help for kubeadm |
--rootfs string | [EXPERIMENTAL] The path to the 'real' host root filesystem. |
6.9.1.1.2 -
Commands related to handling kubernetes certificates
Synopsis
Commands related to handling kubernetes certificates
Options
Flag | Description |
---|---|
-h, --help | help for certs |
Options inherited from parent commands
Flag | Description |
---|---|
--rootfs string | [EXPERIMENTAL] The path to the 'real' host root filesystem. |
6.9.1.1.3 -
Generate certificate keys
Synopsis
This command will print out a secure randomly-generated certificate key that can be used with the "init" command.
You can also use "kubeadm init --upload-certs" without specifying a certificate key and it will generate and print one for you.
kubeadm certs certificate-key [flags]
Options
Flag | Description |
---|---|
-h, --help | help for certificate-key |
Options inherited from parent commands
Flag | Description |
---|---|
--rootfs string | [EXPERIMENTAL] The path to the 'real' host root filesystem. |
6.9.1.1.4 -
Check certificates expiration for a Kubernetes cluster
Synopsis
Checks expiration for the certificates in the local PKI managed by kubeadm.
kubeadm certs check-expiration [flags]
Options
Flag | Description |
---|---|
--cert-dir string Default: "/etc/kubernetes/pki" | The path where to save the certificates |
--config string | Path to a kubeadm configuration file. |
-h, --help | help for check-expiration |
--kubeconfig string Default: "/etc/kubernetes/admin.conf" | The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file. |
Options inherited from parent commands
Flag | Description |
---|---|
--rootfs string | [EXPERIMENTAL] The path to the 'real' host root filesystem. |
6.9.1.1.5 -
Generate keys and certificate signing requests
Synopsis
Generates keys and certificate signing requests (CSRs) for all the certificates required to run the control plane. This command also generates partial kubeconfig files with private key data in the "users > user > client-key-data" field, and for each kubeconfig file an accompanying ".csr" file is created.
This command is designed for use in Kubeadm External CA Mode. It generates CSRs which you can then submit to your external certificate authority for signing.
The PEM encoded signed certificates should then be saved alongside the key files, using ".crt" as the file extension, or in the case of kubeconfig files, the PEM encoded signed certificate should be base64 encoded and added to the kubeconfig file in the "users > user > client-certificate-data" field.
kubeadm certs generate-csr [flags]
Examples
# The following command will generate keys and CSRs for all control-plane certificates and kubeconfig files:
kubeadm certs generate-csr --kubeconfig-dir /tmp/etc-k8s --cert-dir /tmp/etc-k8s/pki
Options
Flag | Description |
---|---|
--cert-dir string | The path where to save the certificates |
--config string | Path to a kubeadm configuration file. |
-h, --help | help for generate-csr |
--kubeconfig-dir string Default: "/etc/kubernetes" | The path where to save the kubeconfig file. |
Options inherited from parent commands
Flag | Description |
---|---|
--rootfs string | [EXPERIMENTAL] The path to the 'real' host root filesystem. |
6.9.1.1.6 -
Renew certificates for a Kubernetes cluster
Synopsis
This command is not meant to be run on its own. See list of available subcommands.
kubeadm certs renew [flags]
Options
Flag | Description |
---|---|
-h, --help | help for renew |
Options inherited from parent commands
Flag | Description |
---|---|
--rootfs string | [EXPERIMENTAL] The path to the 'real' host root filesystem. |
6.9.1.1.7 -
Renew the certificate embedded in the kubeconfig file for the admin to use and for kubeadm itself
Synopsis
Renew the certificate embedded in the kubeconfig file for the admin to use and for kubeadm itself.
Renewals run unconditionally, regardless of certificate expiration date; extra attributes such as SANs will be based on the existing file/certificates, there is no need to resupply them.
Renewal by default tries to use the certificate authority in the local PKI managed by kubeadm; as an alternative, it is possible to use the Kubernetes certificate API for certificate renewal, or, as a last option, to generate a CSR request.
After renewal, to make the changes effective, you must restart the control-plane components and, if the file is used elsewhere, re-distribute the renewed certificate.
kubeadm certs renew admin.conf [flags]
Options
Flag | Description |
---|---|
--cert-dir string Default: "/etc/kubernetes/pki" | The path where to save the certificates |
--config string | Path to a kubeadm configuration file. |
-h, --help | help for admin.conf |
--kubeconfig string Default: "/etc/kubernetes/admin.conf" | The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file. |
Options inherited from parent commands
Flag | Description |
---|---|
--rootfs string | [EXPERIMENTAL] The path to the 'real' host root filesystem. |
6.9.1.1.8 -
Renew all available certificates
Synopsis
Renew all known certificates necessary to run the control plane. Renewals are run unconditionally, regardless of expiration date. Renewals can also be run individually for more control.
kubeadm certs renew all [flags]
Options
Flag | Description |
---|---|
--cert-dir string Default: "/etc/kubernetes/pki" | The path where to save the certificates |
--config string | Path to a kubeadm configuration file. |
-h, --help | help for all |
--kubeconfig string Default: "/etc/kubernetes/admin.conf" | The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file. |
Options inherited from parent commands
Flag | Description |
---|---|
--rootfs string | [EXPERIMENTAL] The path to the 'real' host root filesystem. |
6.9.1.1.9 -
Renew the certificate the apiserver uses to access etcd
Synopsis
Renew the certificate the apiserver uses to access etcd.
Renewals run unconditionally, regardless of certificate expiration date; extra attributes such as SANs will be based on the existing file/certificates, there is no need to resupply them.
Renewal by default tries to use the certificate authority in the local PKI managed by kubeadm; as an alternative, it is possible to use the Kubernetes certificate API for certificate renewal, or, as a last option, to generate a CSR request.
After renewal, to make the changes effective, you must restart the control-plane components and, if the file is used elsewhere, re-distribute the renewed certificate.
kubeadm certs renew apiserver-etcd-client [flags]
Options
Flag | Description |
---|---|
--cert-dir string Default: "/etc/kubernetes/pki" | The path where to save the certificates |
--config string | Path to a kubeadm configuration file. |
-h, --help | help for apiserver-etcd-client |
--kubeconfig string Default: "/etc/kubernetes/admin.conf" | The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file. |
Options inherited from parent commands
Flag | Description |
---|---|
--rootfs string | [EXPERIMENTAL] The path to the 'real' host root filesystem. |
6.9.1.1.10 -
Renew the certificate for the API server to connect to kubelet
Synopsis
Renew the certificate for the API server to connect to kubelet.
Renewals run unconditionally, regardless of certificate expiration date; extra attributes such as SANs will be based on the existing file/certificates, there is no need to resupply them.
Renewal by default tries to use the certificate authority in the local PKI managed by kubeadm; as an alternative, it is possible to use the Kubernetes certificate API for certificate renewal, or, as a last option, to generate a CSR request.
After renewal, to make the changes effective, you must restart the control-plane components and, if the file is used elsewhere, re-distribute the renewed certificate.
kubeadm certs renew apiserver-kubelet-client [flags]
Options
Flag | Description |
---|---|
--cert-dir string Default: "/etc/kubernetes/pki" | The path where to save the certificates |
--config string | Path to a kubeadm configuration file. |
-h, --help | help for apiserver-kubelet-client |
--kubeconfig string Default: "/etc/kubernetes/admin.conf" | The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file. |
Options inherited from parent commands
Flag | Description |
---|---|
--rootfs string | [EXPERIMENTAL] The path to the 'real' host root filesystem. |
6.9.1.1.11 -
Renew the certificate for serving the Kubernetes API
Synopsis
Renew the certificate for serving the Kubernetes API.
Renewals run unconditionally, regardless of certificate expiration date; extra attributes such as SANs will be based on the existing file/certificates, there is no need to resupply them.
Renewal by default tries to use the certificate authority in the local PKI managed by kubeadm; as an alternative, it is possible to use the Kubernetes certificate API for certificate renewal, or, as a last option, to generate a CSR request.
After renewal, to make the changes effective, you must restart the control-plane components and, if the file is used elsewhere, re-distribute the renewed certificate.
kubeadm certs renew apiserver [flags]
Options
Flag | Description |
---|---|
--cert-dir string Default: "/etc/kubernetes/pki" | The path where to save the certificates |
--config string | Path to a kubeadm configuration file. |
-h, --help | help for apiserver |
--kubeconfig string Default: "/etc/kubernetes/admin.conf" | The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file. |
Options inherited from parent commands
Flag | Description |
---|---|
--rootfs string | [EXPERIMENTAL] The path to the 'real' host root filesystem. |
6.9.1.1.12 -
Renew the certificate embedded in the kubeconfig file for the controller manager to use
Synopsis
Renew the certificate embedded in the kubeconfig file for the controller manager to use.
Renewals run unconditionally, regardless of certificate expiration date; extra attributes such as SANs will be based on the existing file/certificates, there is no need to resupply them.
Renewal by default tries to use the certificate authority in the local PKI managed by kubeadm; as an alternative, it is possible to use the Kubernetes certificate API for certificate renewal, or, as a last option, to generate a CSR request.
After renewal, to make the changes effective, you must restart the control-plane components and, if the file is used elsewhere, re-distribute the renewed certificate.
kubeadm certs renew controller-manager.conf [flags]
Options
Flag | Description |
---|---|
--cert-dir string Default: "/etc/kubernetes/pki" | The path where to save the certificates |
--config string | Path to a kubeadm configuration file. |
-h, --help | help for controller-manager.conf |
--kubeconfig string Default: "/etc/kubernetes/admin.conf" | The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file. |
Options inherited from parent commands
Flag | Description |
---|---|
--rootfs string | [EXPERIMENTAL] The path to the 'real' host root filesystem. |
6.9.1.1.13 -
Renew the certificate for liveness probes to healthcheck etcd
Synopsis
Renew the certificate for liveness probes to healthcheck etcd.
Renewals run unconditionally, regardless of certificate expiration date; extra attributes such as SANs will be based on the existing file/certificates, there is no need to resupply them.
Renewal by default tries to use the certificate authority in the local PKI managed by kubeadm; as an alternative, it is possible to use the Kubernetes certificate API for certificate renewal, or, as a last option, to generate a CSR request.
After renewal, to make the changes effective, you must restart the control-plane components and, if the file is used elsewhere, re-distribute the renewed certificate.
kubeadm certs renew etcd-healthcheck-client [flags]
Options
Flag | Description |
---|---|
--cert-dir string Default: "/etc/kubernetes/pki" | The path where to save the certificates |
--config string | Path to a kubeadm configuration file. |
-h, --help | help for etcd-healthcheck-client |
--kubeconfig string Default: "/etc/kubernetes/admin.conf" | The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file. |
Options inherited from parent commands
Flag | Description |
---|---|
--rootfs string | [EXPERIMENTAL] The path to the 'real' host root filesystem. |
6.9.1.1.14 -
Renew the certificate for etcd nodes to communicate with each other
Synopsis
Renew the certificate for etcd nodes to communicate with each other.
Renewals run unconditionally, regardless of certificate expiration date; extra attributes such as SANs will be based on the existing file/certificates, there is no need to resupply them.
Renewal by default tries to use the certificate authority in the local PKI managed by kubeadm; as an alternative, it is possible to use the Kubernetes certificate API for certificate renewal, or, as a last option, to generate a CSR request.
After renewal, to make the changes effective, you must restart the control-plane components and, if the file is used elsewhere, re-distribute the renewed certificate.
kubeadm certs renew etcd-peer [flags]
Options
Flag | Description |
---|---|
--cert-dir string Default: "/etc/kubernetes/pki" | The path where to save the certificates |
--config string | Path to a kubeadm configuration file. |
-h, --help | help for etcd-peer |
--kubeconfig string Default: "/etc/kubernetes/admin.conf" | The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file. |
Options inherited from parent commands
Flag | Description |
---|---|
--rootfs string | [EXPERIMENTAL] The path to the 'real' host root filesystem. |
6.9.1.1.15 -
Renew the certificate for serving etcd
Synopsis
Renew the certificate for serving etcd.
Renewals run unconditionally, regardless of certificate expiration date; extra attributes such as SANs will be based on the existing file/certificates, there is no need to resupply them.
Renewal by default tries to use the certificate authority in the local PKI managed by kubeadm; as an alternative, it is possible to use the Kubernetes certificate API for certificate renewal, or, as a last option, to generate a CSR request.
After renewal, to make the changes effective, you must restart the control-plane components and, if the file is used elsewhere, re-distribute the renewed certificate.
kubeadm certs renew etcd-server [flags]
Options
Flag | Description |
---|---|
--cert-dir string Default: "/etc/kubernetes/pki" | The path where to save the certificates |
--config string | Path to a kubeadm configuration file. |
-h, --help | help for etcd-server |
--kubeconfig string Default: "/etc/kubernetes/admin.conf" | The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file. |
Options inherited from parent commands
Flag | Description |
---|---|
--rootfs string | [EXPERIMENTAL] The path to the 'real' host root filesystem. |
6.9.1.1.16 -
Renew the certificate for the front proxy client
Synopsis
Renew the certificate for the front proxy client.
Renewals run unconditionally, regardless of certificate expiration date; extra attributes such as SANs are based on the existing file/certificates, so there is no need to resupply them.
By default, renewal tries to use the certificate authority in the local PKI managed by kubeadm; as an alternative, it is possible to use the Kubernetes certificate API for certificate renewal, or, as a last option, to generate a CSR request.
After renewal, in order to make the changes effective, you must restart the control-plane components and, if necessary, re-distribute the renewed certificate in case the file is used elsewhere.
kubeadm certs renew front-proxy-client [flags]
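For illustration, a renewal can be paired with 'kubeadm certs check-expiration' (a separate kubeadm subcommand) to verify the new expiration date; a sketch:
# Renew the front proxy client certificate
kubeadm certs renew front-proxy-client
# Verify the resulting expiration dates
kubeadm certs check-expiration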
Options
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save the certificates
--config string
    Path to a kubeadm configuration file.
-h, --help
    help for front-proxy-client
--kubeconfig string    Default: "/etc/kubernetes/admin.conf"
    The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.17 -
Renew the certificate embedded in the kubeconfig file for the scheduler to use
Synopsis
Renew the certificate embedded in the kubeconfig file for the scheduler to use.
Renewals run unconditionally, regardless of certificate expiration date; extra attributes such as SANs are based on the existing file/certificates, so there is no need to resupply them.
By default, renewal tries to use the certificate authority in the local PKI managed by kubeadm; as an alternative, it is possible to use the Kubernetes certificate API for certificate renewal, or, as a last option, to generate a CSR request.
After renewal, in order to make the changes effective, you must restart the control-plane components and, if necessary, re-distribute the renewed certificate in case the file is used elsewhere.
kubeadm certs renew scheduler.conf [flags]
Options
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save the certificates
--config string
    Path to a kubeadm configuration file.
-h, --help
    help for scheduler.conf
--kubeconfig string    Default: "/etc/kubernetes/admin.conf"
    The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.18 -
Output shell completion code for the specified shell (bash or zsh)
Synopsis
Output shell completion code for the specified shell (bash or zsh). The shell code must be evaluated to provide interactive completion of kubeadm commands. This can be done by sourcing it from the .bash_profile.
Note: this requires the bash-completion framework.
To install it on a Mac, use Homebrew: $ brew install bash-completion
Once installed, bash_completion must be evaluated. This can be done by adding the following line to your .bash_profile: $ source $(brew --prefix)/etc/bash_completion
If bash-completion is not installed on Linux, please install the 'bash-completion' package via your distribution's package manager.
Note for zsh users: [1] zsh completions are only supported in versions of zsh >= 5.2
kubeadm completion SHELL [flags]
Examples
# Install bash completion on a Mac using homebrew
brew install bash-completion
printf "\n# Bash completion support\nsource $(brew --prefix)/etc/bash_completion\n" >> $HOME/.bash_profile
source $HOME/.bash_profile
# Load the kubeadm completion code for bash into the current shell
source <(kubeadm completion bash)
# Write bash completion code to a file and source it from .bash_profile
kubeadm completion bash > ~/.kube/kubeadm_completion.bash.inc
printf "\n# Kubeadm shell completion\nsource '$HOME/.kube/kubeadm_completion.bash.inc'\n" >> $HOME/.bash_profile
source $HOME/.bash_profile
# Load the kubeadm completion code for zsh[1] into the current shell
source <(kubeadm completion zsh)
Options
-h, --help
    help for completion
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.19 -
Manage configuration for a kubeadm cluster persisted in a ConfigMap in the cluster
Synopsis
There is a ConfigMap in the kube-system namespace called "kubeadm-config" that kubeadm uses to store internal configuration about the cluster. kubeadm CLI v1.8.0+ automatically creates this ConfigMap with the config used with 'kubeadm init', but if you initialized your cluster using kubeadm v1.7.x or lower, you must use the 'config upload' command to create this ConfigMap. This is required so that 'kubeadm upgrade' can configure your upgraded cluster correctly.
kubeadm config [flags]
Options
-h, --help
    help for config
--kubeconfig string    Default: "/etc/kubernetes/admin.conf"
    The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.20 -
Interact with container images used by kubeadm
Synopsis
Interact with container images used by kubeadm
kubeadm config images [flags]
Options
-h, --help
    help for images
Options inherited from parent commands
--kubeconfig string    Default: "/etc/kubernetes/admin.conf"
    The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.21 -
Print a list of images kubeadm will use. The configuration file is used in case any images or image repositories are customized
Synopsis
Print a list of images kubeadm will use. The configuration file is used in case any images or image repositories are customized.
kubeadm config images list [flags]
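For illustration, a couple of possible invocations (the version shown is a placeholder):
# List the images for a specific control plane version
kubeadm config images list --kubernetes-version v1.23.0
# Print the same list as JSON
kubeadm config images list -o json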
Options
--allow-missing-template-keys    Default: true
    If true, ignore any errors in templates when a field or map key is missing in the template. Only applies to golang and jsonpath output formats.
--config string
    Path to a kubeadm configuration file.
-o, --experimental-output string    Default: "text"
    Output format. One of: text|json|yaml|go-template|go-template-file|template|templatefile|jsonpath|jsonpath-as-json|jsonpath-file.
--feature-gates string
    A set of key=value pairs that describe feature gates for various features. Options are:
-h, --help
    help for list
--image-repository string    Default: "k8s.gcr.io"
    Choose a container registry to pull control plane images from
--kubernetes-version string    Default: "stable-1"
    Choose a specific Kubernetes version for the control plane.
--show-managed-fields
    If true, keep the managedFields when printing objects in JSON or YAML format.
Options inherited from parent commands
--kubeconfig string    Default: "/etc/kubernetes/admin.conf"
    The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.22 -
Pull images used by kubeadm
Synopsis
Pull images used by kubeadm
kubeadm config images pull [flags]
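For illustration, a sketch of pre-pulling from a private mirror (the registry name is a placeholder):
# Pre-pull the control plane images from a custom repository
kubeadm config images pull --image-repository registry.example.com/kubernetes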
Options
--config string
    Path to a kubeadm configuration file.
--cri-socket string
    Path to the CRI socket to connect. If empty kubeadm will try to auto-detect this value; use this option only if you have more than one CRI installed or if you have non-standard CRI socket.
--feature-gates string
    A set of key=value pairs that describe feature gates for various features. Options are:
-h, --help
    help for pull
--image-repository string    Default: "k8s.gcr.io"
    Choose a container registry to pull control plane images from
--kubernetes-version string    Default: "stable-1"
    Choose a specific Kubernetes version for the control plane.
Options inherited from parent commands
--kubeconfig string    Default: "/etc/kubernetes/admin.conf"
    The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.23 -
Read an older version of the kubeadm configuration API types from a file, and output a similar config object for the newer version
Synopsis
This command lets you convert configuration objects of older versions to the latest supported version, locally in the CLI tool without ever touching anything in the cluster. In this version of kubeadm, the following API versions are supported:
- kubeadm.k8s.io/v1beta3
Further, kubeadm can only write out config of version "kubeadm.k8s.io/v1beta3", but read both types. So regardless of what version you pass to the --old-config parameter here, the API object will be read, deserialized, defaulted, converted, validated, and re-serialized when written to stdout or --new-config if specified.
In other words, the output of this command is what kubeadm actually would read internally if you submitted this file to "kubeadm init".
kubeadm config migrate [flags]
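For illustration, a sketch (the file names are placeholders):
# Convert an old configuration file and write the result to a new file
kubeadm config migrate --old-config old-config.yaml --new-config new-config.yaml
# Or print the converted configuration to stdout
kubeadm config migrate --old-config old-config.yaml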
Options
-h, --help
    help for migrate
--new-config string
    Path to the resulting equivalent kubeadm config file using the new API version. Optional, if not specified output will be sent to STDOUT.
--old-config string
    Path to the kubeadm config file that is using an old API version and should be converted. This flag is mandatory.
Options inherited from parent commands
--kubeconfig string    Default: "/etc/kubernetes/admin.conf"
    The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.24 -
Print configuration
Synopsis
This command prints configurations for the subcommands provided. For details, see: https://pkg.go.dev/k8s.io/kubernetes/cmd/kubeadm/app/apis/kubeadm#section-directories
kubeadm config print [flags]
Options
-h, --help
    help for print
Options inherited from parent commands
--kubeconfig string    Default: "/etc/kubernetes/admin.conf"
    The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.25 -
Print the default init configuration, which can be used for 'kubeadm init'
Synopsis
This command prints objects such as the default init configuration that is used for 'kubeadm init'.
Note that sensitive values like the Bootstrap Token fields are replaced with placeholder values like "abcdef.0123456789abcdef" in order to pass validation but not perform the real computation for creating a token.
kubeadm config print init-defaults [flags]
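For illustration, a sketch:
# Print the default init configuration
kubeadm config print init-defaults
# Also print the default kubelet component configuration
kubeadm config print init-defaults --component-configs KubeletConfiguration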
Options
--component-configs strings
    A comma-separated list for component config API objects to print the default values for. Available values: [KubeProxyConfiguration KubeletConfiguration]. If this flag is not set, no component configs will be printed.
-h, --help
    help for init-defaults
Options inherited from parent commands
--kubeconfig string    Default: "/etc/kubernetes/admin.conf"
    The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.26 -
Print the default join configuration, which can be used for 'kubeadm join'
Synopsis
This command prints objects such as the default join configuration that is used for 'kubeadm join'.
Note that sensitive values like the Bootstrap Token fields are replaced with placeholder values like "abcdef.0123456789abcdef" in order to pass validation but not perform the real computation for creating a token.
kubeadm config print join-defaults [flags]
Options
--component-configs strings
    A comma-separated list for component config API objects to print the default values for. Available values: [KubeProxyConfiguration KubeletConfiguration]. If this flag is not set, no component configs will be printed.
-h, --help
    help for join-defaults
Options inherited from parent commands
--kubeconfig string    Default: "/etc/kubernetes/admin.conf"
    The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.27 -
Run this command in order to set up the Kubernetes control plane
Synopsis
Run this command in order to set up the Kubernetes control plane
The "init" command executes the following phases:
preflight Run pre-flight checks
certs Certificate generation
/ca Generate the self-signed Kubernetes CA to provision identities for other Kubernetes components
/apiserver Generate the certificate for serving the Kubernetes API
/apiserver-kubelet-client Generate the certificate for the API server to connect to kubelet
/front-proxy-ca Generate the self-signed CA to provision identities for front proxy
/front-proxy-client Generate the certificate for the front proxy client
/etcd-ca Generate the self-signed CA to provision identities for etcd
/etcd-server Generate the certificate for serving etcd
/etcd-peer Generate the certificate for etcd nodes to communicate with each other
/etcd-healthcheck-client Generate the certificate for liveness probes to healthcheck etcd
/apiserver-etcd-client Generate the certificate the apiserver uses to access etcd
/sa Generate a private key for signing service account tokens along with its public key
kubeconfig Generate all kubeconfig files necessary to establish the control plane and the admin kubeconfig file
/admin Generate a kubeconfig file for the admin to use and for kubeadm itself
/kubelet Generate a kubeconfig file for the kubelet to use *only* for cluster bootstrapping purposes
/controller-manager Generate a kubeconfig file for the controller manager to use
/scheduler Generate a kubeconfig file for the scheduler to use
kubelet-start Write kubelet settings and (re)start the kubelet
control-plane Generate all static Pod manifest files necessary to establish the control plane
/apiserver Generates the kube-apiserver static Pod manifest
/controller-manager Generates the kube-controller-manager static Pod manifest
/scheduler Generates the kube-scheduler static Pod manifest
etcd Generate static Pod manifest file for local etcd
/local Generate the static Pod manifest file for a local, single-node etcd instance
upload-config Upload the kubeadm and kubelet configuration to a ConfigMap
/kubeadm Upload the kubeadm ClusterConfiguration to a ConfigMap
/kubelet Upload the kubelet component config to a ConfigMap
upload-certs Upload certificates to kubeadm-certs
mark-control-plane Mark a node as a control-plane
bootstrap-token Generates bootstrap tokens used to join a node to a cluster
kubelet-finalize Updates settings relevant to the kubelet after TLS bootstrap
/experimental-cert-rotation Enable kubelet client certificate rotation
addon Install required addons for passing conformance tests
/coredns Install the CoreDNS addon to a Kubernetes cluster
/kube-proxy Install the kube-proxy addon to a Kubernetes cluster
kubeadm init [flags]
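For illustration, a sketch of a common invocation (not part of the generated reference; the endpoint and Pod network CIDR are placeholders to adapt to your environment):
# Initialize a control plane behind a stable endpoint so that further
# control plane nodes can be joined later, and upload the certificates
kubeadm init --control-plane-endpoint "k8s-api.example.com:6443" --pod-network-cidr 10.244.0.0/16 --upload-certs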
Options
--apiserver-advertise-address string
    The IP address the API Server will advertise it's listening on. If not set the default network interface will be used.
--apiserver-bind-port int32    Default: 6443
    Port for the API Server to bind to.
--apiserver-cert-extra-sans strings
    Optional extra Subject Alternative Names (SANs) to use for the API Server serving certificate. Can be both IP addresses and DNS names.
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save and store the certificates.
--certificate-key string
    Key used to encrypt the control-plane certificates in the kubeadm-certs Secret.
--config string
    Path to a kubeadm configuration file.
--control-plane-endpoint string
    Specify a stable IP address or DNS name for the control plane.
--cri-socket string
    Path to the CRI socket to connect. If empty kubeadm will try to auto-detect this value; use this option only if you have more than one CRI installed or if you have non-standard CRI socket.
--dry-run
    Don't apply any changes; just output what would be done.
--feature-gates string
    A set of key=value pairs that describe feature gates for various features. Options are:
-h, --help
    help for init
--ignore-preflight-errors strings
    A list of checks whose errors will be shown as warnings. Example: 'IsPrivilegedUser,Swap'. Value 'all' ignores errors from all checks.
--image-repository string    Default: "k8s.gcr.io"
    Choose a container registry to pull control plane images from
--kubernetes-version string    Default: "stable-1"
    Choose a specific Kubernetes version for the control plane.
--node-name string
    Specify the node name.
--patches string
    Path to a directory that contains files named "target[suffix][+patchtype].extension". For example, "kube-apiserver0+merge.yaml" or just "etcd.json". "target" can be one of "kube-apiserver", "kube-controller-manager", "kube-scheduler", "etcd". "patchtype" can be one of "strategic", "merge" or "json" and they match the patch formats supported by kubectl. The default "patchtype" is "strategic". "extension" must be either "json" or "yaml". "suffix" is an optional string that can be used to determine which patches are applied first alpha-numerically.
--pod-network-cidr string
    Specify range of IP addresses for the pod network. If set, the control plane will automatically allocate CIDRs for every node.
--service-cidr string    Default: "10.96.0.0/12"
    Use alternative range of IP address for service VIPs.
--service-dns-domain string    Default: "cluster.local"
    Use alternative domain for services, e.g. "myorg.internal".
--skip-certificate-key-print
    Don't print the key used to encrypt the control-plane certificates.
--skip-phases strings
    List of phases to be skipped
--skip-token-print
    Skip printing of the default bootstrap token generated by 'kubeadm init'.
--token string
    The token to use for establishing bidirectional trust between nodes and control-plane nodes. The format is [a-z0-9]{6}.[a-z0-9]{16} - e.g. abcdef.0123456789abcdef
--token-ttl duration    Default: 24h0m0s
    The duration before the token is automatically deleted (e.g. 1s, 2m, 3h). If set to '0', the token will never expire
--upload-certs
    Upload control-plane certificates to the kubeadm-certs Secret.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.28 -
Use this command to invoke a single phase of the init workflow
Synopsis
Use this command to invoke a single phase of the init workflow
Options
-h, --help
    help for phase
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.29 -
Install required addons for passing conformance tests
Synopsis
This command is not meant to be run on its own. See list of available subcommands.
kubeadm init phase addon [flags]
Options
-h, --help
    help for addon
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.30 -
Install all the addons
Synopsis
Install all the addons
kubeadm init phase addon all [flags]
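For illustration, a sketch:
# Install the CoreDNS and kube-proxy addons using a configuration file
kubeadm init phase addon all --config config.yaml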
Options
--apiserver-advertise-address string
    The IP address the API Server will advertise it's listening on. If not set the default network interface will be used.
--apiserver-bind-port int32    Default: 6443
    Port for the API Server to bind to.
--config string
    Path to a kubeadm configuration file.
--control-plane-endpoint string
    Specify a stable IP address or DNS name for the control plane.
--feature-gates string
    A set of key=value pairs that describe feature gates for various features. Options are:
-h, --help
    help for all
--image-repository string    Default: "k8s.gcr.io"
    Choose a container registry to pull control plane images from
--kubeconfig string    Default: "/etc/kubernetes/admin.conf"
    The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.
--kubernetes-version string    Default: "stable-1"
    Choose a specific Kubernetes version for the control plane.
--pod-network-cidr string
    Specify range of IP addresses for the pod network. If set, the control plane will automatically allocate CIDRs for every node.
--service-cidr string    Default: "10.96.0.0/12"
    Use alternative range of IP address for service VIPs.
--service-dns-domain string    Default: "cluster.local"
    Use alternative domain for services, e.g. "myorg.internal".
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.31 -
Install the CoreDNS addon to a Kubernetes cluster
Synopsis
Install the CoreDNS addon components via the API server. Please note that although the DNS server is deployed, it will not be scheduled until CNI is installed.
kubeadm init phase addon coredns [flags]
Options
--config string
    Path to a kubeadm configuration file.
--feature-gates string
    A set of key=value pairs that describe feature gates for various features. Options are:
-h, --help
    help for coredns
--image-repository string    Default: "k8s.gcr.io"
    Choose a container registry to pull control plane images from
--kubeconfig string    Default: "/etc/kubernetes/admin.conf"
    The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.
--kubernetes-version string    Default: "stable-1"
    Choose a specific Kubernetes version for the control plane.
--service-cidr string    Default: "10.96.0.0/12"
    Use alternative range of IP address for service VIPs.
--service-dns-domain string    Default: "cluster.local"
    Use alternative domain for services, e.g. "myorg.internal".
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.32 -
Install the kube-proxy addon to a Kubernetes cluster
Synopsis
Install the kube-proxy addon components via the API server.
kubeadm init phase addon kube-proxy [flags]
Options
--apiserver-advertise-address string
    The IP address the API Server will advertise it's listening on. If not set the default network interface will be used.
--apiserver-bind-port int32    Default: 6443
    Port for the API Server to bind to.
--config string
    Path to a kubeadm configuration file.
--control-plane-endpoint string
    Specify a stable IP address or DNS name for the control plane.
-h, --help
    help for kube-proxy
--image-repository string    Default: "k8s.gcr.io"
    Choose a container registry to pull control plane images from
--kubeconfig string    Default: "/etc/kubernetes/admin.conf"
    The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.
--kubernetes-version string    Default: "stable-1"
    Choose a specific Kubernetes version for the control plane.
--pod-network-cidr string
    Specify range of IP addresses for the pod network. If set, the control plane will automatically allocate CIDRs for every node.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.33 -
Generates bootstrap tokens used to join a node to a cluster
Synopsis
Bootstrap tokens are used for establishing bidirectional trust between a node joining the cluster and a control-plane node.
This command makes all the configurations required to make bootstrap tokens work, and then creates an initial token.
kubeadm init phase bootstrap-token [flags]
Examples
# Make all the bootstrap token configurations and create an initial token, functionally
# equivalent to what is generated by 'kubeadm init'.
kubeadm init phase bootstrap-token
Options
--config string
    Path to a kubeadm configuration file.
-h, --help
    help for bootstrap-token
--kubeconfig string    Default: "/etc/kubernetes/admin.conf"
    The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.
--skip-token-print
    Skip printing of the default bootstrap token generated by 'kubeadm init'.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.34 -
Certificate generation
Synopsis
This command is not meant to be run on its own. See list of available subcommands.
kubeadm init phase certs [flags]
Options
-h, --help
    help for certs
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.35 -
Generate all certificates
Synopsis
Generate all certificates
kubeadm init phase certs all [flags]
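For illustration, a sketch:
# Generate all certificates using options read from a configuration file
kubeadm init phase certs all --config config.yaml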
Options
--apiserver-advertise-address string
    The IP address the API Server will advertise it's listening on. If not set the default network interface will be used.
--apiserver-cert-extra-sans strings
    Optional extra Subject Alternative Names (SANs) to use for the API Server serving certificate. Can be both IP addresses and DNS names.
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save and store the certificates.
--config string
    Path to a kubeadm configuration file.
--control-plane-endpoint string
    Specify a stable IP address or DNS name for the control plane.
-h, --help
    help for all
--kubernetes-version string    Default: "stable-1"
    Choose a specific Kubernetes version for the control plane.
--service-cidr string    Default: "10.96.0.0/12"
    Use alternative range of IP address for service VIPs.
--service-dns-domain string    Default: "cluster.local"
    Use alternative domain for services, e.g. "myorg.internal".
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.36 -
Generate the certificate the apiserver uses to access etcd
Synopsis
Generate the certificate the apiserver uses to access etcd, and save them into apiserver-etcd-client.crt and apiserver-etcd-client.key files.
If both files already exist, kubeadm skips the generation step and existing files will be used.
Alpha Disclaimer: this command is currently alpha.
kubeadm init phase certs apiserver-etcd-client [flags]
Options
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save and store the certificates.
--config string
    Path to a kubeadm configuration file.
-h, --help
    help for apiserver-etcd-client
--kubernetes-version string    Default: "stable-1"
    Choose a specific Kubernetes version for the control plane.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.37 -
Generate the certificate for the API server to connect to kubelet
Synopsis
Generate the certificate for the API server to connect to kubelet, and save them into apiserver-kubelet-client.crt and apiserver-kubelet-client.key files.
If both files already exist, kubeadm skips the generation step and existing files will be used.
Alpha Disclaimer: this command is currently alpha.
kubeadm init phase certs apiserver-kubelet-client [flags]
Options
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save and store the certificates.
--config string
    Path to a kubeadm configuration file.
-h, --help
    help for apiserver-kubelet-client
--kubernetes-version string    Default: "stable-1"
    Choose a specific Kubernetes version for the control plane.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.38 -
Generate the certificate for serving the Kubernetes API
Synopsis
Generate the certificate for serving the Kubernetes API, and save them into apiserver.crt and apiserver.key files.
If both files already exist, kubeadm skips the generation step and existing files will be used.
Alpha Disclaimer: this command is currently alpha.
kubeadm init phase certs apiserver [flags]
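For illustration, a sketch (the SAN values are placeholders):
# Generate the API server serving certificate with extra SANs
kubeadm init phase certs apiserver --apiserver-cert-extra-sans k8s-api.example.com,10.0.0.10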
Options
--apiserver-advertise-address string
    The IP address the API Server will advertise it's listening on. If not set the default network interface will be used.
--apiserver-cert-extra-sans strings
    Optional extra Subject Alternative Names (SANs) to use for the API Server serving certificate. Can be both IP addresses and DNS names.
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save and store the certificates.
--config string
    Path to a kubeadm configuration file.
--control-plane-endpoint string
    Specify a stable IP address or DNS name for the control plane.
-h, --help
    help for apiserver
--kubernetes-version string    Default: "stable-1"
    Choose a specific Kubernetes version for the control plane.
--service-cidr string    Default: "10.96.0.0/12"
    Use alternative range of IP address for service VIPs.
--service-dns-domain string    Default: "cluster.local"
    Use alternative domain for services, e.g. "myorg.internal".
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.39 -
Generate the self-signed Kubernetes CA to provision identities for other Kubernetes components
Synopsis
Generate the self-signed Kubernetes CA to provision identities for other Kubernetes components, and save them into ca.crt and ca.key files.
If both files already exist, kubeadm skips the generation step and existing files will be used.
Alpha Disclaimer: this command is currently alpha.
kubeadm init phase certs ca [flags]
Options
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save and store the certificates.
--config string
    Path to a kubeadm configuration file.
-h, --help
    help for ca
--kubernetes-version string    Default: "stable-1"
    Choose a specific Kubernetes version for the control plane.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.40 -
Generate the self-signed CA to provision identities for etcd
Synopsis
Generate the self-signed CA to provision identities for etcd, and save them into etcd/ca.crt and etcd/ca.key files.
If both files already exist, kubeadm skips the generation step and existing files will be used.
Alpha Disclaimer: this command is currently alpha.
kubeadm init phase certs etcd-ca [flags]
Options
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save and store the certificates.
--config string
    Path to a kubeadm configuration file.
-h, --help
    help for etcd-ca
--kubernetes-version string    Default: "stable-1"
    Choose a specific Kubernetes version for the control plane.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.41 -
Generate the certificate for liveness probes to healthcheck etcd
Synopsis
Generate the certificate for liveness probes to healthcheck etcd, and save them into etcd/healthcheck-client.crt and etcd/healthcheck-client.key files.
If both files already exist, kubeadm skips the generation step and existing files will be used.
Alpha Disclaimer: this command is currently alpha.
kubeadm init phase certs etcd-healthcheck-client [flags]
Options
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save and store the certificates.
--config string
    Path to a kubeadm configuration file.
-h, --help
    help for etcd-healthcheck-client
--kubernetes-version string    Default: "stable-1"
    Choose a specific Kubernetes version for the control plane.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.42 -
Generate the certificate for etcd nodes to communicate with each other
Synopsis
Generate the certificate for etcd nodes to communicate with each other, and save them into etcd/peer.crt and etcd/peer.key files.
Default SANs are localhost, 127.0.0.1, ::1
If both files already exist, kubeadm skips the generation step and existing files will be used.
Alpha Disclaimer: this command is currently alpha.
kubeadm init phase certs etcd-peer [flags]
Options
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save and store the certificates.
--config string
    Path to a kubeadm configuration file.
-h, --help
    help for etcd-peer
--kubernetes-version string    Default: "stable-1"
    Choose a specific Kubernetes version for the control plane.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.43 -
Generate the certificate for serving etcd
Synopsis
Generate the certificate for serving etcd, and save them into etcd/server.crt and etcd/server.key files.
Default SANs are localhost, 127.0.0.1, ::1
If both files already exist, kubeadm skips the generation step and existing files will be used.
Alpha Disclaimer: this command is currently alpha.
kubeadm init phase certs etcd-server [flags]
Options
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save and store the certificates.
--config string
    Path to a kubeadm configuration file.
-h, --help
    help for etcd-server
--kubernetes-version string    Default: "stable-1"
    Choose a specific Kubernetes version for the control plane.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.44 -
Generate the self-signed CA to provision identities for front proxy
Synopsis
Generate the self-signed CA to provision identities for front proxy, and save them into front-proxy-ca.crt and front-proxy-ca.key files.
If both files already exist, kubeadm skips the generation step and existing files will be used.
Alpha Disclaimer: this command is currently alpha.
kubeadm init phase certs front-proxy-ca [flags]
Options
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save and store the certificates.
--config string
    Path to a kubeadm configuration file.
-h, --help
    help for front-proxy-ca
--kubernetes-version string    Default: "stable-1"
    Choose a specific Kubernetes version for the control plane.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.45 -
Generate the certificate for the front proxy client
Synopsis
Generate the certificate for the front proxy client, and save them into front-proxy-client.crt and front-proxy-client.key files.
If both files already exist, kubeadm skips the generation step and existing files will be used.
Alpha Disclaimer: this command is currently alpha.
kubeadm init phase certs front-proxy-client [flags]
Options
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save and store the certificates.
--config string
    Path to a kubeadm configuration file.
-h, --help
    help for front-proxy-client
--kubernetes-version string    Default: "stable-1"
    Choose a specific Kubernetes version for the control plane.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.46 -
Generate a private key for signing service account tokens along with its public key
Synopsis
Generate the private key for signing service account tokens along with its public key, and save them into sa.key and sa.pub files. If both files already exist, kubeadm skips the generation step and existing files will be used.
Alpha Disclaimer: this command is currently alpha.
kubeadm init phase certs sa [flags]
Options
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save and store the certificates.
-h, --help
    help for sa
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.47 -
Generate all static Pod manifest files necessary to establish the control plane
Synopsis
This command is not meant to be run on its own. See list of available subcommands.
kubeadm init phase control-plane [flags]
Options
-h, --help
    help for control-plane
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.48 -
Generate all static Pod manifest files
Synopsis
Generate all static Pod manifest files
kubeadm init phase control-plane all [flags]
Examples
# Generates all static Pod manifest files for control plane components,
# functionally equivalent to what is generated by kubeadm init.
kubeadm init phase control-plane all
# Generates all static Pod manifest files using options read from a configuration file.
kubeadm init phase control-plane all --config config.yaml
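# Illustrative addition (not from the generated reference): generates all
# static Pod manifest files, applying any patch files found in a directory;
# the directory path is a placeholder (see --patches below).
kubeadm init phase control-plane all --patches /etc/kubernetes/patches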
Options
--apiserver-advertise-address string
    The IP address the API Server will advertise it's listening on. If not set the default network interface will be used.
--apiserver-bind-port int32    Default: 6443
    Port for the API Server to bind to.
--apiserver-extra-args <comma-separated 'key=value' pairs>
    A set of extra flags to pass to the API Server or override default ones in form of <flagname>=<value>
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save and store the certificates.
--config string
    Path to a kubeadm configuration file.
--control-plane-endpoint string
    Specify a stable IP address or DNS name for the control plane.
--controller-manager-extra-args <comma-separated 'key=value' pairs>
    A set of extra flags to pass to the Controller Manager or override default ones in form of <flagname>=<value>
--dry-run
    Don't apply any changes; just output what would be done.
--feature-gates string
    A set of key=value pairs that describe feature gates for various features. Options are:
-h, --help
    help for all
--image-repository string    Default: "k8s.gcr.io"
    Choose a container registry to pull control plane images from
--kubernetes-version string    Default: "stable-1"
    Choose a specific Kubernetes version for the control plane.
--patches string
    Path to a directory that contains files named "target[suffix][+patchtype].extension". For example, "kube-apiserver0+merge.yaml" or just "etcd.json". "target" can be one of "kube-apiserver", "kube-controller-manager", "kube-scheduler", "etcd". "patchtype" can be one of "strategic", "merge" or "json" and they match the patch formats supported by kubectl. The default "patchtype" is "strategic". "extension" must be either "json" or "yaml". "suffix" is an optional string that can be used to determine which patches are applied first alpha-numerically.
--pod-network-cidr string
    Specify range of IP addresses for the pod network. If set, the control plane will automatically allocate CIDRs for every node.
--scheduler-extra-args <comma-separated 'key=value' pairs>
    A set of extra flags to pass to the Scheduler or override default ones in form of <flagname>=<value>
--service-cidr string    Default: "10.96.0.0/12"
    Use alternative range of IP address for service VIPs.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.49 -
Generates the kube-apiserver static Pod manifest
Synopsis
Generates the kube-apiserver static Pod manifest
kubeadm init phase control-plane apiserver [flags]
Options
--apiserver-advertise-address string
    The IP address the API Server will advertise it's listening on. If not set the default network interface will be used.
--apiserver-bind-port int32    Default: 6443
    Port for the API Server to bind to.
--apiserver-extra-args <comma-separated 'key=value' pairs>
    A set of extra flags to pass to the API Server or override default ones in form of <flagname>=<value>
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save and store the certificates.
--config string
    Path to a kubeadm configuration file.
--control-plane-endpoint string
    Specify a stable IP address or DNS name for the control plane.
--dry-run
    Don't apply any changes; just output what would be done.
--feature-gates string
    A set of key=value pairs that describe feature gates for various features. Options are:
-h, --help
    help for apiserver
--image-repository string    Default: "k8s.gcr.io"
    Choose a container registry to pull control plane images from
--kubernetes-version string    Default: "stable-1"
    Choose a specific Kubernetes version for the control plane.
--patches string
    Path to a directory that contains files named "target[suffix][+patchtype].extension". For example, "kube-apiserver0+merge.yaml" or just "etcd.json". "target" can be one of "kube-apiserver", "kube-controller-manager", "kube-scheduler", "etcd". "patchtype" can be one of "strategic", "merge" or "json" and they match the patch formats supported by kubectl. The default "patchtype" is "strategic". "extension" must be either "json" or "yaml". "suffix" is an optional string that can be used to determine which patches are applied first alpha-numerically.
--service-cidr string    Default: "10.96.0.0/12"
    Use alternative range of IP address for service VIPs.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.50 -
Generates the kube-controller-manager static Pod manifest
Synopsis
Generates the kube-controller-manager static Pod manifest
kubeadm init phase control-plane controller-manager [flags]
Options
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save and store the certificates.
--config string
    Path to a kubeadm configuration file.
--controller-manager-extra-args <comma-separated 'key=value' pairs>
    A set of extra flags to pass to the Controller Manager or override default ones in form of <flagname>=<value>
--dry-run
    Don't apply any changes; just output what would be done.
-h, --help
    help for controller-manager
--image-repository string    Default: "k8s.gcr.io"
    Choose a container registry to pull control plane images from
--kubernetes-version string    Default: "stable-1"
    Choose a specific Kubernetes version for the control plane.
--patches string
    Path to a directory that contains files named "target[suffix][+patchtype].extension". For example, "kube-apiserver0+merge.yaml" or just "etcd.json". "target" can be one of "kube-apiserver", "kube-controller-manager", "kube-scheduler", "etcd". "patchtype" can be one of "strategic", "merge" or "json" and they match the patch formats supported by kubectl. The default "patchtype" is "strategic". "extension" must be either "json" or "yaml". "suffix" is an optional string that can be used to determine which patches are applied first alpha-numerically.
--pod-network-cidr string
    Specify range of IP addresses for the pod network. If set, the control plane will automatically allocate CIDRs for every node.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.51 -
Generates the kube-scheduler static Pod manifest
Synopsis
Generates the kube-scheduler static Pod manifest
kubeadm init phase control-plane scheduler [flags]
Options
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save and store the certificates.
--config string
    Path to a kubeadm configuration file.
--dry-run
    Don't apply any changes; just output what would be done.
-h, --help
    help for scheduler
--image-repository string    Default: "k8s.gcr.io"
    Choose a container registry to pull control plane images from
--kubernetes-version string    Default: "stable-1"
    Choose a specific Kubernetes version for the control plane.
--patches string
    Path to a directory that contains files named "target[suffix][+patchtype].extension". For example, "kube-apiserver0+merge.yaml" or just "etcd.json". "target" can be one of "kube-apiserver", "kube-controller-manager", "kube-scheduler", "etcd". "patchtype" can be one of "strategic", "merge" or "json" and they match the patch formats supported by kubectl. The default "patchtype" is "strategic". "extension" must be either "json" or "yaml". "suffix" is an optional string that can be used to determine which patches are applied first alpha-numerically.
--scheduler-extra-args <comma-separated 'key=value' pairs>
    A set of extra flags to pass to the Scheduler or override default ones in form of <flagname>=<value>
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.52 -
Generate static Pod manifest file for local etcd
Synopsis
This command is not meant to be run on its own. See list of available subcommands.
kubeadm init phase etcd [flags]
Options
-h, --help
    help for etcd
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.53 -
Generate the static Pod manifest file for a local, single-node etcd instance
Synopsis
Generate the static Pod manifest file for a local, single-node etcd instance
kubeadm init phase etcd local [flags]
Examples
# Generates the static Pod manifest file for etcd, functionally
# equivalent to what is generated by kubeadm init.
kubeadm init phase etcd local
# Generates the static Pod manifest file for etcd using options
# read from a configuration file.
kubeadm init phase etcd local --config config.yaml
Options
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save and store the certificates.
--config string
    Path to a kubeadm configuration file.
-h, --help
    help for local
--image-repository string    Default: "k8s.gcr.io"
    Choose a container registry to pull control plane images from
--patches string
    Path to a directory that contains files named "target[suffix][+patchtype].extension". For example, "kube-apiserver0+merge.yaml" or just "etcd.json". "target" can be one of "kube-apiserver", "kube-controller-manager", "kube-scheduler", "etcd". "patchtype" can be one of "strategic", "merge" or "json" and they match the patch formats supported by kubectl. The default "patchtype" is "strategic". "extension" must be either "json" or "yaml". "suffix" is an optional string that can be used to determine which patches are applied first alpha-numerically.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.54 -
Generate all kubeconfig files necessary to establish the control plane and the admin kubeconfig file
Synopsis
This command is not meant to be run on its own. See list of available subcommands.
kubeadm init phase kubeconfig [flags]
Options
-h, --help
    help for kubeconfig
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.55 -
Generate a kubeconfig file for the admin to use and for kubeadm itself
Synopsis
Generate the kubeconfig file for the admin and for kubeadm itself, and save it to admin.conf file.
kubeadm init phase kubeconfig admin [flags]
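For illustration, a sketch (the directory is a placeholder):
# Write admin.conf to an alternative directory
kubeadm init phase kubeconfig admin --kubeconfig-dir /tmp/kubeconfigs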
Options
--apiserver-advertise-address string
    The IP address the API Server will advertise it's listening on. If not set the default network interface will be used.
--apiserver-bind-port int32    Default: 6443
    Port for the API Server to bind to.
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save and store the certificates.
--config string
    Path to a kubeadm configuration file.
--control-plane-endpoint string
    Specify a stable IP address or DNS name for the control plane.
-h, --help
    help for admin
--kubeconfig-dir string    Default: "/etc/kubernetes"
    The path where to save the kubeconfig file.
--kubernetes-version string    Default: "stable-1"
    Choose a specific Kubernetes version for the control plane.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.56 -
Generate all kubeconfig files
Synopsis
Generate all kubeconfig files
kubeadm init phase kubeconfig all [flags]
Options
--apiserver-advertise-address string
    The IP address the API Server will advertise it's listening on. If not set the default network interface will be used.
--apiserver-bind-port int32    Default: 6443
    Port for the API Server to bind to.
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save and store the certificates.
--config string
    Path to a kubeadm configuration file.
--control-plane-endpoint string
    Specify a stable IP address or DNS name for the control plane.
-h, --help
    help for all
--kubeconfig-dir string    Default: "/etc/kubernetes"
    The path where to save the kubeconfig file.
--kubernetes-version string    Default: "stable-1"
    Choose a specific Kubernetes version for the control plane.
--node-name string
    Specify the node name.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.57 -
Generate a kubeconfig file for the controller manager to use
Synopsis
Generate the kubeconfig file for the controller manager to use and save it to controller-manager.conf file
kubeadm init phase kubeconfig controller-manager [flags]
Options
--apiserver-advertise-address string
    The IP address the API Server will advertise it's listening on. If not set the default network interface will be used.
--apiserver-bind-port int32    Default: 6443
    Port for the API Server to bind to.
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save and store the certificates.
--config string
    Path to a kubeadm configuration file.
--control-plane-endpoint string
    Specify a stable IP address or DNS name for the control plane.
-h, --help
    help for controller-manager
--kubeconfig-dir string    Default: "/etc/kubernetes"
    The path where to save the kubeconfig file.
--kubernetes-version string    Default: "stable-1"
    Choose a specific Kubernetes version for the control plane.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.58 -
Generate a kubeconfig file for the kubelet to use only for cluster bootstrapping purposes
Synopsis
Generate the kubeconfig file for the kubelet to use and save it to kubelet.conf file.
Please note that this should only be used for cluster bootstrapping purposes. After your control plane is up, you should request all kubelet credentials from the CSR API.
kubeadm init phase kubeconfig kubelet [flags]
Options
--apiserver-advertise-address string
    The IP address the API Server will advertise it's listening on. If not set the default network interface will be used.
--apiserver-bind-port int32    Default: 6443
    Port for the API Server to bind to.
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save and store the certificates.
--config string
    Path to a kubeadm configuration file.
--control-plane-endpoint string
    Specify a stable IP address or DNS name for the control plane.
-h, --help
    help for kubelet
--kubeconfig-dir string    Default: "/etc/kubernetes"
    The path where to save the kubeconfig file.
--kubernetes-version string    Default: "stable-1"
    Choose a specific Kubernetes version for the control plane.
--node-name string
    Specify the node name.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.59 -
Generate a kubeconfig file for the scheduler to use
Synopsis
Generate the kubeconfig file for the scheduler to use and save it to scheduler.conf file.
kubeadm init phase kubeconfig scheduler [flags]
Options
--apiserver-advertise-address string
    The IP address the API Server will advertise it's listening on. If not set the default network interface will be used.
--apiserver-bind-port int32    Default: 6443
    Port for the API Server to bind to.
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save and store the certificates.
--config string
    Path to a kubeadm configuration file.
--control-plane-endpoint string
    Specify a stable IP address or DNS name for the control plane.
-h, --help
    help for scheduler
--kubeconfig-dir string    Default: "/etc/kubernetes"
    The path where to save the kubeconfig file.
--kubernetes-version string    Default: "stable-1"
    Choose a specific Kubernetes version for the control plane.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.60 -
Updates settings relevant to the kubelet after TLS bootstrap
Synopsis
Updates settings relevant to the kubelet after TLS bootstrap
kubeadm init phase kubelet-finalize [flags]
Examples
# Updates settings relevant to the kubelet after TLS bootstrap
kubeadm init phase kubelet-finalize all --config config.yaml
Options
-h, --help
    help for kubelet-finalize
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.61 -
Run all kubelet-finalize phases
Synopsis
Run all kubelet-finalize phases
kubeadm init phase kubelet-finalize all [flags]
Examples
# Updates settings relevant to the kubelet after TLS bootstrap
kubeadm init phase kubelet-finalize all --config config.yaml
Options
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save and store the certificates.
--config string
    Path to a kubeadm configuration file.
-h, --help
    help for all
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.62 -
Enable kubelet client certificate rotation
Synopsis
Enable kubelet client certificate rotation
kubeadm init phase kubelet-finalize experimental-cert-rotation [flags]
Options
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save and store the certificates.
--config string
    Path to a kubeadm configuration file.
-h, --help
    help for experimental-cert-rotation
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.63 -
Write kubelet settings and (re)start the kubelet
Synopsis
Write a file with KubeletConfiguration and an environment file with node specific kubelet settings, and then (re)start kubelet.
kubeadm init phase kubelet-start [flags]
Examples
# Writes a dynamic environment file with kubelet flags from an InitConfiguration file.
kubeadm init phase kubelet-start --config config.yaml
Options
--config string
    Path to a kubeadm configuration file.
--cri-socket string
    Path to the CRI socket to connect. If empty kubeadm will try to auto-detect this value; use this option only if you have more than one CRI installed or if you have non-standard CRI socket.
-h, --help
    help for kubelet-start
--node-name string
    Specify the node name.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.64 -
Mark a node as a control-plane
Synopsis
Mark a node as a control-plane
kubeadm init phase mark-control-plane [flags]
Examples
# Applies control-plane label and taint to the current node, functionally equivalent to what is executed by 'kubeadm init'.
kubeadm init phase mark-control-plane --config config.yaml
# Applies control-plane label and taint to a specific node
kubeadm init phase mark-control-plane --node-name myNode
Options
--config string
    Path to a kubeadm configuration file.
-h, --help
    help for mark-control-plane
--node-name string
    Specify the node name.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.65 - Run pre-flight checks
Synopsis
Run pre-flight checks for kubeadm init.
kubeadm init phase preflight [flags]
Examples
# Run pre-flight checks for kubeadm init using a config file.
kubeadm init phase preflight --config kubeadm-config.yaml
Options
--config string
Path to a kubeadm configuration file.
-h, --help
help for preflight
--ignore-preflight-errors strings
A list of checks whose errors will be shown as warnings. Example: 'IsPrivilegedUser,Swap'. Value 'all' ignores errors from all checks.
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.66 - Upload certificates to kubeadm-certs
Synopsis
This command is not meant to be run on its own. See list of available subcommands.
kubeadm init phase upload-certs [flags]
Options
--certificate-key string
Key used to encrypt the control-plane certificates in the kubeadm-certs Secret.
--config string
Path to a kubeadm configuration file.
-h, --help
help for upload-certs
--kubeconfig string  Default: "/etc/kubernetes/admin.conf"
The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.
--skip-certificate-key-print
Don't print the key used to encrypt the control-plane certificates.
--upload-certs
Upload control-plane certificates to the kubeadm-certs Secret.
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.67 - Upload the kubeadm and kubelet configuration to a ConfigMap
Synopsis
This command is not meant to be run on its own. See list of available subcommands.
kubeadm init phase upload-config [flags]
Options
-h, --help
help for upload-config
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.68 - Upload all configuration to a config map
Synopsis
Upload all configuration to a config map
kubeadm init phase upload-config all [flags]
Options
--config string
Path to a kubeadm configuration file.
-h, --help
help for all
--kubeconfig string  Default: "/etc/kubernetes/admin.conf"
The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.69 - Upload the kubeadm ClusterConfiguration to a ConfigMap
Synopsis
Upload the kubeadm ClusterConfiguration to a ConfigMap called kubeadm-config in the kube-system namespace. This enables correct configuration of system components and a seamless user experience when upgrading.
Alternatively, you can use kubeadm config.
kubeadm init phase upload-config kubeadm [flags]
Examples
# upload the configuration of your cluster
kubeadm init phase upload-config --config=myConfig.yaml
Options
--config string
Path to a kubeadm configuration file.
-h, --help
help for kubeadm
--kubeconfig string  Default: "/etc/kubernetes/admin.conf"
The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.70 - Upload the kubelet component config to a ConfigMap
Synopsis
Upload kubelet configuration extracted from the kubeadm InitConfiguration object to a ConfigMap of the form kubelet-config-1.X in the cluster, where X is the minor version of the current (API Server) Kubernetes version.
kubeadm init phase upload-config kubelet [flags]
Examples
# Upload the kubelet configuration from the kubeadm Config file to a ConfigMap in the cluster.
kubeadm init phase upload-config kubelet --config kubeadm.yaml
Options
--config string
Path to a kubeadm configuration file.
-h, --help
help for kubelet
--kubeconfig string  Default: "/etc/kubernetes/admin.conf"
The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.71 - Run this on any machine you wish to join an existing cluster
Synopsis
When joining a kubeadm-initialized cluster, we need to establish bidirectional trust. This is split into discovery (having the Node trust the Kubernetes Control Plane) and TLS bootstrap (having the Kubernetes Control Plane trust the Node).
There are two main schemes for discovery. The first is to use a shared token along with the IP address of the API server. The second is to provide a file - a subset of the standard kubeconfig file. This file can be a local file or downloaded via an HTTPS URL. The forms are kubeadm join --discovery-token abcdef.1234567890abcdef 1.2.3.4:6443, kubeadm join --discovery-file path/to/file.conf, or kubeadm join --discovery-file https://url/file.conf. Only one form can be used. If the discovery information is loaded from a URL, HTTPS must be used. Also, in that case the host-installed CA bundle is used to verify the connection.
If you use a shared token for discovery, you should also pass the --discovery-token-ca-cert-hash flag to validate the public key of the root certificate authority (CA) presented by the Kubernetes Control Plane. The value of this flag is specified as "<hash-type>:<hex-encoded-value>", where the supported hash type is "sha256". The hash is calculated over the bytes of the Subject Public Key Info (SPKI) object (as in RFC 7469). This value is available in the output of "kubeadm init" or can be calculated using standard tools. The --discovery-token-ca-cert-hash flag may be repeated multiple times to allow more than one public key.
If you cannot know the CA public key hash ahead of time, you can pass the --discovery-token-unsafe-skip-ca-verification flag to disable this verification. This weakens the kubeadm security model since other nodes can potentially impersonate the Kubernetes Control Plane.
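As a concrete illustration, the SPKI hash can be computed from the cluster's CA certificate with standard OpenSSL tooling on a control-plane node (this sketch assumes an RSA CA key; the endpoint and token reuse the placeholder values above):
# Compute the sha256 hash of the CA's Subject Public Key Info
openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt \
  | openssl rsa -pubin -outform der 2>/dev/null \
  | openssl dgst -sha256 -hex | sed 's/^.* //'
# Join using token-based discovery pinned to that hash
kubeadm join 1.2.3.4:6443 --discovery-token abcdef.1234567890abcdef \
  --discovery-token-ca-cert-hash sha256:<hash-from-above>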
The TLS bootstrap mechanism is also driven via a shared token. This is used to temporarily authenticate with the Kubernetes Control Plane to submit a certificate signing request (CSR) for a locally created key pair. By default, kubeadm will set up the Kubernetes Control Plane to automatically approve these signing requests. This token is passed in with the --tls-bootstrap-token abcdef.1234567890abcdef flag.
Often the same token is used for both parts. In this case, the --token flag can be used instead of specifying each token individually.
The "join [api-server-endpoint]" command executes the following phases:
preflight Run join pre-flight checks
control-plane-prepare Prepare the machine for serving a control plane
/download-certs [EXPERIMENTAL] Download certificates shared among control-plane nodes from the kubeadm-certs Secret
/certs Generate the certificates for the new control plane components
/kubeconfig Generate the kubeconfig for the new control plane components
/control-plane Generate the manifests for the new control plane components
kubelet-start Write kubelet settings, certificates and (re)start the kubelet
control-plane-join Join a machine as a control plane instance
/etcd Add a new local etcd member
/update-status Register the new control-plane node into the ClusterStatus maintained in the kubeadm-config ConfigMap (DEPRECATED)
/mark-control-plane Mark a node as a control-plane
kubeadm join [api-server-endpoint] [flags]
Options
--apiserver-advertise-address string
If the node should host a new control plane instance, the IP address the API Server will advertise it's listening on. If not set the default network interface will be used.
--apiserver-bind-port int32  Default: 6443
If the node should host a new control plane instance, the port for the API Server to bind to.
--certificate-key string
Use this key to decrypt the certificate secrets uploaded by init.
--config string
Path to kubeadm config file.
--control-plane
Create a new control plane instance on this node
--cri-socket string
Path to the CRI socket to connect to. If empty kubeadm will try to auto-detect this value; use this option only if you have more than one CRI installed or if you have a non-standard CRI socket.
--discovery-file string
For file-based discovery, a file or URL from which to load cluster information.
--discovery-token string
For token-based discovery, the token used to validate cluster information fetched from the API server.
--discovery-token-ca-cert-hash strings
For token-based discovery, validate that the root CA public key matches this hash (format: "<type>:<value>").
--discovery-token-unsafe-skip-ca-verification
For token-based discovery, allow joining without --discovery-token-ca-cert-hash pinning.
--dry-run
Don't apply any changes; just output what would be done.
-h, --help
help for join
--ignore-preflight-errors strings
A list of checks whose errors will be shown as warnings. Example: 'IsPrivilegedUser,Swap'. Value 'all' ignores errors from all checks.
--node-name string
Specify the node name.
--patches string
Path to a directory that contains files named "target[suffix][+patchtype].extension". For example, "kube-apiserver0+merge.yaml" or just "etcd.json". "target" can be one of "kube-apiserver", "kube-controller-manager", "kube-scheduler", "etcd". "patchtype" can be one of "strategic", "merge" or "json" and they match the patch formats supported by kubectl. The default "patchtype" is "strategic". "extension" must be either "json" or "yaml". "suffix" is an optional string that can be used to determine which patches are applied first alpha-numerically.
--skip-phases strings
List of phases to be skipped
--tls-bootstrap-token string
Specify the token used to temporarily authenticate with the Kubernetes Control Plane while joining the node.
--token string
Use this token for both discovery-token and tls-bootstrap-token when those values are not provided.
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
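To make the --patches naming convention above concrete, here is a minimal sketch; the directory name, patch contents, and resource values are illustrative assumptions, not a prescribed configuration:
mkdir -p patches
# "kube-apiserver" is the target, "0" an ordering suffix, "strategic" the patchtype
cat > patches/kube-apiserver0+strategic.yaml <<'EOF'
spec:
  containers:
  - name: kube-apiserver
    resources:
      requests:
        cpu: "250m"
EOF
kubeadm join 1.2.3.4:6443 --token abcdef.1234567890abcdef \
  --discovery-token-ca-cert-hash sha256:<hash> --patches patches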
6.9.1.1.72 - Use this command to invoke a single phase of the join workflow
Synopsis
Use this command to invoke a single phase of the join workflow
Options
-h, --help
help for phase
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.73 - Join a machine as a control plane instance
Synopsis
Join a machine as a control plane instance
kubeadm join phase control-plane-join [flags]
Examples
# Joins a machine as a control plane instance
kubeadm join phase control-plane-join all
Options
-h, --help
help for control-plane-join
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.74 - Join a machine as a control plane instance
Synopsis
Join a machine as a control plane instance
kubeadm join phase control-plane-join all [flags]
Options
--apiserver-advertise-address string
If the node should host a new control plane instance, the IP address the API Server will advertise it's listening on. If not set the default network interface will be used.
--config string
Path to kubeadm config file.
--control-plane
Create a new control plane instance on this node
-h, --help
help for all
--node-name string
Specify the node name.
--patches string
Path to a directory that contains files named "target[suffix][+patchtype].extension". For example, "kube-apiserver0+merge.yaml" or just "etcd.json". "target" can be one of "kube-apiserver", "kube-controller-manager", "kube-scheduler", "etcd". "patchtype" can be one of "strategic", "merge" or "json" and they match the patch formats supported by kubectl. The default "patchtype" is "strategic". "extension" must be either "json" or "yaml". "suffix" is an optional string that can be used to determine which patches are applied first alpha-numerically.
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.75 - Add a new local etcd member
Synopsis
Add a new local etcd member
kubeadm join phase control-plane-join etcd [flags]
Options
--apiserver-advertise-address string
If the node should host a new control plane instance, the IP address the API Server will advertise it's listening on. If not set the default network interface will be used.
--config string
Path to kubeadm config file.
--control-plane
Create a new control plane instance on this node
-h, --help
help for etcd
--node-name string
Specify the node name.
--patches string
Path to a directory that contains files named "target[suffix][+patchtype].extension". For example, "kube-apiserver0+merge.yaml" or just "etcd.json". "target" can be one of "kube-apiserver", "kube-controller-manager", "kube-scheduler", "etcd". "patchtype" can be one of "strategic", "merge" or "json" and they match the patch formats supported by kubectl. The default "patchtype" is "strategic". "extension" must be either "json" or "yaml". "suffix" is an optional string that can be used to determine which patches are applied first alpha-numerically.
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.76 - Mark a node as a control-plane
Synopsis
Mark a node as a control-plane
kubeadm join phase control-plane-join mark-control-plane [flags]
Options
--config string
Path to kubeadm config file.
--control-plane
Create a new control plane instance on this node
-h, --help
help for mark-control-plane
--node-name string
Specify the node name.
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.77 - Register the new control-plane node into the ClusterStatus maintained in the kubeadm-config ConfigMap (DEPRECATED)
Synopsis
Register the new control-plane node into the ClusterStatus maintained in the kubeadm-config ConfigMap (DEPRECATED)
kubeadm join phase control-plane-join update-status [flags]
Options
--apiserver-advertise-address string
If the node should host a new control plane instance, the IP address the API Server will advertise it's listening on. If not set the default network interface will be used.
--config string
Path to kubeadm config file.
--control-plane
Create a new control plane instance on this node
-h, --help
help for update-status
--node-name string
Specify the node name.
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.78 - Prepare the machine for serving a control plane
Synopsis
Prepare the machine for serving a control plane
kubeadm join phase control-plane-prepare [flags]
Examples
# Prepares the machine for serving a control plane
kubeadm join phase control-plane-prepare all
Options
-h, --help
help for control-plane-prepare
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.79 - Prepare the machine for serving a control plane
Synopsis
Prepare the machine for serving a control plane
kubeadm join phase control-plane-prepare all [api-server-endpoint] [flags]
Options
--apiserver-advertise-address string
If the node should host a new control plane instance, the IP address the API Server will advertise it's listening on. If not set the default network interface will be used.
--apiserver-bind-port int32  Default: 6443
If the node should host a new control plane instance, the port for the API Server to bind to.
--certificate-key string
Use this key to decrypt the certificate secrets uploaded by init.
--config string
Path to kubeadm config file.
--control-plane
Create a new control plane instance on this node
--discovery-file string
For file-based discovery, a file or URL from which to load cluster information.
--discovery-token string
For token-based discovery, the token used to validate cluster information fetched from the API server.
--discovery-token-ca-cert-hash strings
For token-based discovery, validate that the root CA public key matches this hash (format: "<type>:<value>").
--discovery-token-unsafe-skip-ca-verification
For token-based discovery, allow joining without --discovery-token-ca-cert-hash pinning.
-h, --help
help for all
--node-name string
Specify the node name.
--patches string
Path to a directory that contains files named "target[suffix][+patchtype].extension". For example, "kube-apiserver0+merge.yaml" or just "etcd.json". "target" can be one of "kube-apiserver", "kube-controller-manager", "kube-scheduler", "etcd". "patchtype" can be one of "strategic", "merge" or "json" and they match the patch formats supported by kubectl. The default "patchtype" is "strategic". "extension" must be either "json" or "yaml". "suffix" is an optional string that can be used to determine which patches are applied first alpha-numerically.
--tls-bootstrap-token string
Specify the token used to temporarily authenticate with the Kubernetes Control Plane while joining the node.
--token string
Use this token for both discovery-token and tls-bootstrap-token when those values are not provided.
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.80 - Generate the certificates for the new control plane components
Synopsis
Generate the certificates for the new control plane components
kubeadm join phase control-plane-prepare certs [api-server-endpoint] [flags]
Options
--apiserver-advertise-address string
If the node should host a new control plane instance, the IP address the API Server will advertise it's listening on. If not set the default network interface will be used.
--config string
Path to kubeadm config file.
--control-plane
Create a new control plane instance on this node
--discovery-file string
For file-based discovery, a file or URL from which to load cluster information.
--discovery-token string
For token-based discovery, the token used to validate cluster information fetched from the API server.
--discovery-token-ca-cert-hash strings
For token-based discovery, validate that the root CA public key matches this hash (format: "<type>:<value>").
--discovery-token-unsafe-skip-ca-verification
For token-based discovery, allow joining without --discovery-token-ca-cert-hash pinning.
-h, --help
help for certs
--node-name string
Specify the node name.
--tls-bootstrap-token string
Specify the token used to temporarily authenticate with the Kubernetes Control Plane while joining the node.
--token string
Use this token for both discovery-token and tls-bootstrap-token when those values are not provided.
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.81 - Generate the manifests for the new control plane components
Synopsis
Generate the manifests for the new control plane components
kubeadm join phase control-plane-prepare control-plane [flags]
Options
--apiserver-advertise-address string
If the node should host a new control plane instance, the IP address the API Server will advertise it's listening on. If not set the default network interface will be used.
--apiserver-bind-port int32  Default: 6443
If the node should host a new control plane instance, the port for the API Server to bind to.
--config string
Path to kubeadm config file.
--control-plane
Create a new control plane instance on this node
-h, --help
help for control-plane
--patches string
Path to a directory that contains files named "target[suffix][+patchtype].extension". For example, "kube-apiserver0+merge.yaml" or just "etcd.json". "target" can be one of "kube-apiserver", "kube-controller-manager", "kube-scheduler", "etcd". "patchtype" can be one of "strategic", "merge" or "json" and they match the patch formats supported by kubectl. The default "patchtype" is "strategic". "extension" must be either "json" or "yaml". "suffix" is an optional string that can be used to determine which patches are applied first alpha-numerically.
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.82 - [EXPERIMENTAL] Download certificates shared among control-plane nodes from the kubeadm-certs Secret
Synopsis
[EXPERIMENTAL] Download certificates shared among control-plane nodes from the kubeadm-certs Secret
kubeadm join phase control-plane-prepare download-certs [api-server-endpoint] [flags]
Options
--certificate-key string
Use this key to decrypt the certificate secrets uploaded by init.
--config string
Path to kubeadm config file.
--control-plane
Create a new control plane instance on this node
--discovery-file string
For file-based discovery, a file or URL from which to load cluster information.
--discovery-token string
For token-based discovery, the token used to validate cluster information fetched from the API server.
--discovery-token-ca-cert-hash strings
For token-based discovery, validate that the root CA public key matches this hash (format: "<type>:<value>").
--discovery-token-unsafe-skip-ca-verification
For token-based discovery, allow joining without --discovery-token-ca-cert-hash pinning.
-h, --help
help for download-certs
--tls-bootstrap-token string
Specify the token used to temporarily authenticate with the Kubernetes Control Plane while joining the node.
--token string
Use this token for both discovery-token and tls-bootstrap-token when those values are not provided.
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.83 - Generate the kubeconfig for the new control plane components
Synopsis
Generate the kubeconfig for the new control plane components
kubeadm join phase control-plane-prepare kubeconfig [api-server-endpoint] [flags]
Options
--certificate-key string
Use this key to decrypt the certificate secrets uploaded by init.
--config string
Path to kubeadm config file.
--control-plane
Create a new control plane instance on this node
--discovery-file string
For file-based discovery, a file or URL from which to load cluster information.
--discovery-token string
For token-based discovery, the token used to validate cluster information fetched from the API server.
--discovery-token-ca-cert-hash strings
For token-based discovery, validate that the root CA public key matches this hash (format: "<type>:<value>").
--discovery-token-unsafe-skip-ca-verification
For token-based discovery, allow joining without --discovery-token-ca-cert-hash pinning.
-h, --help
help for kubeconfig
--tls-bootstrap-token string
Specify the token used to temporarily authenticate with the Kubernetes Control Plane while joining the node.
--token string
Use this token for both discovery-token and tls-bootstrap-token when those values are not provided.
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.84 - Write kubelet settings, certificates and (re)start the kubelet
Synopsis
Write a file with KubeletConfiguration and an environment file with node-specific kubelet settings, and then (re)start kubelet.
kubeadm join phase kubelet-start [api-server-endpoint] [flags]
Options
--config string
Path to kubeadm config file.
--cri-socket string
Path to the CRI socket to connect to. If empty kubeadm will try to auto-detect this value; use this option only if you have more than one CRI installed or if you have a non-standard CRI socket.
--discovery-file string
For file-based discovery, a file or URL from which to load cluster information.
--discovery-token string
For token-based discovery, the token used to validate cluster information fetched from the API server.
--discovery-token-ca-cert-hash strings
For token-based discovery, validate that the root CA public key matches this hash (format: "<type>:<value>").
--discovery-token-unsafe-skip-ca-verification
For token-based discovery, allow joining without --discovery-token-ca-cert-hash pinning.
-h, --help
help for kubelet-start
--node-name string
Specify the node name.
--tls-bootstrap-token string
Specify the token used to temporarily authenticate with the Kubernetes Control Plane while joining the node.
--token string
Use this token for both discovery-token and tls-bootstrap-token when those values are not provided.
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.85 - Run join pre-flight checks
Synopsis
Run pre-flight checks for kubeadm join.
kubeadm join phase preflight [api-server-endpoint] [flags]
Examples
# Run join pre-flight checks using a config file.
kubeadm join phase preflight --config kubeadm-config.yaml
Options
--apiserver-advertise-address string
If the node should host a new control plane instance, the IP address the API Server will advertise it's listening on. If not set the default network interface will be used.
--apiserver-bind-port int32  Default: 6443
If the node should host a new control plane instance, the port for the API Server to bind to.
--certificate-key string
Use this key to decrypt the certificate secrets uploaded by init.
--config string
Path to kubeadm config file.
--control-plane
Create a new control plane instance on this node
--cri-socket string
Path to the CRI socket to connect to. If empty kubeadm will try to auto-detect this value; use this option only if you have more than one CRI installed or if you have a non-standard CRI socket.
--discovery-file string
For file-based discovery, a file or URL from which to load cluster information.
--discovery-token string
For token-based discovery, the token used to validate cluster information fetched from the API server.
--discovery-token-ca-cert-hash strings
For token-based discovery, validate that the root CA public key matches this hash (format: "<type>:<value>").
--discovery-token-unsafe-skip-ca-verification
For token-based discovery, allow joining without --discovery-token-ca-cert-hash pinning.
-h, --help
help for preflight
--ignore-preflight-errors strings
A list of checks whose errors will be shown as warnings. Example: 'IsPrivilegedUser,Swap'. Value 'all' ignores errors from all checks.
--node-name string
Specify the node name.
--tls-bootstrap-token string
Specify the token used to temporarily authenticate with the Kubernetes Control Plane while joining the node.
--token string
Use this token for both discovery-token and tls-bootstrap-token when those values are not provided.
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.86 - Kubeconfig file utilities
Synopsis
Kubeconfig file utilities.
Options
-h, --help
help for kubeconfig
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.87 - Output a kubeconfig file for an additional user
Synopsis
Output a kubeconfig file for an additional user.
kubeadm kubeconfig user [flags]
Examples
# Output a kubeconfig file for an additional user named foo using a kubeadm config file bar
kubeadm kubeconfig user --client-name=foo --config=bar
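A fuller illustrative invocation, combining the flags documented below (the user name, organization, file names, and validity period are assumptions for the sketch):
# Output a certificate-based kubeconfig for user johndoe in org developers, valid for 24h
kubeadm kubeconfig user --client-name=johndoe --org=developers \
  --config=kubeadm.yaml --validity-period=24h > johndoe.conf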
Options
--client-name string
The name of the user. It will be used as the CN if client certificates are created.
--config string
Path to a kubeadm configuration file.
-h, --help
help for user
--org strings
The organizations of the client certificate. It will be used as the O if client certificates are created.
--token string
The token that should be used as the authentication mechanism for this kubeconfig, instead of client certificates.
--validity-period duration  Default: 8760h0m0s
The validity period of the client certificate. It is an offset from the current time.
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.88 - Performs a best-effort revert of changes made to this host by 'kubeadm init' or 'kubeadm join'
Synopsis
Performs a best-effort revert of changes made to this host by 'kubeadm init' or 'kubeadm join'
The "reset" command executes the following phases:
preflight Run reset pre-flight checks
remove-etcd-member Remove a local etcd member.
cleanup-node Run cleanup node.
kubeadm reset [flags]
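For example, to revert a node without the interactive confirmation prompt (an illustrative invocation using the --force flag documented below):
sudo kubeadm reset -f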
Options
--cert-dir string  Default: "/etc/kubernetes/pki"
The path to the directory where the certificates are stored. If specified, clean this directory.
--cri-socket string
Path to the CRI socket to connect to. If empty kubeadm will try to auto-detect this value; use this option only if you have more than one CRI installed or if you have a non-standard CRI socket.
-f, --force
Reset the node without prompting for confirmation.
-h, --help
help for reset
--ignore-preflight-errors strings
A list of checks whose errors will be shown as warnings. Example: 'IsPrivilegedUser,Swap'. Value 'all' ignores errors from all checks.
--kubeconfig string  Default: "/etc/kubernetes/admin.conf"
The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.
--skip-phases strings
List of phases to be skipped
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.89 - Use this command to invoke a single phase of the reset workflow
Synopsis
Use this command to invoke a single phase of the reset workflow
Options
-h, --help
help for phase
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.90 - Run cleanup node.
Synopsis
Run cleanup node.
kubeadm reset phase cleanup-node [flags]
Options
--cert-dir string  Default: "/etc/kubernetes/pki"
The path to the directory where the certificates are stored. If specified, clean this directory.
--cri-socket string
Path to the CRI socket to connect to. If empty kubeadm will try to auto-detect this value; use this option only if you have more than one CRI installed or if you have a non-standard CRI socket.
-h, --help
help for cleanup-node
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.91 - Run reset pre-flight checks
Synopsis
Run pre-flight checks for kubeadm reset.
kubeadm reset phase preflight [flags]
Options
-f, --force
Reset the node without prompting for confirmation.
-h, --help
help for preflight
--ignore-preflight-errors strings
A list of checks whose errors will be shown as warnings. Example: 'IsPrivilegedUser,Swap'. Value 'all' ignores errors from all checks.
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.92 - Remove a local etcd member.
Synopsis
Remove a local etcd member for a control plane node.
kubeadm reset phase remove-etcd-member [flags]
Options
-h, --help
help for remove-etcd-member
--kubeconfig string  Default: "/etc/kubernetes/admin.conf"
The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.93 - Manage bootstrap tokens
Synopsis
This command manages bootstrap tokens. It is optional and needed only for advanced use cases.
In short, bootstrap tokens are used for establishing bidirectional trust between a client and a server. A bootstrap token can be used when a client (for example a node that is about to join the cluster) needs to trust the server it is talking to; a bootstrap token with the "signing" usage serves this purpose. Bootstrap tokens can also function as a way to allow short-lived authentication to the API Server (the token serves as a way for the API Server to trust the client), for example for doing the TLS Bootstrap.
What is a bootstrap token more exactly?
- It is a Secret in the kube-system namespace of type "bootstrap.kubernetes.io/token".
- A bootstrap token must be of the form "[a-z0-9]{6}.[a-z0-9]{16}". The former part is the public token ID, while the latter is the Token Secret and it must be kept private under all circumstances!
- The Secret must be named "bootstrap-token-(token-id)".
You can read more about bootstrap tokens here: https://kubernetes.io/docs/admin/bootstrap-tokens/
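For instance, given a token abcdef.0123456789abcdef the token ID is abcdef, so the backing Secret can be inspected like this (the token value is an illustrative assumption):
kubectl -n kube-system get secret bootstrap-token-abcdef -o yaml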
kubeadm token [flags]
Options
--dry-run
Whether to enable dry-run mode or not
-h, --help
help for token
--kubeconfig string  Default: "/etc/kubernetes/admin.conf"
The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.94 - Create bootstrap tokens on the server
Synopsis
This command will create a bootstrap token for you. You can specify the usages for this token, the "time to live" and an optional human-friendly description.
The [token] is the actual token to write. This should be a securely generated random token of the form "[a-z0-9]{6}.[a-z0-9]{16}". If no [token] is given, kubeadm will generate a random token instead.
kubeadm token create [token]
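For example, an illustrative invocation using the flags documented below (the TTL and description are arbitrary choices):
# Create a short-lived token and print the matching 'kubeadm join' command
kubeadm token create --ttl 2h --description "temporary join token" --print-join-command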
Options
--certificate-key string
When used together with '--print-join-command', print the full 'kubeadm join' flag needed to join the cluster as a control-plane. To create a new certificate key you must use 'kubeadm init phase upload-certs --upload-certs'.
--config string
Path to a kubeadm configuration file.
--description string
A human-friendly description of how this token is used.
--groups strings  Default: "system:bootstrappers:kubeadm:default-node-token"
Extra groups that this token will authenticate as when used for authentication. Must match "\Asystem:bootstrappers:[a-z0-9:-]{0,255}[a-z0-9]\z"
-h, --help
help for create
--print-join-command
Instead of printing only the token, print the full 'kubeadm join' flag needed to join the cluster using the token.
--ttl duration  Default: 24h0m0s
The duration before the token is automatically deleted (e.g. 1s, 2m, 3h). If set to '0', the token will never expire
--usages strings  Default: "signing,authentication"
Describes the ways in which this token can be used. You can pass --usages multiple times or provide a comma separated list of options. Valid options: [signing,authentication]
Options inherited from parent commands
--dry-run
Whether to enable dry-run mode or not
--kubeconfig string  Default: "/etc/kubernetes/admin.conf"
The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.95 - Delete bootstrap tokens on the server
Synopsis
This command will delete a list of bootstrap tokens for you.
The [token-value] is the full Token of the form "[a-z0-9]{6}.[a-z0-9]{16}" or the Token ID of the form "[a-z0-9]{6}" to delete.
kubeadm token delete [token-value] ...
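For example, either form described above works (the token value is an illustrative assumption):
# Delete by full token, or by token ID alone
kubeadm token delete abcdef.0123456789abcdef
kubeadm token delete abcdef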
Options
-h, --help | |
help for delete |
Options inherited from parent commands
--dry-run | |
Whether to enable dry-run mode or not |
|
--kubeconfig string Default: "/etc/kubernetes/admin.conf" | |
The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file. |
|
--rootfs string | |
[EXPERIMENTAL] The path to the 'real' host root filesystem. |
6.9.1.1.96 - Generate and print a bootstrap token, but do not create it on the server
Synopsis
This command will print out a randomly-generated bootstrap token that can be used with the "init" and "join" commands.
You don't have to use this command in order to generate a token. You can do so yourself as long as it is in the format "[a-z0-9]{6}.[a-z0-9]{16}". This command is provided for convenience to generate tokens in the given format.
You can also use "kubeadm init" without specifying a token and it will generate and print one for you.
kubeadm token generate [flags]
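One way to combine this with the create command, as a sketch:
# Generate a well-formed token locally, then register it on the server
kubeadm token create "$(kubeadm token generate)" --ttl 1h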
Options
-h, --help | |
help for generate |
Options inherited from parent commands
--dry-run | |
Whether to enable dry-run mode or not |
|
--kubeconfig string Default: "/etc/kubernetes/admin.conf" | |
The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file. |
|
--rootfs string | |
[EXPERIMENTAL] The path to the 'real' host root filesystem. |
6.9.1.1.97 - List bootstrap tokens on the server
Synopsis
This command will list all bootstrap tokens for you.
kubeadm token list [flags]
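For example (illustrative invocations; the -o value is one of the formats documented below):
kubeadm token list
kubeadm token list -o yaml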
Options
--allow-missing-template-keys  Default: true
If true, ignore any errors in templates when a field or map key is missing in the template. Only applies to golang and jsonpath output formats.
-o, --experimental-output string  Default: "text"
Output format. One of: text|json|yaml|go-template|go-template-file|template|templatefile|jsonpath|jsonpath-as-json|jsonpath-file.
-h, --help
help for list
--show-managed-fields
If true, keep the managedFields when printing objects in JSON or YAML format.
Options inherited from parent commands
--dry-run
Whether to enable dry-run mode or not
--kubeconfig string  Default: "/etc/kubernetes/admin.conf"
The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.98 - Upgrade your cluster smoothly to a newer version with this command
Synopsis
Upgrade your cluster smoothly to a newer version with this command
kubeadm upgrade [flags]
Options
-h, --help
help for upgrade
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.99 - Upgrade your Kubernetes cluster to the specified version
Synopsis
Upgrade your Kubernetes cluster to the specified version
kubeadm upgrade apply [version]
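A typical sequence, as a sketch (the target version is an illustrative placeholder; check 'kubeadm upgrade plan' for the versions actually available):
sudo kubeadm upgrade plan
sudo kubeadm upgrade apply v1.22.3 -y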
Options
--allow-experimental-upgrades | |
Show unstable versions of Kubernetes as an upgrade alternative and allow upgrading to an alpha/beta/release candidate versions of Kubernetes. |
|
--allow-release-candidate-upgrades | |
Show release candidate versions of Kubernetes as an upgrade alternative and allow upgrading to a release candidate versions of Kubernetes. |
|
--certificate-renewal Default: true | |
Perform the renewal of certificates used by component changed during upgrades. |
|
--config string | |
Path to a kubeadm configuration file. |
|
--dry-run | |
Do not change any state, just output what actions would be performed. |
|
--etcd-upgrade Default: true | |
Perform the upgrade of etcd. |
|
--feature-gates string | |
A set of key=value pairs that describe feature gates for various features. Options are: |
|
-f, --force | |
Force upgrading although some requirements might not be met. This also implies non-interactive mode. |
|
-h, --help | |
help for apply |
|
--ignore-preflight-errors strings | |
A list of checks whose errors will be shown as warnings. Example: 'IsPrivilegedUser,Swap'. Value 'all' ignores errors from all checks. |
|
--kubeconfig string Default: "/etc/kubernetes/admin.conf" | |
The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file. |
|
--patches string | |
Path to a directory that contains files named "target[suffix][+patchtype].extension". For example, "kube-apiserver0+merge.yaml" or just "etcd.json". "target" can be one of "kube-apiserver", "kube-controller-manager", "kube-scheduler", "etcd". "patchtype" can be one of "strategic", "merge" or "json" and they match the patch formats supported by kubectl. The default "patchtype" is "strategic". "extension" must be either "json" or "yaml". "suffix" is an optional string that can be used to determine which patches are applied first alpha-numerically. |
|
--print-config | |
Specifies whether the configuration file that will be used in the upgrade should be printed or not. |
|
-y, --yes | |
Perform the upgrade and do not prompt for confirmation (non-interactive mode). |
Options inherited from parent commands
--rootfs string | |
[EXPERIMENTAL] The path to the 'real' host root filesystem. |
6.9.1.1.100 - Show what differences would be applied to existing static pod manifests. See also: kubeadm upgrade apply --dry-run
Synopsis
Show what differences would be applied to existing static pod manifests. See also: kubeadm upgrade apply --dry-run
kubeadm upgrade diff [version] [flags]
Options
--api-server-manifest string  Default: "/etc/kubernetes/manifests/kube-apiserver.yaml"
path to API server manifest
--config string
Path to a kubeadm configuration file.
-c, --context-lines int  Default: 3
How many lines of context in the diff
--controller-manager-manifest string  Default: "/etc/kubernetes/manifests/kube-controller-manager.yaml"
path to controller manifest
-h, --help
help for diff
--kubeconfig string  Default: "/etc/kubernetes/admin.conf"
The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.
--scheduler-manifest string  Default: "/etc/kubernetes/manifests/kube-scheduler.yaml"
path to scheduler manifest
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.101 - Upgrade commands for a node in the cluster
Synopsis
Upgrade commands for a node in the cluster
The "node" command executes the following phases:
preflight Run upgrade node pre-flight checks
control-plane Upgrade the control plane instance deployed on this node, if any
kubelet-config Upgrade the kubelet configuration for this node
kubeadm upgrade node [flags]
Options
--certificate-renewal  Default: true
Perform the renewal of certificates used by components changed during upgrades.
--dry-run
Do not change any state, just output the actions that would be performed.
--etcd-upgrade  Default: true
Perform the upgrade of etcd.
-h, --help
help for node
--ignore-preflight-errors strings
A list of checks whose errors will be shown as warnings. Example: 'IsPrivilegedUser,Swap'. Value 'all' ignores errors from all checks.
--kubeconfig string  Default: "/etc/kubernetes/admin.conf"
The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.
--patches string
Path to a directory that contains files named "target[suffix][+patchtype].extension". For example, "kube-apiserver0+merge.yaml" or just "etcd.json". "target" can be one of "kube-apiserver", "kube-controller-manager", "kube-scheduler", "etcd". "patchtype" can be one of "strategic", "merge" or "json" and they match the patch formats supported by kubectl. The default "patchtype" is "strategic". "extension" must be either "json" or "yaml". "suffix" is an optional string that can be used to determine which patches are applied first alpha-numerically.
--skip-phases strings
List of phases to be skipped
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.102 - Use this command to invoke a single phase of the node workflow
Synopsis
Use this command to invoke a single phase of the node workflow
Options
-h, --help
help for phase
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.103 - Upgrade the control plane instance deployed on this node, if any
Synopsis
Upgrade the control plane instance deployed on this node, if any
kubeadm upgrade node phase control-plane [flags]
Options
--certificate-renewal  Default: true
Perform the renewal of certificates used by components changed during upgrades.
--dry-run
Do not change any state, just output the actions that would be performed.
--etcd-upgrade  Default: true
Perform the upgrade of etcd.
-h, --help
help for control-plane
--kubeconfig string  Default: "/etc/kubernetes/admin.conf"
The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.
--patches string
Path to a directory that contains files named "target[suffix][+patchtype].extension". For example, "kube-apiserver0+merge.yaml" or just "etcd.json". "target" can be one of "kube-apiserver", "kube-controller-manager", "kube-scheduler", "etcd". "patchtype" can be one of "strategic", "merge" or "json" and they match the patch formats supported by kubectl. The default "patchtype" is "strategic". "extension" must be either "json" or "yaml". "suffix" is an optional string that can be used to determine which patches are applied first alpha-numerically.
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.104 - Upgrade the kubelet configuration for this node
Synopsis
Download the kubelet configuration from a ConfigMap of the form "kubelet-config-1.X" in the cluster, where X is the minor version of the kubelet. kubeadm uses the KubernetesVersion field in the kubeadm-config ConfigMap to determine what the desired kubelet version is.
kubeadm upgrade node phase kubelet-config [flags]
Options
--dry-run
Do not change any state, just output the actions that would be performed.
-h, --help
help for kubelet-config
--kubeconfig string  Default: "/etc/kubernetes/admin.conf"
The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.105 - Run upgrade node pre-flight checks
Synopsis
Run pre-flight checks for kubeadm upgrade node.
kubeadm upgrade node phase preflight [flags]
Options
-h, --help
help for preflight
--ignore-preflight-errors strings
A list of checks whose errors will be shown as warnings. Example: 'IsPrivilegedUser,Swap'. Value 'all' ignores errors from all checks.
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.1.106 - Check which versions are available to upgrade to and validate whether your current cluster is upgradeable. To skip the internet check, pass in the optional [version] parameter
Synopsis
Check which versions are available to upgrade to and validate whether your current cluster is upgradeable. To skip the internet check, pass in the optional [version] parameter
kubeadm upgrade plan [version] [flags]
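For example (the pinned version in the second invocation is an illustrative placeholder):
# Query available upgrade versions
kubeadm upgrade plan
# Validate against a specific target version, skipping the internet check
kubeadm upgrade plan v1.22.3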
Options
--allow-experimental-upgrades | |
Show unstable versions of Kubernetes as an upgrade alternative and allow upgrading to an alpha/beta/release candidate versions of Kubernetes. |
|
--allow-release-candidate-upgrades | |
Show release candidate versions of Kubernetes as an upgrade alternative and allow upgrading to a release candidate versions of Kubernetes. |
|
--config string | |
Path to a kubeadm configuration file. |
|
--feature-gates string | |
A set of key=value pairs that describe feature gates for various features. Options are: |
|
-h, --help | |
help for plan |
|
--ignore-preflight-errors strings | |
A list of checks whose errors will be shown as warnings. Example: 'IsPrivilegedUser,Swap'. Value 'all' ignores errors from all checks. |
|
--kubeconfig string Default: "/etc/kubernetes/admin.conf" | |
The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file. |
|
--print-config | |
Specifies whether the configuration file that will be used in the upgrade should be printed or not. |
Options inherited from parent commands
--rootfs string | |
[EXPERIMENTAL] The path to the 'real' host root filesystem. |
6.9.1.1.107 - Print the version of kubeadm
Synopsis
Print the version of kubeadm
kubeadm version [flags]
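For example, using the output formats documented below:
kubeadm version -o short
kubeadm version -o json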
Options
-h, --help | |
help for version |
|
-o, --output string | |
Output format; available options are 'yaml', 'json' and 'short' |
Options inherited from parent commands
--rootfs string | |
[EXPERIMENTAL] The path to the 'real' host root filesystem. |
6.9.1.2 - kubeadm init
This command initializes a Kubernetes control-plane node.
Run this command in order to set up the Kubernetes control plane
Synopsis
Run this command in order to set up the Kubernetes control plane
The "init" command executes the following phases:
preflight Run pre-flight checks
certs Certificate generation
/ca Generate the self-signed Kubernetes CA to provision identities for other Kubernetes components
/apiserver Generate the certificate for serving the Kubernetes API
/apiserver-kubelet-client Generate the certificate for the API server to connect to kubelet
/front-proxy-ca Generate the self-signed CA to provision identities for front proxy
/front-proxy-client Generate the certificate for the front proxy client
/etcd-ca Generate the self-signed CA to provision identities for etcd
/etcd-server Generate the certificate for serving etcd
/etcd-peer Generate the certificate for etcd nodes to communicate with each other
/etcd-healthcheck-client Generate the certificate for liveness probes to healthcheck etcd
/apiserver-etcd-client Generate the certificate the apiserver uses to access etcd
/sa Generate a private key for signing service account tokens along with its public key
kubeconfig Generate all kubeconfig files necessary to establish the control plane and the admin kubeconfig file
/admin Generate a kubeconfig file for the admin to use and for kubeadm itself
/kubelet Generate a kubeconfig file for the kubelet to use *only* for cluster bootstrapping purposes
/controller-manager Generate a kubeconfig file for the controller manager to use
/scheduler Generate a kubeconfig file for the scheduler to use
kubelet-start Write kubelet settings and (re)start the kubelet
control-plane Generate all static Pod manifest files necessary to establish the control plane
/apiserver Generates the kube-apiserver static Pod manifest
/controller-manager Generates the kube-controller-manager static Pod manifest
/scheduler Generates the kube-scheduler static Pod manifest
etcd Generate static Pod manifest file for local etcd
/local Generate the static Pod manifest file for a local, single-node local etcd instance
upload-config Upload the kubeadm and kubelet configuration to a ConfigMap
/kubeadm Upload the kubeadm ClusterConfiguration to a ConfigMap
/kubelet Upload the kubelet component config to a ConfigMap
upload-certs Upload certificates to kubeadm-certs
mark-control-plane Mark a node as a control-plane
bootstrap-token Generates bootstrap tokens used to join a node to a cluster
kubelet-finalize Updates settings relevant to the kubelet after TLS bootstrap
/experimental-cert-rotation Enable kubelet client certificate rotation
addon Install required addons for passing conformance tests
/coredns Install the CoreDNS addon to a Kubernetes cluster
/kube-proxy Install the kube-proxy addon to a Kubernetes cluster
kubeadm init [flags]
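An illustrative invocation combining commonly used flags from the list below (the endpoint name and pod network CIDR are placeholder assumptions; pick values that match your environment and CNI plugin):
sudo kubeadm init --control-plane-endpoint "cluster-endpoint.example.com:6443" \
  --pod-network-cidr 10.244.0.0/16 --upload-certs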
Options
--apiserver-advertise-address string
The IP address the API Server will advertise it's listening on. If not set the default network interface will be used.
--apiserver-bind-port int32  Default: 6443
Port for the API Server to bind to.
--apiserver-cert-extra-sans strings
Optional extra Subject Alternative Names (SANs) to use for the API Server serving certificate. Can be both IP addresses and DNS names.
--cert-dir string  Default: "/etc/kubernetes/pki"
The path where the certificates are saved and stored.
--certificate-key string
Key used to encrypt the control-plane certificates in the kubeadm-certs Secret.
--config string
Path to a kubeadm configuration file.
--control-plane-endpoint string
Specify a stable IP address or DNS name for the control plane.
--cri-socket string
Path to the CRI socket to connect to. If empty kubeadm will try to auto-detect this value; use this option only if you have more than one CRI installed or if you have a non-standard CRI socket.
--dry-run
Don't apply any changes; just output what would be done.
--feature-gates string
A set of key=value pairs that describe feature gates for various features. Options are:
-h, --help
help for init
--ignore-preflight-errors strings
A list of checks whose errors will be shown as warnings. Example: 'IsPrivilegedUser,Swap'. Value 'all' ignores errors from all checks.
--image-repository string  Default: "k8s.gcr.io"
Choose a container registry to pull control plane images from
--kubernetes-version string  Default: "stable-1"
Choose a specific Kubernetes version for the control plane.
--node-name string
Specify the node name.
--patches string
Path to a directory that contains files named "target[suffix][+patchtype].extension". For example, "kube-apiserver0+merge.yaml" or just "etcd.json". "target" can be one of "kube-apiserver", "kube-controller-manager", "kube-scheduler", "etcd". "patchtype" can be one of "strategic", "merge" or "json" and they match the patch formats supported by kubectl. The default "patchtype" is "strategic". "extension" must be either "json" or "yaml". "suffix" is an optional string that can be used to determine which patches are applied first alpha-numerically.
--pod-network-cidr string
Specify range of IP addresses for the pod network. If set, the control plane will automatically allocate CIDRs for every node.
--service-cidr string  Default: "10.96.0.0/12"
Use alternative range of IP address for service VIPs.
--service-dns-domain string  Default: "cluster.local"
Use alternative domain for services, e.g. "myorg.internal".
--skip-certificate-key-print
Don't print the key used to encrypt the control-plane certificates.
--skip-phases strings
List of phases to be skipped
--skip-token-print
Skip printing of the default bootstrap token generated by 'kubeadm init'.
--token string
The token to use for establishing bidirectional trust between nodes and control-plane nodes. The format is [a-z0-9]{6}.[a-z0-9]{16} - e.g. abcdef.0123456789abcdef
--token-ttl duration  Default: 24h0m0s
The duration before the token is automatically deleted (e.g. 1s, 2m, 3h). If set to '0', the token will never expire
--upload-certs
Upload control-plane certificates to the kubeadm-certs Secret.
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
Init workflow
kubeadm init bootstraps a Kubernetes control-plane node by executing the following steps:
1. Runs a series of pre-flight checks to validate the system state before making changes. Some checks only trigger warnings, others are considered errors and will exit kubeadm until the problem is corrected or the user specifies --ignore-preflight-errors=<list-of-errors>.
2. Generates a self-signed CA to set up identities for each component in the cluster. The user can provide their own CA cert and/or key by dropping it in the cert directory configured via --cert-dir (/etc/kubernetes/pki by default). The API server certs will have additional SAN entries for any --apiserver-cert-extra-sans arguments, lowercased if necessary.
3. Writes kubeconfig files in /etc/kubernetes/ for the kubelet, the controller-manager and the scheduler to use to connect to the API server, each with its own identity, as well as an additional kubeconfig file for administration named admin.conf.
4. Generates static Pod manifests for the API server, controller-manager and scheduler. In case an external etcd is not provided, an additional static Pod manifest is generated for etcd. Static Pod manifests are written to /etc/kubernetes/manifests; the kubelet watches this directory for Pods to create on startup. Once the control plane Pods are up and running, the kubeadm init sequence can continue.
5. Applies labels and taints to the control-plane node so that no additional workloads will run there.
6. Generates the token that additional nodes can use to register themselves with a control-plane in the future. Optionally, the user can provide a token via --token, as described in the kubeadm token docs.
7. Makes all the necessary configurations for allowing node joining with the Bootstrap Tokens and TLS Bootstrap mechanism:
   - Writes a ConfigMap for making available all the information required for joining, and sets up related RBAC access rules.
   - Lets Bootstrap Tokens access the CSR signing API.
   - Configures auto-approval for new CSR requests.
   See kubeadm join for additional info.
8. Installs a DNS server (CoreDNS) and the kube-proxy addon components via the API server. In Kubernetes version 1.11 and later CoreDNS is the default DNS server. Please note that although the DNS server is deployed, it will not be scheduled until CNI is installed.
Warning: kube-dns usage with kubeadm is deprecated as of v1.18 and is removed in v1.21.
Using init phases with kubeadm
Kubeadm allows you to create a control-plane node in phases using the kubeadm init phase
command.
To view the ordered list of phases and sub-phases you can call kubeadm init --help
. The list will be located at the top of the help screen and each phase will have a description next to it.
Note that by calling kubeadm init
all of the phases and sub-phases will be executed in this exact order.
Some phases have unique flags, so if you want to have a look at the list of available options add --help
, for example:
sudo kubeadm init phase control-plane controller-manager --help
You can also use --help
to see the list of sub-phases for a certain parent phase:
sudo kubeadm init phase control-plane --help
kubeadm init
also exposes a flag called --skip-phases
that can be used to skip certain phases. The flag accepts a list of phase names and the names can be taken from the above ordered list.
An example:
sudo kubeadm init phase control-plane all --config=configfile.yaml
sudo kubeadm init phase etcd local --config=configfile.yaml
# you can now modify the control plane and etcd manifest files
sudo kubeadm init --skip-phases=control-plane,etcd --config=configfile.yaml
This example writes the manifest files for the control plane and etcd to /etc/kubernetes/manifests, based on the configuration in configfile.yaml. You can then modify these files and, by running the last command with --skip-phases, create a control-plane node that uses your custom manifests.
FEATURE STATE: Kubernetes v1.22 [beta]
Alternatively, you can use the skipPhases
field under InitConfiguration
.
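For instance, a minimal sketch of an InitConfiguration that uses skipPhases; the phase name shown is illustrative, so take exact names from the ordered list in kubeadm init --help:
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
skipPhases:
  - addon/kube-proxy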
Using kubeadm init with a configuration file
It's possible to configure kubeadm init with a configuration file instead of command line flags, and some more advanced features may only be available as configuration file options. This file is passed using the --config flag and it must contain a ClusterConfiguration structure and, optionally, more structures separated by ---. Mixing --config with other flags may not be allowed in some cases.
The default configuration can be printed out using the kubeadm config print command.
If your configuration is not using the latest version it is recommended that you migrate using the kubeadm config migrate command.
For more information on the fields and usage of the configuration you can navigate to our API reference page.
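As an illustration, a minimal configuration file might look like the following; the version, endpoint and pod subnet are placeholder values:
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.23.0
controlPlaneEndpoint: "cp.example.com:6443"
networking:
  podSubnet: "10.244.0.0/16"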
Adding kube-proxy parameters
For information about kube-proxy parameters in the kubeadm configuration see:
For information about enabling IPVS mode with kubeadm see:
Passing custom flags to control plane components
For information about passing flags to control plane components see:
Running kubeadm without an Internet connection
For running kubeadm without an Internet connection you have to pre-pull the required control-plane images.
You can list and pull the images using the kubeadm config images
sub-command:
kubeadm config images list
kubeadm config images pull
You can pass --config
to the above commands with a kubeadm configuration file
to control the kubernetesVersion
and imageRepository
fields.
All default k8s.gcr.io
images that kubeadm requires support multiple architectures.
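For example, a small sketch that pins both fields before pre-pulling; the registry and version shown are placeholders:
cat <<EOF > images.yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.23.0
imageRepository: registry.example.com/k8s
EOF
kubeadm config images pull --config=images.yaml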
Using custom images
By default, kubeadm pulls images from k8s.gcr.io
. If the
requested Kubernetes version is a CI label (such as ci/latest
)
gcr.io/k8s-staging-ci-images
is used.
You can override this behavior by using kubeadm with a configuration file. Allowed customizations are:
- To provide kubernetesVersion, which affects the version of the images.
- To provide an alternative imageRepository to be used instead of k8s.gcr.io.
- To provide a specific imageRepository and imageTag for etcd or CoreDNS.

Image paths between the default k8s.gcr.io and a custom repository specified using imageRepository may differ for backwards compatibility reasons. For example, one image might have a subpath at k8s.gcr.io/subpath/image, but be defaulted to my.customrepository.io/image when using a custom repository.

To ensure you push the images to your custom repository in paths that kubeadm can consume, you must:
- Pull images from the default paths at k8s.gcr.io using kubeadm config images {list|pull}.
- Push images to the paths from kubeadm config images list --config=config.yaml, where config.yaml contains the custom imageRepository, and/or imageTag for etcd and CoreDNS.
- Pass the same config.yaml to kubeadm init.
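A sketch of such a config.yaml, assuming the hypothetical registry my.customrepository.io and illustrative image tags:
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
imageRepository: my.customrepository.io
etcd:
  local:
    imageRepository: my.customrepository.io/etcd-custom
    imageTag: 3.5.0-custom
dns:
  imageRepository: my.customrepository.io/coredns-custom
  imageTag: v1.8.4-custom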
Uploading control-plane certificates to the cluster
By adding the flag --upload-certs to kubeadm init you can temporarily upload the control-plane certificates to a Secret in the cluster. Please note that this Secret will expire automatically after 2 hours. The certificates are encrypted using a 32-byte key that can be specified using --certificate-key. The same key can be used to download the certificates when additional control-plane nodes are joining, by passing --control-plane and --certificate-key to kubeadm join.
The following phase command can be used to re-upload the certificates after expiration:
kubeadm init phase upload-certs --upload-certs --certificate-key=SOME_VALUE --config=SOME_YAML_FILE
If the flag --certificate-key is not passed to kubeadm init or kubeadm init phase upload-certs, a new key will be generated automatically.
The following command can be used to generate a new key on demand:
kubeadm certs certificate-key
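Putting this together, a typical flow for a multi-node control plane might look like the following; the load-balancer endpoint is a placeholder and the token, hash and key are values printed by kubeadm init:
# on the first control-plane node
sudo kubeadm init --control-plane-endpoint "lb.example.com:6443" --upload-certs

# on each additional control-plane node, using the values printed above
sudo kubeadm join lb.example.com:6443 --token <token> \
    --discovery-token-ca-cert-hash sha256:<hash> \
    --control-plane --certificate-key <key>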
Certificate management with kubeadm
For detailed information on certificate management with kubeadm see Certificate Management with kubeadm. The document includes information about using external CA, custom certificates and certificate renewal.
Managing the kubeadm drop-in file for the kubelet
The kubeadm package ships with a configuration file for running the kubelet under systemd. Note that the kubeadm CLI never touches this drop-in file. This drop-in file is part of the kubeadm DEB/RPM package.
For further information, see Managing the kubeadm drop-in file for systemd.
Use kubeadm with CRI runtimes
By default kubeadm attempts to detect your container runtime. For more details on this detection, see the kubeadm CRI installation guide.
Setting the node name
By default, kubeadm
assigns a node name based on a machine's host address. You can override this setting with the --node-name
flag.
The flag passes the appropriate --hostname-override
value to the kubelet.
Be aware that overriding the hostname can interfere with cloud providers.
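For example (the node name shown is a placeholder):
sudo kubeadm init --node-name my-control-plane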
Automating kubeadm
Rather than copying the token you obtained from kubeadm init to each node, as in the basic kubeadm tutorial, you can parallelize the token distribution for easier automation. To implement this automation, you must know the IP address that the control-plane node will have after it is started, or use a DNS name or an address of a load balancer.

1. Generate a token. This token must have the form <6 character string>.<16 character string>. More formally, it must match the regex [a-z0-9]{6}\.[a-z0-9]{16}. kubeadm can generate a token for you:

   kubeadm token generate

2. Start both the control-plane node and the worker nodes concurrently with this token. As they come up they should find each other and form the cluster. The same --token argument can be used on both kubeadm init and kubeadm join.

3. The same can be done for --certificate-key when joining additional control-plane nodes. The key can be generated using:

   kubeadm certs certificate-key

A sketch of this flow is shown below.
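A minimal sketch of this parallel bootstrap, assuming a reachable control-plane endpoint cp.example.com:6443 (a placeholder) and the relaxed discovery mode discussed below:
# generate shared credentials ahead of time
TOKEN=$(kubeadm token generate)
CERT_KEY=$(kubeadm certs certificate-key)

# on the control-plane node
sudo kubeadm init --token "${TOKEN}" --certificate-key "${CERT_KEY}" --upload-certs

# on the worker nodes, started in parallel
sudo kubeadm join cp.example.com:6443 --token "${TOKEN}" \
    --discovery-token-unsafe-skip-ca-verification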
Once the cluster is up, you can grab the admin credentials from the control-plane node
at /etc/kubernetes/admin.conf
and use that to talk to the cluster.
Note that this style of bootstrap has some relaxed security guarantees because
it does not allow the root CA hash to be validated with
--discovery-token-ca-cert-hash
(since it's not generated when the nodes are
provisioned). For details, see kubeadm join.
What's next
- kubeadm init phase to understand more about kubeadm init phases
- kubeadm join to bootstrap a Kubernetes worker node and join it to the cluster
- kubeadm upgrade to upgrade a Kubernetes cluster to a newer version
- kubeadm reset to revert any changes made to this host by kubeadm init or kubeadm join
6.9.1.3 - kubeadm join
This command initializes a Kubernetes worker node and joins it to the cluster.
Run this on any machine you wish to join an existing cluster
Synopsis
When joining a kubeadm initialized cluster, we need to establish bidirectional trust. This is split into discovery (having the Node trust the Kubernetes Control Plane) and TLS bootstrap (having the Kubernetes Control Plane trust the Node).
There are 2 main schemes for discovery. The first is to use a shared token along with the IP address of the API server. The second is to provide a file - a subset of the standard kubeconfig file. This file can be a local file or downloaded via an HTTPS URL. The forms are kubeadm join --discovery-token abcdef.1234567890abcdef 1.2.3.4:6443, kubeadm join --discovery-file path/to/file.conf, or kubeadm join --discovery-file https://url/file.conf. Only one form can be used. If the discovery information is loaded from a URL, HTTPS must be used. Also, in that case the host installed CA bundle is used to verify the connection.
If you use a shared token for discovery, you should also pass the --discovery-token-ca-cert-hash flag to validate the public key of the root certificate authority (CA) presented by the Kubernetes Control Plane. The value of this flag is specified as "<hash-type>:<hex-encoded-value>", where the supported hash type is "sha256". The hash is calculated over the bytes of the Subject Public Key Info (SPKI) object (as in RFC7469). This value is available in the output of "kubeadm init" or can be calculated using standard tools. The --discovery-token-ca-cert-hash flag may be repeated multiple times to allow more than one public key.
If you cannot know the CA public key hash ahead of time, you can pass the --discovery-token-unsafe-skip-ca-verification flag to disable this verification. This weakens the kubeadm security model since other nodes can potentially impersonate the Kubernetes Control Plane.
The TLS bootstrap mechanism is also driven via a shared token. This is used to temporarily authenticate with the Kubernetes Control Plane to submit a certificate signing request (CSR) for a locally created key pair. By default, kubeadm will set up the Kubernetes Control Plane to automatically approve these signing requests. This token is passed in with the --tls-bootstrap-token abcdef.1234567890abcdef flag.
Often, the same token is used for both parts. In this case, the --token flag can be used instead of specifying each token individually.
The "join [api-server-endpoint]" command executes the following phases:
preflight Run join pre-flight checks
control-plane-prepare Prepare the machine for serving a control plane
/download-certs [EXPERIMENTAL] Download certificates shared among control-plane nodes from the kubeadm-certs Secret
/certs Generate the certificates for the new control plane components
/kubeconfig Generate the kubeconfig for the new control plane components
/control-plane Generate the manifests for the new control plane components
kubelet-start Write kubelet settings, certificates and (re)start the kubelet
control-plane-join Join a machine as a control plane instance
/etcd Add a new local etcd member
/update-status Register the new control-plane node into the ClusterStatus maintained in the kubeadm-config ConfigMap (DEPRECATED)
/mark-control-plane Mark a node as a control-plane
kubeadm join [api-server-endpoint] [flags]
Options
--apiserver-advertise-address string
    If the node should host a new control plane instance, the IP address the API Server will advertise it's listening on. If not set the default network interface will be used.

--apiserver-bind-port int32     Default: 6443
    If the node should host a new control plane instance, the port for the API Server to bind to.

--certificate-key string
    Use this key to decrypt the certificate secrets uploaded by init.

--config string
    Path to a kubeadm configuration file.

--control-plane
    Create a new control plane instance on this node.

--cri-socket string
    Path to the CRI socket to connect to. If empty kubeadm will try to auto-detect this value; use this option only if you have more than one CRI installed or if you have a non-standard CRI socket.

--discovery-file string
    For file-based discovery, a file or URL from which to load cluster information.

--discovery-token string
    For token-based discovery, the token used to validate cluster information fetched from the API server.

--discovery-token-ca-cert-hash strings
    For token-based discovery, validate that the root CA public key matches this hash (format: "<type>:<value>").

--discovery-token-unsafe-skip-ca-verification
    For token-based discovery, allow joining without --discovery-token-ca-cert-hash pinning.

--dry-run
    Don't apply any changes; just output what would be done.

-h, --help
    Help for join.

--ignore-preflight-errors strings
    A list of checks whose errors will be shown as warnings. Example: 'IsPrivilegedUser,Swap'. Value 'all' ignores errors from all checks.

--node-name string
    Specify the node name.

--patches string
    Path to a directory that contains files named "target[suffix][+patchtype].extension". For example, "kube-apiserver0+merge.yaml" or just "etcd.json". "target" can be one of "kube-apiserver", "kube-controller-manager", "kube-scheduler", "etcd". "patchtype" can be one of "strategic", "merge" or "json" and they match the patch formats supported by kubectl. The default "patchtype" is "strategic". "extension" must be either "json" or "yaml". "suffix" is an optional string that can be used to determine which patches are applied first alpha-numerically.

--skip-phases strings
    List of phases to be skipped.

--tls-bootstrap-token string
    Specify the token used to temporarily authenticate with the Kubernetes Control Plane while joining the node.

--token string
    Use this token for both discovery-token and tls-bootstrap-token when those values are not provided.

Options inherited from parent commands

--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
The join workflow
kubeadm join bootstraps a Kubernetes worker node or a control-plane node and adds it to the cluster. For worker nodes, this action consists of the following steps:

1. kubeadm downloads necessary cluster information from the API server. By default, it uses the bootstrap token and the CA key hash to verify the authenticity of that data. The root CA can also be discovered directly via a file or URL.

2. Once the cluster information is known, the kubelet can start the TLS bootstrapping process. The TLS bootstrap uses the shared token to temporarily authenticate with the Kubernetes API server and submit a certificate signing request (CSR); by default the control plane signs this CSR automatically.

3. Finally, kubeadm configures the local kubelet to connect to the API server with the definitive identity assigned to the node.

For control-plane nodes, additional steps are performed:

1. Downloading certificates shared among control-plane nodes from the cluster (if explicitly requested by the user).

2. Generating control-plane component manifests, certificates and kubeconfig files.

3. Adding a new local etcd member.
Using join phases with kubeadm
Kubeadm allows you to join a node to the cluster in phases using kubeadm join phase.
To view the ordered list of phases and sub-phases you can call kubeadm join --help
. The list will be located
at the top of the help screen and each phase will have a description next to it.
Note that by calling kubeadm join
all of the phases and sub-phases will be executed in this exact order.
Some phases have unique flags, so if you want to have a look at the list of available options add --help
, for example:
kubeadm join phase kubelet-start --help
Similar to the kubeadm init phase
command, kubeadm join phase
allows you to skip a list of phases using the --skip-phases
flag.
For example:
sudo kubeadm join --skip-phases=preflight --config=config.yaml
FEATURE STATE: Kubernetes v1.22 [beta]
Alternatively, you can use the skipPhases
field in JoinConfiguration
.
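A minimal sketch of a JoinConfiguration that skips the preflight phase; all discovery values are placeholders, and skipping CA verification is shown here only to keep the example self-contained:
apiVersion: kubeadm.k8s.io/v1beta3
kind: JoinConfiguration
skipPhases:
  - preflight
discovery:
  bootstrapToken:
    apiServerEndpoint: "1.2.3.4:6443"
    token: abcdef.0123456789abcdef
    unsafeSkipCAVerification: true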
Discovering what cluster CA to trust
The kubeadm discovery has several options, each with security tradeoffs. The right method for your environment depends on how you provision nodes and the security expectations you have about your network and node lifecycles.
Token-based discovery with CA pinning
This is the default mode in kubeadm. In this mode, kubeadm downloads the cluster configuration (including root CA) and validates it using the token as well as validating that the root CA public key matches the provided hash and that the API server certificate is valid under the root CA.
The CA key hash has the format sha256:<hex_encoded_hash>
. By default, the hash value is returned in the kubeadm join
command printed at the end of kubeadm init
or in the output of kubeadm token create --print-join-command
. It is in a standard format (see RFC7469) and can also be calculated by 3rd party tools or provisioning systems. For example, using the OpenSSL CLI:
openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt | openssl rsa -pubin -outform der 2>/dev/null | openssl dgst -sha256 -hex | sed 's/^.* //'
Example kubeadm join
commands:
For worker nodes:
kubeadm join --discovery-token abcdef.1234567890abcdef --discovery-token-ca-cert-hash sha256:1234..cdef 1.2.3.4:6443
For control-plane nodes:
kubeadm join --discovery-token abcdef.1234567890abcdef --discovery-token-ca-cert-hash sha256:1234..cdef --control-plane 1.2.3.4:6443
You can also call join for a control-plane node with --certificate-key to copy certificates to this node, if the kubeadm init command was called with --upload-certs.
Advantages:
- Allows bootstrapping nodes to securely discover a root of trust for the control-plane node even if other worker nodes or the network are compromised.
- Convenient to execute manually since all of the information required fits into a single kubeadm join command.

Disadvantages:
- The CA hash is not normally known until the control-plane node has been provisioned, which can make it more difficult to build automated provisioning tools that use kubeadm. By generating your CA beforehand, you may work around this limitation.
Token-based discovery without CA pinning
This mode relies only on the symmetric token to sign (HMAC-SHA256) the discovery information that establishes the root of trust for the control plane. To use this mode, the joining nodes must skip the hash validation of the CA public key, using --discovery-token-unsafe-skip-ca-verification. You should consider using one of the other modes if possible.
Example kubeadm join
command:
kubeadm join --token abcdef.1234567890abcdef --discovery-token-unsafe-skip-ca-verification 1.2.3.4:6443
Advantages:
- Still protects against many network-level attacks.
- The token can be generated ahead of time and shared with the control-plane node and worker nodes, which can then bootstrap in parallel without coordination. This allows it to be used in many provisioning scenarios.

Disadvantages:
- If an attacker is able to steal a bootstrap token via some vulnerability, they can use that token (along with network-level access) to impersonate the control-plane node to other bootstrapping nodes. This may or may not be an appropriate tradeoff in your environment.
File or HTTPS-based discovery
This provides an out-of-band way to establish a root of trust between the control-plane node and bootstrapping nodes. Consider using this mode if you are building automated provisioning using kubeadm. The format of the discovery file is a regular Kubernetes kubeconfig file.
In case the discovery file does not contain credentials, the TLS discovery token will be used.
Example kubeadm join
commands:
- kubeadm join --discovery-file path/to/file.conf (local file)
- kubeadm join --discovery-file https://url/file.conf (remote HTTPS URL)
Advantages:
- Allows bootstrapping nodes to securely discover a root of trust for the control-plane node even if the network or other worker nodes are compromised.
Disadvantages:
- Requires that you have some way to carry the discovery information from the control-plane node to the bootstrapping nodes. If the discovery file contains credentials you must keep it secret and transfer it over a secure channel. This might be possible with your cloud provider or provisioning tool.
Securing your installation even more
The defaults for kubeadm may not work for everyone. This section documents how to tighten up a kubeadm installation at the cost of some usability.
Turning off auto-approval of node client certificates
By default, there is a CSR auto-approver enabled that approves any client certificate request for a kubelet when a Bootstrap Token was used for authentication. If you don't want the cluster to automatically approve kubelet client certificates, you can turn it off by executing this command:
kubectl delete clusterrolebinding kubeadm:node-autoapprove-bootstrap
After that, kubeadm join
will block until the admin has manually approved the CSR in flight:
kubectl get csr
The output is similar to this:
NAME AGE REQUESTOR CONDITION
node-csr-c69HXe7aYcqkS1bKmH4faEnHAWxn6i2bHZ2mD04jZyQ 18s system:bootstrap:878f07 Pending
kubectl certificate approve node-csr-c69HXe7aYcqkS1bKmH4faEnHAWxn6i2bHZ2mD04jZyQ
The output is similar to this:
certificatesigningrequest "node-csr-c69HXe7aYcqkS1bKmH4faEnHAWxn6i2bHZ2mD04jZyQ" approved
kubectl get csr
The output is similar to this:
NAME AGE REQUESTOR CONDITION
node-csr-c69HXe7aYcqkS1bKmH4faEnHAWxn6i2bHZ2mD04jZyQ 1m system:bootstrap:878f07 Approved,Issued
With this setting in place, kubeadm join succeeds only after kubectl certificate approve has been run.
Turning off public access to the cluster-info ConfigMap
In order to achieve the joining flow using the token as the only piece of validation information, a
ConfigMap with some data needed for validation of the control-plane node's identity is exposed publicly by
default. While there is no private data in this ConfigMap, some users might wish to turn
it off regardless. Doing so will disable the ability to use the --discovery-token
flag of the
kubeadm join
flow. Here are the steps to do so:
1. Fetch the cluster-info file from the API Server:
kubectl -n kube-public get cm cluster-info -o yaml | grep "kubeconfig:" -A11 | grep "apiVersion" -A10 | sed "s/ //" | tee cluster-info.yaml
The output is similar to this:
apiVersion: v1
kind: Config
clusters:
- cluster:
certificate-authority-data: <ca-cert>
server: https://<ip>:<port>
name: ""
contexts: []
current-context: ""
preferences: {}
users: []
2. Use the cluster-info.yaml file as an argument to kubeadm join --discovery-file.

3. Turn off public access to the cluster-info ConfigMap:
kubectl -n kube-public delete rolebinding kubeadm:bootstrap-signer-clusterinfo
These commands should be run after kubeadm init
but before kubeadm join
.
Using kubeadm join with a configuration file
It's possible to configure kubeadm join with a configuration file instead of command line flags, and some more advanced features may only be available as configuration file options. This file is passed using the --config flag and it must contain a JoinConfiguration structure. Mixing --config with other flags may not be allowed in some cases.
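A minimal sketch of such a file using token-based discovery with CA pinning; all values are placeholders:
apiVersion: kubeadm.k8s.io/v1beta3
kind: JoinConfiguration
discovery:
  bootstrapToken:
    apiServerEndpoint: "1.2.3.4:6443"
    token: abcdef.0123456789abcdef
    caCertHashes:
      - "sha256:1234..cdef"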
The default configuration can be printed out using the kubeadm config print command.
If your configuration is not using the latest version it is recommended that you migrate using the kubeadm config migrate command.
For more information on the fields and usage of the configuration you can navigate to our API reference.
What's next
- kubeadm init to bootstrap a Kubernetes control-plane node
- kubeadm token to manage tokens for kubeadm join
- kubeadm reset to revert any changes made to this host by kubeadm init or kubeadm join
6.9.1.4 - kubeadm upgrade
kubeadm upgrade
is a user-friendly command that wraps complex upgrading logic
behind one command, with support for both planning an upgrade and actually performing it.
kubeadm upgrade guidance
The steps for performing an upgrade using kubeadm are outlined in this document. For older versions of kubeadm, please refer to older documentation sets of the Kubernetes website.
You can use kubeadm upgrade diff
to see the changes that would be applied to static pod manifests.
In Kubernetes v1.15.0 and later, kubeadm upgrade apply
and kubeadm upgrade node
will also
automatically renew the kubeadm managed certificates on this node, including those stored in kubeconfig files.
To opt-out, it is possible to pass the flag --certificate-renewal=false
. For more details about certificate
renewal see the certificate management documentation.
kubeadm upgrade apply
and kubeadm upgrade plan
have a legacy --config
flag which makes it possible to reconfigure the cluster, while performing planning or upgrade of that particular
control-plane node. Please be aware that the upgrade workflow was not designed for this scenario and there are
reports of unexpected results.
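As a quick orientation, a typical upgrade might proceed as follows; the target version is a placeholder:
# check which versions are available and whether the cluster can be upgraded
kubeadm upgrade plan

# on the first control-plane node
sudo kubeadm upgrade apply v1.23.1

# on every other control-plane and worker node
sudo kubeadm upgrade node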
kubeadm upgrade plan
Check which versions are available to upgrade to and validate whether your current cluster is upgradeable. To skip the internet check, pass in the optional [version] parameter
Synopsis
Check which versions are available to upgrade to and validate whether your current cluster is upgradeable. To skip the internet check, pass in the optional [version] parameter
kubeadm upgrade plan [version] [flags]
Options
--allow-experimental-upgrades
    Show unstable versions of Kubernetes as an upgrade alternative and allow upgrading to alpha/beta/release candidate versions of Kubernetes.

--allow-release-candidate-upgrades
    Show release candidate versions of Kubernetes as an upgrade alternative and allow upgrading to release candidate versions of Kubernetes.

--config string
    Path to a kubeadm configuration file.

--feature-gates string
    A set of key=value pairs that describe feature gates for various features.

-h, --help
    Help for plan.

--ignore-preflight-errors strings
    A list of checks whose errors will be shown as warnings. Example: 'IsPrivilegedUser,Swap'. Value 'all' ignores errors from all checks.

--kubeconfig string     Default: "/etc/kubernetes/admin.conf"
    The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.

--print-config
    Specifies whether the configuration file that will be used in the upgrade should be printed or not.

Options inherited from parent commands

--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
kubeadm upgrade apply
Upgrade your Kubernetes cluster to the specified version
Synopsis
Upgrade your Kubernetes cluster to the specified version
kubeadm upgrade apply [version]
Options
--allow-experimental-upgrades
    Show unstable versions of Kubernetes as an upgrade alternative and allow upgrading to alpha/beta/release candidate versions of Kubernetes.

--allow-release-candidate-upgrades
    Show release candidate versions of Kubernetes as an upgrade alternative and allow upgrading to release candidate versions of Kubernetes.

--certificate-renewal     Default: true
    Perform the renewal of certificates used by components changed during upgrades.

--config string
    Path to a kubeadm configuration file.

--dry-run
    Do not change any state, just output what actions would be performed.

--etcd-upgrade     Default: true
    Perform the upgrade of etcd.

--feature-gates string
    A set of key=value pairs that describe feature gates for various features.

-f, --force
    Force upgrading although some requirements might not be met. This also implies non-interactive mode.

-h, --help
    Help for apply.

--ignore-preflight-errors strings
    A list of checks whose errors will be shown as warnings. Example: 'IsPrivilegedUser,Swap'. Value 'all' ignores errors from all checks.

--kubeconfig string     Default: "/etc/kubernetes/admin.conf"
    The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.

--patches string
    Path to a directory that contains files named "target[suffix][+patchtype].extension". For example, "kube-apiserver0+merge.yaml" or just "etcd.json". "target" can be one of "kube-apiserver", "kube-controller-manager", "kube-scheduler", "etcd". "patchtype" can be one of "strategic", "merge" or "json" and they match the patch formats supported by kubectl. The default "patchtype" is "strategic". "extension" must be either "json" or "yaml". "suffix" is an optional string that can be used to determine which patches are applied first alpha-numerically.

--print-config
    Specifies whether the configuration file that will be used in the upgrade should be printed or not.

-y, --yes
    Perform the upgrade and do not prompt for confirmation (non-interactive mode).

Options inherited from parent commands

--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
kubeadm upgrade diff
Show what differences would be applied to existing static pod manifests. See also: kubeadm upgrade apply --dry-run
Synopsis
Show what differences would be applied to existing static pod manifests. See also: kubeadm upgrade apply --dry-run
kubeadm upgrade diff [version] [flags]
Options
--api-server-manifest string     Default: "/etc/kubernetes/manifests/kube-apiserver.yaml"
    Path to the API server manifest.

--config string
    Path to a kubeadm configuration file.

-c, --context-lines int     Default: 3
    How many lines of context to show in the diff.

--controller-manager-manifest string     Default: "/etc/kubernetes/manifests/kube-controller-manager.yaml"
    Path to the controller manager manifest.

-h, --help
    Help for diff.

--kubeconfig string     Default: "/etc/kubernetes/admin.conf"
    The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.

--scheduler-manifest string     Default: "/etc/kubernetes/manifests/kube-scheduler.yaml"
    Path to the scheduler manifest.

Options inherited from parent commands

--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
kubeadm upgrade node
Upgrade commands for a node in the cluster
Synopsis
Upgrade commands for a node in the cluster
The "node" command executes the following phases:
preflight Run upgrade node pre-flight checks
control-plane Upgrade the control plane instance deployed on this node, if any
kubelet-config Upgrade the kubelet configuration for this node
kubeadm upgrade node [flags]
Options
--certificate-renewal     Default: true
    Perform the renewal of certificates used by components changed during upgrades.

--dry-run
    Do not change any state, just output the actions that would be performed.

--etcd-upgrade     Default: true
    Perform the upgrade of etcd.

-h, --help
    Help for node.

--ignore-preflight-errors strings
    A list of checks whose errors will be shown as warnings. Example: 'IsPrivilegedUser,Swap'. Value 'all' ignores errors from all checks.

--kubeconfig string     Default: "/etc/kubernetes/admin.conf"
    The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.

--patches string
    Path to a directory that contains files named "target[suffix][+patchtype].extension". For example, "kube-apiserver0+merge.yaml" or just "etcd.json". "target" can be one of "kube-apiserver", "kube-controller-manager", "kube-scheduler", "etcd". "patchtype" can be one of "strategic", "merge" or "json" and they match the patch formats supported by kubectl. The default "patchtype" is "strategic". "extension" must be either "json" or "yaml". "suffix" is an optional string that can be used to determine which patches are applied first alpha-numerically.

--skip-phases strings
    List of phases to be skipped.

Options inherited from parent commands

--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
What's next
- kubeadm config if you initialized your cluster using kubeadm v1.7.x or lower, to configure your cluster for kubeadm upgrade
6.9.1.5 - kubeadm config
During kubeadm init
, kubeadm uploads the ClusterConfiguration
object to your cluster
in a ConfigMap called kubeadm-config
in the kube-system
namespace. This configuration is then read during
kubeadm join
, kubeadm reset
and kubeadm upgrade
.
You can use kubeadm config print
to print the default static configuration that kubeadm
uses for kubeadm init
and kubeadm join
.
For more information on init
and join
navigate to
Using kubeadm init with a configuration file
or Using kubeadm join with a configuration file.
For more information on using the kubeadm configuration API navigate to Customizing components with the kubeadm API.
You can use kubeadm config migrate
to convert your old configuration files that contain a deprecated
API version to a newer, supported API version.
kubeadm config images list
and kubeadm config images pull
can be used to list and pull the images
that kubeadm requires.
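For example, to save the defaults as a starting point and to convert an older configuration file; the file names are placeholders:
kubeadm config print init-defaults > init.yaml
kubeadm config migrate --old-config old.yaml --new-config new.yaml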
kubeadm config print
Print configuration
Synopsis
This command prints configurations for subcommands provided. For details, see: https://pkg.go.dev/k8s.io/kubernetes/cmd/kubeadm/app/apis/kubeadm#section-directories
kubeadm config print [flags]
Options
-h, --help
    Help for print.

Options inherited from parent commands

--kubeconfig string     Default: "/etc/kubernetes/admin.conf"
    The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.

--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
kubeadm config print init-defaults
Print the default init configuration that can be used for 'kubeadm init'
Synopsis
This command prints objects such as the default init configuration that is used for 'kubeadm init'.
Note that sensitive values like the Bootstrap Token fields are replaced with placeholder values like "abcdef.0123456789abcdef" in order to pass validation but not perform the real computation for creating a token.
kubeadm config print init-defaults [flags]
Options
--component-configs strings
    A comma-separated list of component config API objects to print the default values for. Available values: [KubeProxyConfiguration KubeletConfiguration]. If this flag is not set, no component configs will be printed.

-h, --help
    Help for init-defaults.

Options inherited from parent commands

--kubeconfig string     Default: "/etc/kubernetes/admin.conf"
    The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.

--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
kubeadm config print join-defaults
Print the default join configuration that can be used for 'kubeadm join'
Synopsis
This command prints objects such as the default join configuration that is used for 'kubeadm join'.
Note that sensitive values like the Bootstrap Token fields are replaced with placeholder values like "abcdef.0123456789abcdef" in order to pass validation but not perform the real computation for creating a token.
kubeadm config print join-defaults [flags]
Options
--component-configs strings
    A comma-separated list of component config API objects to print the default values for. Available values: [KubeProxyConfiguration KubeletConfiguration]. If this flag is not set, no component configs will be printed.

-h, --help
    Help for join-defaults.

Options inherited from parent commands

--kubeconfig string     Default: "/etc/kubernetes/admin.conf"
    The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.

--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
kubeadm config migrate
Read an older version of the kubeadm configuration API types from a file, and output the similar config object for the newer version
Synopsis
This command lets you convert configuration objects of older versions to the latest supported version, locally in the CLI tool without ever touching anything in the cluster. In this version of kubeadm, the following API versions are supported:
- kubeadm.k8s.io/v1beta3
Further, kubeadm can only write out config of version "kubeadm.k8s.io/v1beta3", but read both types. So regardless of what version you pass to the --old-config parameter here, the API object will be read, deserialized, defaulted, converted, validated, and re-serialized when written to stdout or --new-config if specified.
In other words, the output of this command is what kubeadm actually would read internally if you submitted this file to "kubeadm init"
kubeadm config migrate [flags]
Options
-h, --help
    Help for migrate.

--new-config string
    Path to the resulting equivalent kubeadm config file using the new API version. Optional, if not specified output will be sent to STDOUT.

--old-config string
    Path to the kubeadm config file that is using an old API version and should be converted. This flag is mandatory.

Options inherited from parent commands

--kubeconfig string     Default: "/etc/kubernetes/admin.conf"
    The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.

--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
kubeadm config images list
Print a list of images kubeadm will use. The configuration file is used in case any images or image repositories are customized
Synopsis
Print a list of images kubeadm will use. The configuration file is used in case any images or image repositories are customized
kubeadm config images list [flags]
Options
--allow-missing-template-keys     Default: true
    If true, ignore any errors in templates when a field or map key is missing in the template. Only applies to golang and jsonpath output formats.

--config string
    Path to a kubeadm configuration file.

-o, --experimental-output string     Default: "text"
    Output format. One of: text|json|yaml|go-template|go-template-file|template|templatefile|jsonpath|jsonpath-as-json|jsonpath-file.

--feature-gates string
    A set of key=value pairs that describe feature gates for various features.

-h, --help
    Help for list.

--image-repository string     Default: "k8s.gcr.io"
    Choose a container registry to pull control plane images from.

--kubernetes-version string     Default: "stable-1"
    Choose a specific Kubernetes version for the control plane.

--show-managed-fields
    If true, keep the managedFields when printing objects in JSON or YAML format.

Options inherited from parent commands

--kubeconfig string     Default: "/etc/kubernetes/admin.conf"
    The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.

--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
kubeadm config images pull
Pull images used by kubeadm
Synopsis
Pull images used by kubeadm
kubeadm config images pull [flags]
Options
--config string
    Path to a kubeadm configuration file.

--cri-socket string
    Path to the CRI socket to connect to. If empty kubeadm will try to auto-detect this value; use this option only if you have more than one CRI installed or if you have a non-standard CRI socket.

--feature-gates string
    A set of key=value pairs that describe feature gates for various features.

-h, --help
    Help for pull.

--image-repository string     Default: "k8s.gcr.io"
    Choose a container registry to pull control plane images from.

--kubernetes-version string     Default: "stable-1"
    Choose a specific Kubernetes version for the control plane.

Options inherited from parent commands

--kubeconfig string     Default: "/etc/kubernetes/admin.conf"
    The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.

--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
What's next
- kubeadm upgrade to upgrade a Kubernetes cluster to a newer version
6.9.1.6 - kubeadm reset
Performs a best effort revert of changes made by kubeadm init
or kubeadm join
.
Performs a best effort revert of changes made to this host by 'kubeadm init' or 'kubeadm join'
Synopsis
Performs a best effort revert of changes made to this host by 'kubeadm init' or 'kubeadm join'
The "reset" command executes the following phases:
preflight Run reset pre-flight checks
remove-etcd-member Remove a local etcd member.
cleanup-node Run cleanup node.
kubeadm reset [flags]
Options
--cert-dir string     Default: "/etc/kubernetes/pki"
    The path to the directory where the certificates are stored. If specified, clean this directory.

--cri-socket string
    Path to the CRI socket to connect to. If empty kubeadm will try to auto-detect this value; use this option only if you have more than one CRI installed or if you have a non-standard CRI socket.

-f, --force
    Reset the node without prompting for confirmation.

-h, --help
    Help for reset.

--ignore-preflight-errors strings
    A list of checks whose errors will be shown as warnings. Example: 'IsPrivilegedUser,Swap'. Value 'all' ignores errors from all checks.

--kubeconfig string     Default: "/etc/kubernetes/admin.conf"
    The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.

--skip-phases strings
    List of phases to be skipped.

Options inherited from parent commands

--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
Reset workflow
kubeadm reset is responsible for cleaning up a node's local file system from files that were created using the kubeadm init or kubeadm join commands. For control-plane nodes, reset also removes the local stacked etcd member of this node from the etcd cluster.
kubeadm reset phase
can be used to execute the separate phases of the above workflow.
To skip a list of phases you can use the --skip-phases
flag, which works in a similar way to
the kubeadm join
and kubeadm init
phase runners.
External etcd clean up
kubeadm reset
will not delete any etcd data if external etcd is used. This means that if you run kubeadm init
again using the same etcd endpoints, you will see state from previous clusters.
To wipe etcd data it is recommended you use a client like etcdctl, such as:
etcdctl del "" --prefix
See the etcd documentation for more information.
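If your external etcd requires TLS client authentication, the same wipe might look like this; the endpoint and certificate paths are assumptions:
etcdctl --endpoints=https://etcd1.example.com:2379 \
    --cacert=/etc/etcd/ca.crt --cert=/etc/etcd/client.crt --key=/etc/etcd/client.key \
    del "" --prefix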
What's next
- kubeadm init to bootstrap a Kubernetes control-plane node
- kubeadm join to bootstrap a Kubernetes worker node and join it to the cluster
6.9.1.7 - kubeadm token
Bootstrap tokens are used for establishing bidirectional trust between a node joining the cluster and a control-plane node, as described in authenticating with bootstrap tokens.
kubeadm init
creates an initial token with a 24-hour TTL. The following commands allow you to manage
such a token and also to create and manage new ones.
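For instance, a common pattern is to create a short-lived token together with the full join command, then inspect and clean up tokens later; the token value shown is a placeholder:
kubeadm token create --ttl 1h --print-join-command
kubeadm token list
kubeadm token delete abcdef.0123456789abcdef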
kubeadm token create
Create bootstrap tokens on the server
Synopsis
This command will create a bootstrap token for you. You can specify the usages for this token, the "time to live" and an optional human friendly description.
The [token] is the actual token to write. This should be a securely generated random token of the form "[a-z0-9]{6}.[a-z0-9]{16}". If no [token] is given, kubeadm will generate a random token instead.
kubeadm token create [token]
Options
--certificate-key string
    When used together with '--print-join-command', print the full 'kubeadm join' flag needed to join the cluster as a control-plane. To create a new certificate key you must use 'kubeadm init phase upload-certs --upload-certs'.

--config string
    Path to a kubeadm configuration file.

--description string
    A human friendly description of how this token is used.

--groups strings     Default: "system:bootstrappers:kubeadm:default-node-token"
    Extra groups that this token will authenticate as when used for authentication. Must match "\Asystem:bootstrappers:[a-z0-9:-]{0,255}[a-z0-9]\z".

-h, --help
    Help for create.

--print-join-command
    Instead of printing only the token, print the full 'kubeadm join' flag needed to join the cluster using the token.

--ttl duration     Default: 24h0m0s
    The duration before the token is automatically deleted (e.g. 1s, 2m, 3h). If set to '0', the token will never expire.

--usages strings     Default: "signing,authentication"
    Describes the ways in which this token can be used. You can pass --usages multiple times or provide a comma separated list of options. Valid options: [signing,authentication].

Options inherited from parent commands

--dry-run
    Whether to enable dry-run mode or not.

--kubeconfig string     Default: "/etc/kubernetes/admin.conf"
    The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.

--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
kubeadm token delete
Delete bootstrap tokens on the server
Synopsis
This command will delete a list of bootstrap tokens for you.
The [token-value] is the full Token of the form "[a-z0-9]{6}.[a-z0-9]{16}" or the Token ID of the form "[a-z0-9]{6}" to delete.
kubeadm token delete [token-value] ...
Options
-h, --help
    Help for delete.

Options inherited from parent commands

--dry-run
    Whether to enable dry-run mode or not.

--kubeconfig string     Default: "/etc/kubernetes/admin.conf"
    The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.

--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
kubeadm token generate
Generate and print a bootstrap token, but do not create it on the server
Synopsis
This command will print out a randomly-generated bootstrap token that can be used with the "init" and "join" commands.
You don't have to use this command in order to generate a token. You can do so yourself as long as it is in the format "[a-z0-9]{6}.[a-z0-9]{16}". This command is provided for convenience to generate tokens in the given format.
You can also use "kubeadm init" without specifying a token and it will generate and print one for you.
kubeadm token generate [flags]
Options
-h, --help
    Help for generate.

Options inherited from parent commands

--dry-run
    Whether to enable dry-run mode or not.

--kubeconfig string     Default: "/etc/kubernetes/admin.conf"
    The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.

--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
kubeadm token list
List bootstrap tokens on the server
Synopsis
This command will list all bootstrap tokens for you.
kubeadm token list [flags]
Options
--allow-missing-template-keys     Default: true
    If true, ignore any errors in templates when a field or map key is missing in the template. Only applies to golang and jsonpath output formats.

-o, --experimental-output string     Default: "text"
    Output format. One of: text|json|yaml|go-template|go-template-file|template|templatefile|jsonpath|jsonpath-as-json|jsonpath-file.

-h, --help
    Help for list.

--show-managed-fields
    If true, keep the managedFields when printing objects in JSON or YAML format.

Options inherited from parent commands

--dry-run
    Whether to enable dry-run mode or not.

--kubeconfig string     Default: "/etc/kubernetes/admin.conf"
    The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.

--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
What's next
- kubeadm join to bootstrap a Kubernetes worker node and join it to the cluster
6.9.1.8 - kubeadm version
This command prints the version of kubeadm.
Print the version of kubeadm
Synopsis
Print the version of kubeadm
kubeadm version [flags]
Options
-h, --help
    Help for version.

-o, --output string
    Output format; available options are 'yaml', 'json' and 'short'.

Options inherited from parent commands

--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.9 - kubeadm alpha
kubeadm alpha
provides a preview of a set of features made available for gathering feedback
from the community. Please try it out and give us feedback!
Currently there are no experimental commands under kubeadm alpha
.
What's next
- kubeadm init to bootstrap a Kubernetes control-plane node
- kubeadm join to connect a node to the cluster
- kubeadm reset to revert any changes made to this host by kubeadm init or kubeadm join
6.9.1.10 - kubeadm certs
kubeadm certs
provides utilities for managing certificates.
For more details on how these commands can be used, see
Certificate Management with kubeadm.
kubeadm certs
A collection of operations for operating Kubernetes certificates.
Commands related to handling kubernetes certificates
Synopsis
Commands related to handling kubernetes certificates
Options
-h, --help
    Help for certs.

Options inherited from parent commands

--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
kubeadm certs renew
You can renew all Kubernetes certificates using the all
subcommand or renew them selectively.
For more details see Manual certificate renewal.
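For example, to renew every kubeadm-managed certificate on this node, or just one of them:
sudo kubeadm certs renew all
sudo kubeadm certs renew apiserver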
Renew certificates for a Kubernetes cluster
Synopsis
This command is not meant to be run on its own. See list of available subcommands.
kubeadm certs renew [flags]
Options
-h, --help
    Help for renew.

Options inherited from parent commands

--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
Renew all available certificates
Synopsis
Renew all known certificates necessary to run the control plane. Renewals are run unconditionally, regardless of expiration date. Renewals can also be run individually for more control.
kubeadm certs renew all [flags]
Options
--cert-dir string     Default: "/etc/kubernetes/pki"
    The path where to save the certificates.

--config string
    Path to a kubeadm configuration file.

-h, --help
    Help for all.

--kubeconfig string     Default: "/etc/kubernetes/admin.conf"
    The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.

Options inherited from parent commands

--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
Renew the certificate embedded in the kubeconfig file for the admin to use and for kubeadm itself
Synopsis
Renew the certificate embedded in the kubeconfig file for the admin to use and for kubeadm itself.
Renewals run unconditionally, regardless of certificate expiration date; extra attributes such as SANs will be based on the existing file/certificates, so there is no need to resupply them.
Renewal by default tries to use the certificate authority in the local PKI managed by kubeadm; as an alternative it is possible to use the Kubernetes certificate API for certificate renewal, or, as a last option, to generate a CSR request.
After renewal, in order to make the changes effective, you must restart the control-plane components and, if the renewed file is used elsewhere, re-distribute it.
kubeadm certs renew admin.conf [flags]
Options

--cert-dir string     Default: "/etc/kubernetes/pki"
    The path where to save the certificates.

--config string
    Path to a kubeadm configuration file.

-h, --help
    Help for admin.conf.

--kubeconfig string     Default: "/etc/kubernetes/admin.conf"
    The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.

Options inherited from parent commands

--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
Renew the certificate the apiserver uses to access etcd
Synopsis
Renew the certificate the apiserver uses to access etcd.
Renewals run unconditionally, regardless of certificate expiration date; extra attributes such as SANs will be based on the existing file/certificates, so there is no need to resupply them.
Renewal by default tries to use the certificate authority in the local PKI managed by kubeadm; as an alternative it is possible to use the Kubernetes certificate API for certificate renewal, or, as a last option, to generate a CSR request.
After renewal, in order to make the changes effective, you must restart the control-plane components and, if the renewed file is used elsewhere, re-distribute it.
kubeadm certs renew apiserver-etcd-client [flags]
Options

--cert-dir string     Default: "/etc/kubernetes/pki"
    The path where to save the certificates.

--config string
    Path to a kubeadm configuration file.

-h, --help
    Help for apiserver-etcd-client.

--kubeconfig string     Default: "/etc/kubernetes/admin.conf"
    The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.

Options inherited from parent commands

--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
Renew the certificate for the API server to connect to kubelet
Synopsis
Renew the certificate for the API server to connect to kubelet.
Renewals run unconditionally, regardless of certificate expiration date; extra attributes such as SANs will be based on the existing file/certificates, there is no need to resupply them.
Renewal by default tries to use the certificate authority in the local PKI managed by kubeadm; as alternative it is possible to use K8s certificate API for certificate renewal, or as a last option, to generate a CSR request.
After renewal, in order to make changes effective, is required to restart control-plane components and eventually re-distribute the renewed certificate in case the file is used elsewhere.
kubeadm certs renew apiserver-kubelet-client [flags]
Options
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save the certificates
--config string
    Path to a kubeadm configuration file.
-h, --help
    help for apiserver-kubelet-client
--kubeconfig string    Default: "/etc/kubernetes/admin.conf"
    The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
Renew the certificate for serving the Kubernetes API
Synopsis
Renew the certificate for serving the Kubernetes API.
Renewals run unconditionally, regardless of certificate expiration date; extra attributes such as SANs are based on the existing file/certificates, so there is no need to resupply them.
By default, renewal uses the certificate authority in the local PKI managed by kubeadm; as an alternative, it is possible to use the Kubernetes certificate API for renewal or, as a last option, to generate a CSR.
After renewal, you must restart the control-plane components to make the change effective and, if the file is used elsewhere, redistribute the renewed certificate.
kubeadm certs renew apiserver [flags]
Options
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save the certificates
--config string
    Path to a kubeadm configuration file.
-h, --help
    help for apiserver
--kubeconfig string    Default: "/etc/kubernetes/admin.conf"
    The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
Renew the certificate embedded in the kubeconfig file for the controller manager to use
Synopsis
Renew the certificate embedded in the kubeconfig file for the controller manager to use.
Renewals run unconditionally, regardless of certificate expiration date; extra attributes such as SANs are based on the existing file/certificates, so there is no need to resupply them.
By default, renewal uses the certificate authority in the local PKI managed by kubeadm; as an alternative, it is possible to use the Kubernetes certificate API for renewal or, as a last option, to generate a CSR.
After renewal, you must restart the control-plane components to make the change effective and, if the file is used elsewhere, redistribute the renewed certificate.
kubeadm certs renew controller-manager.conf [flags]
Options
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save the certificates
--config string
    Path to a kubeadm configuration file.
-h, --help
    help for controller-manager.conf
--kubeconfig string    Default: "/etc/kubernetes/admin.conf"
    The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
Renew the certificate for liveness probes to healthcheck etcd
Synopsis
Renew the certificate for liveness probes to healthcheck etcd.
Renewals run unconditionally, regardless of certificate expiration date; extra attributes such as SANs are based on the existing file/certificates, so there is no need to resupply them.
By default, renewal uses the certificate authority in the local PKI managed by kubeadm; as an alternative, it is possible to use the Kubernetes certificate API for renewal or, as a last option, to generate a CSR.
After renewal, you must restart the control-plane components to make the change effective and, if the file is used elsewhere, redistribute the renewed certificate.
kubeadm certs renew etcd-healthcheck-client [flags]
Options
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save the certificates
--config string
    Path to a kubeadm configuration file.
-h, --help
    help for etcd-healthcheck-client
--kubeconfig string    Default: "/etc/kubernetes/admin.conf"
    The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
Renew the certificate for etcd nodes to communicate with each other
Synopsis
Renew the certificate for etcd nodes to communicate with each other.
Renewals run unconditionally, regardless of certificate expiration date; extra attributes such as SANs are based on the existing file/certificates, so there is no need to resupply them.
By default, renewal uses the certificate authority in the local PKI managed by kubeadm; as an alternative, it is possible to use the Kubernetes certificate API for renewal or, as a last option, to generate a CSR.
After renewal, you must restart the control-plane components to make the change effective and, if the file is used elsewhere, redistribute the renewed certificate.
kubeadm certs renew etcd-peer [flags]
Options
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save the certificates
--config string
    Path to a kubeadm configuration file.
-h, --help
    help for etcd-peer
--kubeconfig string    Default: "/etc/kubernetes/admin.conf"
    The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
Renew the certificate for serving etcd
Synopsis
Renew the certificate for serving etcd.
Renewals run unconditionally, regardless of certificate expiration date; extra attributes such as SANs are based on the existing file/certificates, so there is no need to resupply them.
By default, renewal uses the certificate authority in the local PKI managed by kubeadm; as an alternative, it is possible to use the Kubernetes certificate API for renewal or, as a last option, to generate a CSR.
After renewal, you must restart the control-plane components to make the change effective and, if the file is used elsewhere, redistribute the renewed certificate.
kubeadm certs renew etcd-server [flags]
Options
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save the certificates
--config string
    Path to a kubeadm configuration file.
-h, --help
    help for etcd-server
--kubeconfig string    Default: "/etc/kubernetes/admin.conf"
    The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
Renew the certificate for the front proxy client
Synopsis
Renew the certificate for the front proxy client.
Renewals run unconditionally, regardless of certificate expiration date; extra attributes such as SANs are based on the existing file/certificates, so there is no need to resupply them.
By default, renewal uses the certificate authority in the local PKI managed by kubeadm; as an alternative, it is possible to use the Kubernetes certificate API for renewal or, as a last option, to generate a CSR.
After renewal, you must restart the control-plane components to make the change effective and, if the file is used elsewhere, redistribute the renewed certificate.
kubeadm certs renew front-proxy-client [flags]
Options
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save the certificates
--config string
    Path to a kubeadm configuration file.
-h, --help
    help for front-proxy-client
--kubeconfig string    Default: "/etc/kubernetes/admin.conf"
    The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
Renew the certificate embedded in the kubeconfig file for the scheduler to use
Synopsis
Renew the certificate embedded in the kubeconfig file for the scheduler to use.
Renewals run unconditionally, regardless of certificate expiration date; extra attributes such as SANs are based on the existing file/certificates, so there is no need to resupply them.
By default, renewal uses the certificate authority in the local PKI managed by kubeadm; as an alternative, it is possible to use the Kubernetes certificate API for renewal or, as a last option, to generate a CSR.
After renewal, you must restart the control-plane components to make the change effective and, if the file is used elsewhere, redistribute the renewed certificate.
kubeadm certs renew scheduler.conf [flags]
Options
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save the certificates
--config string
    Path to a kubeadm configuration file.
-h, --help
    help for scheduler.conf
--kubeconfig string    Default: "/etc/kubernetes/admin.conf"
    The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
kubeadm certs certificate-key
This command can be used to generate a new control-plane certificate key. The key can be passed as --certificate-key to kubeadm init and kubeadm join to enable the automatic copy of certificates when joining additional control-plane nodes.
Generate certificate keys
Synopsis
This command will print out a secure randomly-generated certificate key that can be used with the "init" command.
You can also use "kubeadm init --upload-certs" without specifying a certificate key and it will generate and print one for you.
kubeadm certs certificate-key [flags]
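As a sketch of how the generated key is typically consumed in the HA workflow (the join endpoint, token, and hash below are placeholders):
# Generate a key and reuse it for the certificate upload during init
KEY=$(kubeadm certs certificate-key)
kubeadm init --upload-certs --certificate-key "$KEY"
# Later, on an additional control-plane node:
# kubeadm join <endpoint>:6443 --control-plane --certificate-key "$KEY" --token <token> --discovery-token-ca-cert-hash <hash>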
Options
-h, --help
    help for certificate-key
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
kubeadm certs check-expiration
This command checks expiration for the certificates in the local PKI managed by kubeadm. For more details see Check certificate expiration.
Check certificates expiration for a Kubernetes cluster
Synopsis
Checks expiration for the certificates in the local PKI managed by kubeadm.
kubeadm certs check-expiration [flags]
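For example (the custom directory is a placeholder):
# Check expiration using the default PKI directory
kubeadm certs check-expiration
# Check a cluster whose certificates live in a non-default directory
kubeadm certs check-expiration --cert-dir /custom/pki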
Options
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save the certificates
--config string
    Path to a kubeadm configuration file.
-h, --help
    help for check-expiration
--kubeconfig string    Default: "/etc/kubernetes/admin.conf"
    The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
kubeadm certs generate-csr
This command can be used to generate keys and CSRs for all control-plane certificates and kubeconfig files. The user can then sign the CSRs with a CA of their choice.
Generate keys and certificate signing requests
Synopsis
Generates keys and certificate signing requests (CSRs) for all the certificates required to run the control plane. This command also generates partial kubeconfig files with private key data in the "users > user > client-key-data" field, and for each kubeconfig file an accompanying ".csr" file is created.
This command is designed for use in Kubeadm External CA Mode. It generates CSRs which you can then submit to your external certificate authority for signing.
The PEM encoded signed certificates should then be saved alongside the key files, using ".crt" as the file extension, or in the case of kubeconfig files, the PEM encoded signed certificate should be base64 encoded and added to the kubeconfig file in the "users > user > client-certificate-data" field.
kubeadm certs generate-csr [flags]
Examples
# The following command will generate keys and CSRs for all control-plane certificates and kubeconfig files:
kubeadm certs generate-csr --kubeconfig-dir /tmp/etc-k8s --cert-dir /tmp/etc-k8s/pki
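The signing step itself happens outside kubeadm. As one illustrative sketch using OpenSSL 3.x (the CA file names, validity period, and chosen CSR are assumptions, not kubeadm requirements; -copy_extensions preserves the SANs encoded in the CSR):
openssl x509 -req -in /tmp/etc-k8s/pki/apiserver.csr \
    -CA ca.crt -CAkey ca.key -CAcreateserial \
    -days 365 -copy_extensions copyall \
    -out /tmp/etc-k8s/pki/apiserver.crt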
Options
--cert-dir string
    The path where to save the certificates
--config string
    Path to a kubeadm configuration file.
-h, --help
    help for generate-csr
--kubeconfig-dir string    Default: "/etc/kubernetes"
    The path where to save the kubeconfig file.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
What's next
- kubeadm init to bootstrap a Kubernetes control-plane node
- kubeadm join to connect a node to the cluster
- kubeadm reset to revert any changes made to this host by kubeadm init or kubeadm join
6.9.1.11 - kubeadm init phase
kubeadm init phase enables you to invoke atomic steps of the bootstrap process. Hence, you can let kubeadm do some of the work and fill in the gaps yourself if you wish to apply customization.
kubeadm init phase is consistent with the kubeadm init workflow, and behind the scenes both use the same code.
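For example, you can run a single phase on its own, or run the full workflow while skipping selected phases (--skip-phases is a documented kubeadm init flag):
# Run only the preflight phase
kubeadm init phase preflight
# Run a full init but skip the kube-proxy addon phase
kubeadm init --skip-phases=addon/kube-proxy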
kubeadm init phase preflight
Using this command you can execute preflight checks on a control-plane node.
Run pre-flight checks
Synopsis
Run pre-flight checks for kubeadm init.
kubeadm init phase preflight [flags]
Examples
# Run pre-flight checks for kubeadm init using a config file.
kubeadm init phase preflight --config kubeadm-config.yaml
Options
--config string
    Path to a kubeadm configuration file.
-h, --help
    help for preflight
--ignore-preflight-errors strings
    A list of checks whose errors will be shown as warnings. Example: 'IsPrivilegedUser,Swap'. Value 'all' ignores errors from all checks.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
kubeadm init phase kubelet-start
This phase will write the kubelet configuration file and environment file and then start the kubelet.
Write kubelet settings and (re)start the kubelet
Synopsis
Write a file with KubeletConfiguration and an environment file with node specific kubelet settings, and then (re)start kubelet.
kubeadm init phase kubelet-start [flags]
Examples
# Writes a dynamic environment file with kubelet flags from an InitConfiguration file.
kubeadm init phase kubelet-start --config config.yaml
Options
--config string
    Path to a kubeadm configuration file.
--cri-socket string
    Path to the CRI socket to connect. If empty kubeadm will try to auto-detect this value; use this option only if you have more than one CRI installed or if you have a non-standard CRI socket.
-h, --help
    help for kubelet-start
--node-name string
    Specify the node name.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
kubeadm init phase certs
Can be used to create all certificates required by kubeadm.
Certificate generation
Synopsis
This command is not meant to be run on its own. See list of available subcommands.
kubeadm init phase certs [flags]
Options
-h, --help
    help for certs
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
Generate all certificates
Synopsis
Generate all certificates
kubeadm init phase certs all [flags]
Options
--apiserver-advertise-address string
    The IP address the API Server will advertise it's listening on. If not set the default network interface will be used.
--apiserver-cert-extra-sans strings
    Optional extra Subject Alternative Names (SANs) to use for the API Server serving certificate. Can be both IP addresses and DNS names.
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save and store the certificates.
--config string
    Path to a kubeadm configuration file.
--control-plane-endpoint string
    Specify a stable IP address or DNS name for the control plane.
-h, --help
    help for all
--kubernetes-version string    Default: "stable-1"
    Choose a specific Kubernetes version for the control plane.
--service-cidr string    Default: "10.96.0.0/12"
    Use alternative range of IP address for service VIPs.
--service-dns-domain string    Default: "cluster.local"
    Use alternative domain for services, e.g. "myorg.internal".
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
Generate the self-signed Kubernetes CA to provision identities for other Kubernetes components
Synopsis
Generate the self-signed Kubernetes CA to provision identities for other Kubernetes components, and save them into ca.crt and ca.key files.
If both files already exist, kubeadm skips the generation step and existing files will be used.
Alpha Disclaimer: this command is currently alpha.
kubeadm init phase certs ca [flags]
Options
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save and store the certificates.
--config string
    Path to a kubeadm configuration file.
-h, --help
    help for ca
--kubernetes-version string    Default: "stable-1"
    Choose a specific Kubernetes version for the control plane.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
Generate the certificate for serving the Kubernetes API
Synopsis
Generate the certificate for serving the Kubernetes API, and save them into apiserver.crt and apiserver.key files.
If both files already exist, kubeadm skips the generation step and existing files will be used.
Alpha Disclaimer: this command is currently alpha.
kubeadm init phase certs apiserver [flags]
Options
--apiserver-advertise-address string
    The IP address the API Server will advertise it's listening on. If not set the default network interface will be used.
--apiserver-cert-extra-sans strings
    Optional extra Subject Alternative Names (SANs) to use for the API Server serving certificate. Can be both IP addresses and DNS names.
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save and store the certificates.
--config string
    Path to a kubeadm configuration file.
--control-plane-endpoint string
    Specify a stable IP address or DNS name for the control plane.
-h, --help
    help for apiserver
--kubernetes-version string    Default: "stable-1"
    Choose a specific Kubernetes version for the control plane.
--service-cidr string    Default: "10.96.0.0/12"
    Use alternative range of IP address for service VIPs.
--service-dns-domain string    Default: "cluster.local"
    Use alternative domain for services, e.g. "myorg.internal".
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
Generate the certificate for the API server to connect to kubelet
Synopsis
Generate the certificate for the API server to connect to kubelet, and save them into apiserver-kubelet-client.crt and apiserver-kubelet-client.key files.
If both files already exist, kubeadm skips the generation step and existing files will be used.
Alpha Disclaimer: this command is currently alpha.
kubeadm init phase certs apiserver-kubelet-client [flags]
Options
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save and store the certificates.
--config string
    Path to a kubeadm configuration file.
-h, --help
    help for apiserver-kubelet-client
--kubernetes-version string    Default: "stable-1"
    Choose a specific Kubernetes version for the control plane.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
Generate the self-signed CA to provision identities for front proxy
Synopsis
Generate the self-signed CA to provision identities for front proxy, and save them into front-proxy-ca.crt and front-proxy-ca.key files.
If both files already exist, kubeadm skips the generation step and existing files will be used.
Alpha Disclaimer: this command is currently alpha.
kubeadm init phase certs front-proxy-ca [flags]
Options
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save and store the certificates.
--config string
    Path to a kubeadm configuration file.
-h, --help
    help for front-proxy-ca
--kubernetes-version string    Default: "stable-1"
    Choose a specific Kubernetes version for the control plane.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
Generate the certificate for the front proxy client
Synopsis
Generate the certificate for the front proxy client, and save them into front-proxy-client.crt and front-proxy-client.key files.
If both files already exist, kubeadm skips the generation step and existing files will be used.
Alpha Disclaimer: this command is currently alpha.
kubeadm init phase certs front-proxy-client [flags]
Options
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save and store the certificates.
--config string
    Path to a kubeadm configuration file.
-h, --help
    help for front-proxy-client
--kubernetes-version string    Default: "stable-1"
    Choose a specific Kubernetes version for the control plane.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
Generate the self-signed CA to provision identities for etcd
Synopsis
Generate the self-signed CA to provision identities for etcd, and save them into etcd/ca.crt and etcd/ca.key files.
If both files already exist, kubeadm skips the generation step and existing files will be used.
Alpha Disclaimer: this command is currently alpha.
kubeadm init phase certs etcd-ca [flags]
Options
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save and store the certificates.
--config string
    Path to a kubeadm configuration file.
-h, --help
    help for etcd-ca
--kubernetes-version string    Default: "stable-1"
    Choose a specific Kubernetes version for the control plane.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
Generate the certificate for serving etcd
Synopsis
Generate the certificate for serving etcd, and save them into etcd/server.crt and etcd/server.key files.
Default SANs are localhost, 127.0.0.1, and ::1.
If both files already exist, kubeadm skips the generation step and existing files will be used.
Alpha Disclaimer: this command is currently alpha.
kubeadm init phase certs etcd-server [flags]
Options
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save and store the certificates.
--config string
    Path to a kubeadm configuration file.
-h, --help
    help for etcd-server
--kubernetes-version string    Default: "stable-1"
    Choose a specific Kubernetes version for the control plane.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
Generate the certificate for etcd nodes to communicate with each other
Synopsis
Generate the certificate for etcd nodes to communicate with each other, and save them into etcd/peer.crt and etcd/peer.key files.
Default SANs are localhost, 127.0.0.1, and ::1.
If both files already exist, kubeadm skips the generation step and existing files will be used.
Alpha Disclaimer: this command is currently alpha.
kubeadm init phase certs etcd-peer [flags]
Options
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save and store the certificates.
--config string
    Path to a kubeadm configuration file.
-h, --help
    help for etcd-peer
--kubernetes-version string    Default: "stable-1"
    Choose a specific Kubernetes version for the control plane.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
Generate the certificate for liveness probes to healthcheck etcd
Synopsis
Generate the certificate for liveness probes to healthcheck etcd, and save them into etcd/healthcheck-client.crt and etcd/healthcheck-client.key files.
If both files already exist, kubeadm skips the generation step and existing files will be used.
Alpha Disclaimer: this command is currently alpha.
kubeadm init phase certs etcd-healthcheck-client [flags]
Options
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save and store the certificates.
--config string
    Path to a kubeadm configuration file.
-h, --help
    help for etcd-healthcheck-client
--kubernetes-version string    Default: "stable-1"
    Choose a specific Kubernetes version for the control plane.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
Generate the certificate the apiserver uses to access etcd
Synopsis
Generate the certificate the apiserver uses to access etcd, and save them into apiserver-etcd-client.crt and apiserver-etcd-client.key files.
If both files already exist, kubeadm skips the generation step and existing files will be used.
Alpha Disclaimer: this command is currently alpha.
kubeadm init phase certs apiserver-etcd-client [flags]
Options
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save and store the certificates.
--config string
    Path to a kubeadm configuration file.
-h, --help
    help for apiserver-etcd-client
--kubernetes-version string    Default: "stable-1"
    Choose a specific Kubernetes version for the control plane.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
Generate a private key for signing service account tokens along with its public key
Synopsis
Generate the private key for signing service account tokens along with its public key, and save them into sa.key and sa.pub files. If both files already exist, kubeadm skips the generation step and existing files will be used.
Alpha Disclaimer: this command is currently alpha.
kubeadm init phase certs sa [flags]
Options
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save and store the certificates.
-h, --help
    help for sa
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
kubeadm init phase kubeconfig
You can create all required kubeconfig files by calling the all subcommand, or create them individually.
Generate all kubeconfig files necessary to establish the control plane and the admin kubeconfig file
Synopsis
This command is not meant to be run on its own. See list of available subcommands.
kubeadm init phase kubeconfig [flags]
Options
-h, --help
    help for kubeconfig
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
Generate all kubeconfig files
Synopsis
Generate all kubeconfig files
kubeadm init phase kubeconfig all [flags]
Options
--apiserver-advertise-address string
    The IP address the API Server will advertise it's listening on. If not set the default network interface will be used.
--apiserver-bind-port int32    Default: 6443
    Port for the API Server to bind to.
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save and store the certificates.
--config string
    Path to a kubeadm configuration file.
--control-plane-endpoint string
    Specify a stable IP address or DNS name for the control plane.
-h, --help
    help for all
--kubeconfig-dir string    Default: "/etc/kubernetes"
    The path where to save the kubeconfig file.
--kubernetes-version string    Default: "stable-1"
    Choose a specific Kubernetes version for the control plane.
--node-name string
    Specify the node name.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
Generate a kubeconfig file for the admin to use and for kubeadm itself
Synopsis
Generate the kubeconfig file for the admin and for kubeadm itself, and save it to admin.conf file.
kubeadm init phase kubeconfig admin [flags]
Options
--apiserver-advertise-address string
    The IP address the API Server will advertise it's listening on. If not set the default network interface will be used.
--apiserver-bind-port int32    Default: 6443
    Port for the API Server to bind to.
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save and store the certificates.
--config string
    Path to a kubeadm configuration file.
--control-plane-endpoint string
    Specify a stable IP address or DNS name for the control plane.
-h, --help
    help for admin
--kubeconfig-dir string    Default: "/etc/kubernetes"
    The path where to save the kubeconfig file.
--kubernetes-version string    Default: "stable-1"
    Choose a specific Kubernetes version for the control plane.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
Generate a kubeconfig file for the kubelet to use only for cluster bootstrapping purposes
Synopsis
Generate the kubeconfig file for the kubelet to use and save it to kubelet.conf file.
Please note that this should only be used for cluster bootstrapping purposes. After your control plane is up, you should request all kubelet credentials from the CSR API.
kubeadm init phase kubeconfig kubelet [flags]
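As a sketch of requesting kubelet credentials through the CSR API once the control plane is up (ordinary kubectl commands; the CSR name is a placeholder):
# List pending certificate signing requests
kubectl get csr
# Approve a specific request
kubectl certificate approve <csr-name>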
Options
--apiserver-advertise-address string
    The IP address the API Server will advertise it's listening on. If not set the default network interface will be used.
--apiserver-bind-port int32    Default: 6443
    Port for the API Server to bind to.
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save and store the certificates.
--config string
    Path to a kubeadm configuration file.
--control-plane-endpoint string
    Specify a stable IP address or DNS name for the control plane.
-h, --help
    help for kubelet
--kubeconfig-dir string    Default: "/etc/kubernetes"
    The path where to save the kubeconfig file.
--kubernetes-version string    Default: "stable-1"
    Choose a specific Kubernetes version for the control plane.
--node-name string
    Specify the node name.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
Generate a kubeconfig file for the controller manager to use
Synopsis
Generate the kubeconfig file for the controller manager to use and save it to the controller-manager.conf file.
kubeadm init phase kubeconfig controller-manager [flags]
Options
--apiserver-advertise-address string
    The IP address the API Server will advertise it's listening on. If not set the default network interface will be used.
--apiserver-bind-port int32    Default: 6443
    Port for the API Server to bind to.
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save and store the certificates.
--config string
    Path to a kubeadm configuration file.
--control-plane-endpoint string
    Specify a stable IP address or DNS name for the control plane.
-h, --help
    help for controller-manager
--kubeconfig-dir string    Default: "/etc/kubernetes"
    The path where to save the kubeconfig file.
--kubernetes-version string    Default: "stable-1"
    Choose a specific Kubernetes version for the control plane.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
Generate a kubeconfig file for the scheduler to use
Synopsis
Generate the kubeconfig file for the scheduler to use and save it to scheduler.conf file.
kubeadm init phase kubeconfig scheduler [flags]
Options
--apiserver-advertise-address string
    The IP address the API Server will advertise it's listening on. If not set the default network interface will be used.
--apiserver-bind-port int32    Default: 6443
    Port for the API Server to bind to.
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save and store the certificates.
--config string
    Path to a kubeadm configuration file.
--control-plane-endpoint string
    Specify a stable IP address or DNS name for the control plane.
-h, --help
    help for scheduler
--kubeconfig-dir string    Default: "/etc/kubernetes"
    The path where to save the kubeconfig file.
--kubernetes-version string    Default: "stable-1"
    Choose a specific Kubernetes version for the control plane.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
kubeadm init phase control-plane
Using this phase you can create all required static Pod files for the control plane components.
Generate all static Pod manifest files necessary to establish the control plane
Synopsis
This command is not meant to be run on its own. See list of available subcommands.
kubeadm init phase control-plane [flags]
Options
-h, --help
    help for control-plane
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
Generate all static Pod manifest files
Synopsis
Generate all static Pod manifest files
kubeadm init phase control-plane all [flags]
Examples
# Generates all static Pod manifest files for control plane components,
# functionally equivalent to what is generated by kubeadm init.
kubeadm init phase control-plane all
# Generates all static Pod manifest files using options read from a configuration file.
kubeadm init phase control-plane all --config config.yaml
Options
--apiserver-advertise-address string
    The IP address the API Server will advertise it's listening on. If not set the default network interface will be used.
--apiserver-bind-port int32    Default: 6443
    Port for the API Server to bind to.
--apiserver-extra-args <comma-separated 'key=value' pairs>
    A set of extra flags to pass to the API Server or override default ones in form of <flagname>=<value>
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save and store the certificates.
--config string
    Path to a kubeadm configuration file.
--control-plane-endpoint string
    Specify a stable IP address or DNS name for the control plane.
--controller-manager-extra-args <comma-separated 'key=value' pairs>
    A set of extra flags to pass to the Controller Manager or override default ones in form of <flagname>=<value>
--dry-run
    Don't apply any changes; just output what would be done.
--feature-gates string
    A set of key=value pairs that describe feature gates for various features. Options are:
-h, --help
    help for all
--image-repository string    Default: "k8s.gcr.io"
    Choose a container registry to pull control plane images from
--kubernetes-version string    Default: "stable-1"
    Choose a specific Kubernetes version for the control plane.
--patches string
    Path to a directory that contains files named "target[suffix][+patchtype].extension". For example, "kube-apiserver0+merge.yaml" or just "etcd.json". "target" can be one of "kube-apiserver", "kube-controller-manager", "kube-scheduler", "etcd". "patchtype" can be one of "strategic", "merge" or "json" and they match the patch formats supported by kubectl. The default "patchtype" is "strategic". "extension" must be either "json" or "yaml". "suffix" is an optional string that can be used to determine which patches are applied first alpha-numerically.
--pod-network-cidr string
    Specify range of IP addresses for the pod network. If set, the control plane will automatically allocate CIDRs for every node.
--scheduler-extra-args <comma-separated 'key=value' pairs>
    A set of extra flags to pass to the Scheduler or override default ones in form of <flagname>=<value>
--service-cidr string    Default: "10.96.0.0/12"
    Use alternative range of IP address for service VIPs.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
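The --patches mechanism described above can be exercised as follows; the patch content and file name are illustrative, following the documented target[suffix][+patchtype].extension naming:
# Create a strategic-merge patch for the kube-apiserver manifest
mkdir -p patches
cat > patches/kube-apiserver0+strategic.yaml <<'EOF'
metadata:
  annotations:
    example.com/note: "added via --patches"
EOF
# Generate the manifests with the patch applied
kubeadm init phase control-plane all --patches ./patches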
Generates the kube-apiserver static Pod manifest
Synopsis
Generates the kube-apiserver static Pod manifest
kubeadm init phase control-plane apiserver [flags]
Options
--apiserver-advertise-address string
    The IP address the API Server will advertise it's listening on. If not set the default network interface will be used.
--apiserver-bind-port int32    Default: 6443
    Port for the API Server to bind to.
--apiserver-extra-args <comma-separated 'key=value' pairs>
    A set of extra flags to pass to the API Server or override default ones in form of <flagname>=<value>
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save and store the certificates.
--config string
    Path to a kubeadm configuration file.
--control-plane-endpoint string
    Specify a stable IP address or DNS name for the control plane.
--dry-run
    Don't apply any changes; just output what would be done.
--feature-gates string
    A set of key=value pairs that describe feature gates for various features. Options are:
-h, --help
    help for apiserver
--image-repository string    Default: "k8s.gcr.io"
    Choose a container registry to pull control plane images from
--kubernetes-version string    Default: "stable-1"
    Choose a specific Kubernetes version for the control plane.
--patches string
    Path to a directory that contains files named "target[suffix][+patchtype].extension". For example, "kube-apiserver0+merge.yaml" or just "etcd.json". "target" can be one of "kube-apiserver", "kube-controller-manager", "kube-scheduler", "etcd". "patchtype" can be one of "strategic", "merge" or "json" and they match the patch formats supported by kubectl. The default "patchtype" is "strategic". "extension" must be either "json" or "yaml". "suffix" is an optional string that can be used to determine which patches are applied first alpha-numerically.
--service-cidr string    Default: "10.96.0.0/12"
    Use alternative range of IP address for service VIPs.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
Generates the kube-controller-manager static Pod manifest
Synopsis
Generates the kube-controller-manager static Pod manifest
kubeadm init phase control-plane controller-manager [flags]
Options
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save and store the certificates.
--config string
    Path to a kubeadm configuration file.
--controller-manager-extra-args <comma-separated 'key=value' pairs>
    A set of extra flags to pass to the Controller Manager or override default ones in form of <flagname>=<value>
--dry-run
    Don't apply any changes; just output what would be done.
-h, --help
    help for controller-manager
--image-repository string    Default: "k8s.gcr.io"
    Choose a container registry to pull control plane images from
--kubernetes-version string    Default: "stable-1"
    Choose a specific Kubernetes version for the control plane.
--patches string
    Path to a directory that contains files named "target[suffix][+patchtype].extension". For example, "kube-apiserver0+merge.yaml" or just "etcd.json". "target" can be one of "kube-apiserver", "kube-controller-manager", "kube-scheduler", "etcd". "patchtype" can be one of "strategic", "merge" or "json" and they match the patch formats supported by kubectl. The default "patchtype" is "strategic". "extension" must be either "json" or "yaml". "suffix" is an optional string that can be used to determine which patches are applied first alpha-numerically.
--pod-network-cidr string
    Specify range of IP addresses for the pod network. If set, the control plane will automatically allocate CIDRs for every node.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
Generates the kube-scheduler static Pod manifest
Synopsis
Generates the kube-scheduler static Pod manifest
kubeadm init phase control-plane scheduler [flags]
Options
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save and store the certificates.
--config string
    Path to a kubeadm configuration file.
--dry-run
    Don't apply any changes; just output what would be done.
-h, --help
    help for scheduler
--image-repository string    Default: "k8s.gcr.io"
    Choose a container registry to pull control plane images from
--kubernetes-version string    Default: "stable-1"
    Choose a specific Kubernetes version for the control plane.
--patches string
    Path to a directory that contains files named "target[suffix][+patchtype].extension". For example, "kube-apiserver0+merge.yaml" or just "etcd.json". "target" can be one of "kube-apiserver", "kube-controller-manager", "kube-scheduler", "etcd". "patchtype" can be one of "strategic", "merge" or "json" and they match the patch formats supported by kubectl. The default "patchtype" is "strategic". "extension" must be either "json" or "yaml". "suffix" is an optional string that can be used to determine which patches are applied first alpha-numerically.
--scheduler-extra-args <comma-separated 'key=value' pairs>
    A set of extra flags to pass to the Scheduler or override default ones in form of <flagname>=<value>
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
kubeadm init phase etcd
Use the following phase to create a local etcd instance based on a static Pod file.
Generate static Pod manifest file for local etcd
Synopsis
This command is not meant to be run on its own. See list of available subcommands.
kubeadm init phase etcd [flags]
Options
-h, --help
    help for etcd
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
Generate the static Pod manifest file for a local, single-node etcd instance
Synopsis
Generate the static Pod manifest file for a local, single-node etcd instance
kubeadm init phase etcd local [flags]
Examples
# Generates the static Pod manifest file for etcd, functionally
# equivalent to what is generated by kubeadm init.
kubeadm init phase etcd local
# Generates the static Pod manifest file for etcd using options
# read from a configuration file.
kubeadm init phase etcd local --config config.yaml
Options
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save and store the certificates.
--config string
    Path to a kubeadm configuration file.
-h, --help
    help for local
--image-repository string    Default: "k8s.gcr.io"
    Choose a container registry to pull control plane images from
--patches string
    Path to a directory that contains files named "target[suffix][+patchtype].extension". For example, "kube-apiserver0+merge.yaml" or just "etcd.json". "target" can be one of "kube-apiserver", "kube-controller-manager", "kube-scheduler", "etcd". "patchtype" can be one of "strategic", "merge" or "json" and they match the patch formats supported by kubectl. The default "patchtype" is "strategic". "extension" must be either "json" or "yaml". "suffix" is an optional string that can be used to determine which patches are applied first alpha-numerically.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
kubeadm init phase upload-config
You can use this command to upload the kubeadm configuration to your cluster. Alternatively, you can use kubeadm config.
Upload the kubeadm and kubelet configuration to a ConfigMap
Synopsis
This command is not meant to be run on its own. See list of available subcommands.
kubeadm init phase upload-config [flags]
Options
-h, --help
    help for upload-config
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
Upload all configuration to a config map
Synopsis
Upload all configuration to a config map
kubeadm init phase upload-config all [flags]
Options
--config string
    Path to a kubeadm configuration file.
-h, --help
    help for all
--kubeconfig string    Default: "/etc/kubernetes/admin.conf"
    The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
Upload the kubeadm ClusterConfiguration to a ConfigMap
Synopsis
Upload the kubeadm ClusterConfiguration to a ConfigMap called kubeadm-config in the kube-system namespace. This enables correct configuration of system components and a seamless user experience when upgrading.
Alternatively, you can use kubeadm config.
kubeadm init phase upload-config kubeadm [flags]
Examples
# upload the configuration of your cluster
kubeadm init phase upload-config --config=myConfig.yaml
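To confirm the result, you can inspect the ConfigMap directly; this is ordinary kubectl usage, not part of kubeadm:
kubectl -n kube-system get configmap kubeadm-config -o yaml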
Options
--config string
    Path to a kubeadm configuration file.
-h, --help
    help for kubeadm
--kubeconfig string    Default: "/etc/kubernetes/admin.conf"
    The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
Upload the kubelet component config to a ConfigMap
Synopsis
Upload kubelet configuration extracted from the kubeadm InitConfiguration object to a ConfigMap of the form kubelet-config-1.X in the cluster, where X is the minor version of the current (API Server) Kubernetes version.
kubeadm init phase upload-config kubelet [flags]
Examples
# Upload the kubelet configuration from the kubeadm Config file to a ConfigMap in the cluster.
kubeadm init phase upload-config kubelet --config kubeadm.yaml
Options
--config string
    Path to a kubeadm configuration file.
-h, --help
    help for kubelet
--kubeconfig string    Default: "/etc/kubernetes/admin.conf"
    The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
kubeadm init phase upload-certs
Use the following phase to upload control-plane certificates to the cluster. By default the certs and encryption key expire after two hours.
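If the uploaded certificates have already expired from the cluster, they can be re-uploaded with a freshly generated key, mirroring the documented workflow:
# Re-upload control-plane certificates and print a new certificate key
kubeadm init phase upload-certs --upload-certs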
Upload certificates to kubeadm-certs
Synopsis
This command is not meant to be run on its own. See list of available subcommands.
kubeadm init phase upload-certs [flags]
Options
--certificate-key string
    Key used to encrypt the control-plane certificates in the kubeadm-certs Secret.
--config string
    Path to a kubeadm configuration file.
-h, --help
    help for upload-certs
--kubeconfig string    Default: "/etc/kubernetes/admin.conf"
    The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.
--skip-certificate-key-print
    Don't print the key used to encrypt the control-plane certificates.
--upload-certs
    Upload control-plane certificates to the kubeadm-certs Secret.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
kubeadm init phase mark-control-plane
Use the following phase to label and taint the node with the node-role.kubernetes.io/master="" key-value pair.
Mark a node as a control-plane
Synopsis
Mark a node as a control-plane
kubeadm init phase mark-control-plane [flags]
Examples
# Applies control-plane label and taint to the current node, functionally equivalent to what is executed by kubeadm init.
kubeadm init phase mark-control-plane --config config.yaml
# Applies control-plane label and taint to a specific node
kubeadm init phase mark-control-plane --node-name myNode
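To see the effect, inspect the node with ordinary kubectl commands (myNode is a placeholder):
# Show the control-plane label
kubectl get node myNode --show-labels
# Show the applied taints
kubectl describe node myNode | grep Taints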
Options
--config string
    Path to a kubeadm configuration file.
-h, --help
    help for mark-control-plane
--node-name string
    Specify the node name.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
kubeadm init phase bootstrap-token
Use the following phase to configure bootstrap tokens.
Generates bootstrap tokens used to join a node to a cluster
Synopsis
Bootstrap tokens are used for establishing bidirectional trust between a node joining the cluster and a control-plane node.
This command performs all the configuration required to make bootstrap tokens work and then creates an initial token.
kubeadm init phase bootstrap-token [flags]
Examples
# Make all the bootstrap token configurations and create an initial token, functionally
# equivalent to what is generated by kubeadm init.
kubeadm init phase bootstrap-token
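The resulting token can then be inspected or reissued with the kubeadm token commands, for example:
# List existing bootstrap tokens
kubeadm token list
# Create another token and print a ready-to-use join command
kubeadm token create --print-join-command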
Options
--config string
    Path to a kubeadm configuration file.
-h, --help
    help for bootstrap-token
--kubeconfig string    Default: "/etc/kubernetes/admin.conf"
    The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.
--skip-token-print
    Skip printing of the default bootstrap token generated by 'kubeadm init'.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
kubeadm init phase kubelet-finalize
Use the following phase to update settings relevant to the kubelet after TLS bootstrap. You can use the all subcommand to run all kubelet-finalize phases.
Updates settings relevant to the kubelet after TLS bootstrap
Synopsis
Updates settings relevant to the kubelet after TLS bootstrap
kubeadm init phase kubelet-finalize [flags]
Examples
# Updates settings relevant to the kubelet after TLS bootstrap
kubeadm init phase kubelet-finalize all --config kubeadm-config.yaml
Options
-h, --help
    help for kubelet-finalize
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
Run all kubelet-finalize phases
Synopsis
Run all kubelet-finalize phases
kubeadm init phase kubelet-finalize all [flags]
Examples
# Updates settings relevant to the kubelet after TLS bootstrap
kubeadm init phase kubelet-finalize all --config kubeadm-config.yaml
Options
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save and store the certificates.
--config string
    Path to a kubeadm configuration file.
-h, --help
    help for all
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
Enable kubelet client certificate rotation
Synopsis
Enable kubelet client certificate rotation
kubeadm init phase kubelet-finalize experimental-cert-rotation [flags]
Options
--cert-dir string    Default: "/etc/kubernetes/pki"
    The path where to save and store the certificates.
--config string
    Path to a kubeadm configuration file.
-h, --help
    help for experimental-cert-rotation
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
kubeadm init phase addon
You can install all the available addons with the all subcommand, or install them selectively.
Install required addons for passing conformance tests
Synopsis
This command is not meant to be run on its own. See list of available subcommands.
kubeadm init phase addon [flags]
Options
-h, --help
    help for addon
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
Install all the addons
Synopsis
Install all the addons
kubeadm init phase addon all [flags]
Options
--apiserver-advertise-address string
    The IP address the API Server will advertise it's listening on. If not set the default network interface will be used.
--apiserver-bind-port int32    Default: 6443
    Port for the API Server to bind to.
--config string
    Path to a kubeadm configuration file.
--control-plane-endpoint string
    Specify a stable IP address or DNS name for the control plane.
--feature-gates string
    A set of key=value pairs that describe feature gates for various features. Options are:
-h, --help
    help for all
--image-repository string    Default: "k8s.gcr.io"
    Choose a container registry to pull control plane images from
--kubeconfig string    Default: "/etc/kubernetes/admin.conf"
    The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.
--kubernetes-version string    Default: "stable-1"
    Choose a specific Kubernetes version for the control plane.
--pod-network-cidr string
    Specify range of IP addresses for the pod network. If set, the control plane will automatically allocate CIDRs for every node.
--service-cidr string    Default: "10.96.0.0/12"
    Use alternative range of IP address for service VIPs.
--service-dns-domain string    Default: "cluster.local"
    Use alternative domain for services, e.g. "myorg.internal".
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
Install the CoreDNS addon to a Kubernetes cluster
Synopsis
Install the CoreDNS addon components via the API server. Please note that although the DNS server is deployed, it will not be scheduled until CNI is installed.
kubeadm init phase addon coredns [flags]
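Until a CNI plugin is installed, you can expect the CoreDNS Pods to stay Pending; one way to observe this, using the label kubeadm applies to the CoreDNS Deployment:
kubectl -n kube-system get pods -l k8s-app=kube-dns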
Options
--config string
    Path to a kubeadm configuration file.
--feature-gates string
    A set of key=value pairs that describe feature gates for various features. Options are:
-h, --help
    help for coredns
--image-repository string    Default: "k8s.gcr.io"
    Choose a container registry to pull control plane images from
--kubeconfig string    Default: "/etc/kubernetes/admin.conf"
    The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.
--kubernetes-version string    Default: "stable-1"
    Choose a specific Kubernetes version for the control plane.
--service-cidr string    Default: "10.96.0.0/12"
    Use alternative range of IP address for service VIPs.
--service-dns-domain string    Default: "cluster.local"
    Use alternative domain for services, e.g. "myorg.internal".
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
Install the kube-proxy addon to a Kubernetes cluster
Synopsis
Install the kube-proxy addon components via the API server.
kubeadm init phase addon kube-proxy [flags]
Options
--apiserver-advertise-address string
    The IP address the API Server will advertise it's listening on. If not set the default network interface will be used.
--apiserver-bind-port int32    Default: 6443
    Port for the API Server to bind to.
--config string
    Path to a kubeadm configuration file.
--control-plane-endpoint string
    Specify a stable IP address or DNS name for the control plane.
-h, --help
    help for kube-proxy
--image-repository string    Default: "k8s.gcr.io"
    Choose a container registry to pull control plane images from
--kubeconfig string    Default: "/etc/kubernetes/admin.conf"
    The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.
--kubernetes-version string    Default: "stable-1"
    Choose a specific Kubernetes version for the control plane.
--pod-network-cidr string
    Specify range of IP addresses for the pod network. If set, the control plane will automatically allocate CIDRs for every node.
Options inherited from parent commands
--rootfs string
    [EXPERIMENTAL] The path to the 'real' host root filesystem.
For more details on each field in the v1beta3 configuration, see the API reference pages.
What's next
- kubeadm init to bootstrap a Kubernetes control-plane node
- kubeadm join to connect a node to the cluster
- kubeadm reset to revert any changes made to this host by kubeadm init or kubeadm join
- kubeadm alpha to try experimental functionality
6.9.1.12 - kubeadm join phase
kubeadm join phase enables you to invoke atomic steps of the join process. Hence, you can let kubeadm do some of the work and fill in the gaps yourself if you wish to apply customization.
kubeadm join phase is consistent with the kubeadm join workflow, and behind the scenes both use the same code.
kubeadm join phase
Use this command to invoke a single phase of the join workflow
Synopsis
Use this command to invoke a single phase of the join workflow
Options
-h, --help
help for phase
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
kubeadm join phase preflight
Using this phase you can execute preflight checks on a joining node.
Run join pre-flight checks
Synopsis
Run pre-flight checks for kubeadm join.
kubeadm join phase preflight [api-server-endpoint] [flags]
Examples
# Run join pre-flight checks using a config file.
kubeadm join phase preflight --config kubeadm-config.yaml
Options
--apiserver-advertise-address string
If the node should host a new control plane instance, the IP address the API Server will advertise it is listening on. If not set the default network interface will be used.

--apiserver-bind-port int32     Default: 6443
If the node should host a new control plane instance, the port for the API Server to bind to.

--certificate-key string
Use this key to decrypt the certificate secrets uploaded by init.

--config string
Path to kubeadm config file.

--control-plane
Create a new control plane instance on this node.

--cri-socket string
Path to the CRI socket to connect to. If empty kubeadm will try to auto-detect this value; use this option only if you have more than one CRI installed or if you have a non-standard CRI socket.

--discovery-file string
For file-based discovery, a file or URL from which to load cluster information.

--discovery-token string
For token-based discovery, the token used to validate cluster information fetched from the API server.

--discovery-token-ca-cert-hash strings
For token-based discovery, validate that the root CA public key matches this hash (format: "<type>:<value>").

--discovery-token-unsafe-skip-ca-verification
For token-based discovery, allow joining without --discovery-token-ca-cert-hash pinning.

-h, --help
help for preflight

--ignore-preflight-errors strings
A list of checks whose errors will be shown as warnings. Example: 'IsPrivilegedUser,Swap'. Value 'all' ignores errors from all checks.

--node-name string
Specify the node name.

--tls-bootstrap-token string
Specify the token used to temporarily authenticate with the Kubernetes Control Plane while joining the node.

--token string
Use this token for both discovery-token and tls-bootstrap-token when those values are not provided.
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
kubeadm join phase control-plane-prepare
Using this phase you can prepare a node for serving a control-plane.
Prepare the machine for serving a control plane
Synopsis
Prepare the machine for serving a control plane
kubeadm join phase control-plane-prepare [flags]
Examples
# Prepares the machine for serving a control plane
kubeadm join phase control-plane-prepare all
Options
-h, --help
help for control-plane-prepare
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
Prepare the machine for serving a control plane
Synopsis
Prepare the machine for serving a control plane
kubeadm join phase control-plane-prepare all [api-server-endpoint] [flags]
Options
--apiserver-advertise-address string
If the node should host a new control plane instance, the IP address the API Server will advertise it is listening on. If not set the default network interface will be used.

--apiserver-bind-port int32     Default: 6443
If the node should host a new control plane instance, the port for the API Server to bind to.

--certificate-key string
Use this key to decrypt the certificate secrets uploaded by init.

--config string
Path to kubeadm config file.

--control-plane
Create a new control plane instance on this node.

--discovery-file string
For file-based discovery, a file or URL from which to load cluster information.

--discovery-token string
For token-based discovery, the token used to validate cluster information fetched from the API server.

--discovery-token-ca-cert-hash strings
For token-based discovery, validate that the root CA public key matches this hash (format: "<type>:<value>").

--discovery-token-unsafe-skip-ca-verification
For token-based discovery, allow joining without --discovery-token-ca-cert-hash pinning.

-h, --help
help for all

--node-name string
Specify the node name.

--patches string
Path to a directory that contains files named "target[suffix][+patchtype].extension". For example, "kube-apiserver0+merge.yaml" or just "etcd.json". "target" can be one of "kube-apiserver", "kube-controller-manager", "kube-scheduler", "etcd". "patchtype" can be one of "strategic", "merge" or "json" and they match the patch formats supported by kubectl. The default "patchtype" is "strategic". "extension" must be either "json" or "yaml". "suffix" is an optional string that can be used to determine which patches are applied first alpha-numerically.

--tls-bootstrap-token string
Specify the token used to temporarily authenticate with the Kubernetes Control Plane while joining the node.

--token string
Use this token for both discovery-token and tls-bootstrap-token when those values are not provided.
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
[EXPERIMENTAL] Download certificates shared among control-plane nodes from the kubeadm-certs Secret
Synopsis
[EXPERIMENTAL] Download certificates shared among control-plane nodes from the kubeadm-certs Secret
kubeadm join phase control-plane-prepare download-certs [api-server-endpoint] [flags]
Options
--certificate-key string
Use this key to decrypt the certificate secrets uploaded by init.

--config string
Path to kubeadm config file.

--control-plane
Create a new control plane instance on this node.

--discovery-file string
For file-based discovery, a file or URL from which to load cluster information.

--discovery-token string
For token-based discovery, the token used to validate cluster information fetched from the API server.

--discovery-token-ca-cert-hash strings
For token-based discovery, validate that the root CA public key matches this hash (format: "<type>:<value>").

--discovery-token-unsafe-skip-ca-verification
For token-based discovery, allow joining without --discovery-token-ca-cert-hash pinning.

-h, --help
help for download-certs

--tls-bootstrap-token string
Specify the token used to temporarily authenticate with the Kubernetes Control Plane while joining the node.

--token string
Use this token for both discovery-token and tls-bootstrap-token when those values are not provided.
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
Generate the certificates for the new control plane components
Synopsis
Generate the certificates for the new control plane components
kubeadm join phase control-plane-prepare certs [api-server-endpoint] [flags]
Options
--apiserver-advertise-address string
If the node should host a new control plane instance, the IP address the API Server will advertise it is listening on. If not set the default network interface will be used.

--config string
Path to kubeadm config file.

--control-plane
Create a new control plane instance on this node.

--discovery-file string
For file-based discovery, a file or URL from which to load cluster information.

--discovery-token string
For token-based discovery, the token used to validate cluster information fetched from the API server.

--discovery-token-ca-cert-hash strings
For token-based discovery, validate that the root CA public key matches this hash (format: "<type>:<value>").

--discovery-token-unsafe-skip-ca-verification
For token-based discovery, allow joining without --discovery-token-ca-cert-hash pinning.

-h, --help
help for certs

--node-name string
Specify the node name.

--tls-bootstrap-token string
Specify the token used to temporarily authenticate with the Kubernetes Control Plane while joining the node.

--token string
Use this token for both discovery-token and tls-bootstrap-token when those values are not provided.
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
Generate the kubeconfig for the new control plane components
Synopsis
Generate the kubeconfig for the new control plane components
kubeadm join phase control-plane-prepare kubeconfig [api-server-endpoint] [flags]
Options
--certificate-key string
Use this key to decrypt the certificate secrets uploaded by init.

--config string
Path to kubeadm config file.

--control-plane
Create a new control plane instance on this node.

--discovery-file string
For file-based discovery, a file or URL from which to load cluster information.

--discovery-token string
For token-based discovery, the token used to validate cluster information fetched from the API server.

--discovery-token-ca-cert-hash strings
For token-based discovery, validate that the root CA public key matches this hash (format: "<type>:<value>").

--discovery-token-unsafe-skip-ca-verification
For token-based discovery, allow joining without --discovery-token-ca-cert-hash pinning.

-h, --help
help for kubeconfig

--tls-bootstrap-token string
Specify the token used to temporarily authenticate with the Kubernetes Control Plane while joining the node.

--token string
Use this token for both discovery-token and tls-bootstrap-token when those values are not provided.
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
Generate the manifests for the new control plane components
Synopsis
Generate the manifests for the new control plane components
kubeadm join phase control-plane-prepare control-plane [flags]
Options
--apiserver-advertise-address string
If the node should host a new control plane instance, the IP address the API Server will advertise it is listening on. If not set the default network interface will be used.

--apiserver-bind-port int32     Default: 6443
If the node should host a new control plane instance, the port for the API Server to bind to.

--config string
Path to kubeadm config file.

--control-plane
Create a new control plane instance on this node.

-h, --help
help for control-plane

--patches string
Path to a directory that contains files named "target[suffix][+patchtype].extension". For example, "kube-apiserver0+merge.yaml" or just "etcd.json". "target" can be one of "kube-apiserver", "kube-controller-manager", "kube-scheduler", "etcd". "patchtype" can be one of "strategic", "merge" or "json" and they match the patch formats supported by kubectl. The default "patchtype" is "strategic". "extension" must be either "json" or "yaml". "suffix" is an optional string that can be used to determine which patches are applied first alpha-numerically.
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
kubeadm join phase kubelet-start
Using this phase you can write the kubelet settings, certificates and (re)start the kubelet.
Write kubelet settings, certificates and (re)start the kubelet
Synopsis
Write a file with KubeletConfiguration and an environment file with node specific kubelet settings, and then (re)start kubelet.
kubeadm join phase kubelet-start [api-server-endpoint] [flags]
Options
--config string
Path to kubeadm config file.

--cri-socket string
Path to the CRI socket to connect to. If empty kubeadm will try to auto-detect this value; use this option only if you have more than one CRI installed or if you have a non-standard CRI socket.

--discovery-file string
For file-based discovery, a file or URL from which to load cluster information.

--discovery-token string
For token-based discovery, the token used to validate cluster information fetched from the API server.

--discovery-token-ca-cert-hash strings
For token-based discovery, validate that the root CA public key matches this hash (format: "<type>:<value>").

--discovery-token-unsafe-skip-ca-verification
For token-based discovery, allow joining without --discovery-token-ca-cert-hash pinning.

-h, --help
help for kubelet-start

--node-name string
Specify the node name.

--tls-bootstrap-token string
Specify the token used to temporarily authenticate with the Kubernetes Control Plane while joining the node.

--token string
Use this token for both discovery-token and tls-bootstrap-token when those values are not provided.
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
kubeadm join phase control-plane-join
Using this phase you can join a node as a control-plane instance.
Join a machine as a control plane instance
Synopsis
Join a machine as a control plane instance
kubeadm join phase control-plane-join [flags]
Examples
# Joins a machine as a control plane instance
kubeadm join phase control-plane-join all
Options
-h, --help
help for control-plane-join
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
Join a machine as a control plane instance
Synopsis
Join a machine as a control plane instance
kubeadm join phase control-plane-join all [flags]
Options
--apiserver-advertise-address string
If the node should host a new control plane instance, the IP address the API Server will advertise it is listening on. If not set the default network interface will be used.

--config string
Path to kubeadm config file.

--control-plane
Create a new control plane instance on this node.

-h, --help
help for all

--node-name string
Specify the node name.

--patches string
Path to a directory that contains files named "target[suffix][+patchtype].extension". For example, "kube-apiserver0+merge.yaml" or just "etcd.json". "target" can be one of "kube-apiserver", "kube-controller-manager", "kube-scheduler", "etcd". "patchtype" can be one of "strategic", "merge" or "json" and they match the patch formats supported by kubectl. The default "patchtype" is "strategic". "extension" must be either "json" or "yaml". "suffix" is an optional string that can be used to determine which patches are applied first alpha-numerically.
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
Add a new local etcd member
Synopsis
Add a new local etcd member
kubeadm join phase control-plane-join etcd [flags]
Options
--apiserver-advertise-address string
If the node should host a new control plane instance, the IP address the API Server will advertise it is listening on. If not set the default network interface will be used.

--config string
Path to kubeadm config file.

--control-plane
Create a new control plane instance on this node.

-h, --help
help for etcd

--node-name string
Specify the node name.

--patches string
Path to a directory that contains files named "target[suffix][+patchtype].extension". For example, "kube-apiserver0+merge.yaml" or just "etcd.json". "target" can be one of "kube-apiserver", "kube-controller-manager", "kube-scheduler", "etcd". "patchtype" can be one of "strategic", "merge" or "json" and they match the patch formats supported by kubectl. The default "patchtype" is "strategic". "extension" must be either "json" or "yaml". "suffix" is an optional string that can be used to determine which patches are applied first alpha-numerically.
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
Register the new control-plane node into the ClusterStatus maintained in the kubeadm-config ConfigMap (DEPRECATED)
Synopsis
Register the new control-plane node into the ClusterStatus maintained in the kubeadm-config ConfigMap (DEPRECATED)
kubeadm join phase control-plane-join update-status [flags]
Options
--apiserver-advertise-address string
If the node should host a new control plane instance, the IP address the API Server will advertise it is listening on. If not set the default network interface will be used.

--config string
Path to kubeadm config file.

--control-plane
Create a new control plane instance on this node.

-h, --help
help for update-status

--node-name string
Specify the node name.
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
Mark a node as a control-plane
Synopsis
Mark a node as a control-plane
kubeadm join phase control-plane-join mark-control-plane [flags]
Options
--config string
Path to kubeadm config file.

--control-plane
Create a new control plane instance on this node.

-h, --help
help for mark-control-plane

--node-name string
Specify the node name.
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
What's next
- kubeadm init to bootstrap a Kubernetes control-plane node
- kubeadm join to connect a node to the cluster
- kubeadm reset to revert any changes made to this host by kubeadm init or kubeadm join
- kubeadm alpha to try experimental functionality
6.9.1.13 - kubeadm kubeconfig
kubeadm kubeconfig
provides utilities for managing kubeconfig files.
For examples on how to use kubeadm kubeconfig user
see
Generating kubeconfig files for additional users.
kubeadm kubeconfig
Kubeconfig file utilities
Synopsis
Kubeconfig file utilities.
Options
-h, --help
help for kubeconfig
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
kubeadm kubeconfig user
This command can be used to output a kubeconfig file for an additional user.
Output a kubeconfig file for an additional user
Synopsis
Output a kubeconfig file for an additional user.
kubeadm kubeconfig user [flags]
Examples
# Output a kubeconfig file for an additional user named foo using a kubeadm config file bar
kubeadm kubeconfig user --client-name=foo --config=bar
Options
--client-name string
The name of the user. It will be used as the CN if client certificates are created.

--config string
Path to a kubeadm configuration file.

-h, --help
help for user

--org strings
The organizations of the client certificate. It will be used as the O if client certificates are created.

--token string
The token that should be used as the authentication mechanism for this kubeconfig, instead of client certificates.

--validity-period duration     Default: 8760h0m0s
The validity period of the client certificate. It is an offset from the current time.
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
6.9.1.14 - kubeadm reset phase
kubeadm reset phase enables you to invoke atomic steps of the node reset process. Hence, you can let kubeadm do some of the work and fill in the gaps yourself if you wish to apply customization.
kubeadm reset phase is consistent with the kubeadm reset workflow, and behind the scenes both use the same code.
kubeadm reset phase
Use this command to invoke a single phase of the reset workflow
Synopsis
Use this command to invoke a single phase of the reset workflow
Options
-h, --help
help for phase
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
kubeadm reset phase preflight
Using this phase you can execute preflight checks on a node that is being reset.
Run reset pre-flight checks
Synopsis
Run pre-flight checks for kubeadm reset.
kubeadm reset phase preflight [flags]
Options
-f, --force
Reset the node without prompting for confirmation.

-h, --help
help for preflight

--ignore-preflight-errors strings
A list of checks whose errors will be shown as warnings. Example: 'IsPrivilegedUser,Swap'. Value 'all' ignores errors from all checks.
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
kubeadm reset phase remove-etcd-member
Using this phase you can remove this control-plane node's etcd member from the etcd cluster.
Remove a local etcd member.
Synopsis
Remove a local etcd member for a control plane node.
kubeadm reset phase remove-etcd-member [flags]
Options
-h, --help
help for remove-etcd-member

--kubeconfig string     Default: "/etc/kubernetes/admin.conf"
The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
kubeadm reset phase cleanup-node
Using this phase you can perform cleanup on this node.
Run cleanup node.
Synopsis
Run cleanup node.
kubeadm reset phase cleanup-node [flags]
Options
--cert-dir string     Default: "/etc/kubernetes/pki"
The path to the directory where the certificates are stored. If specified, clean this directory.

--cri-socket string
Path to the CRI socket to connect to. If empty kubeadm will try to auto-detect this value; use this option only if you have more than one CRI installed or if you have a non-standard CRI socket.

-h, --help
help for cleanup-node
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
What's next
- kubeadm init to bootstrap a Kubernetes control-plane node
- kubeadm join to connect a node to the cluster
- kubeadm reset to revert any changes made to this host by kubeadm init or kubeadm join
- kubeadm alpha to try experimental functionality
6.9.1.15 - kubeadm upgrade phase
In v1.15.0, kubeadm introduced preliminary support for kubeadm upgrade node phases. Phases for other kubeadm upgrade sub-commands, such as apply, could be added in following releases.
kubeadm upgrade node phase
Using this phase you can choose to execute the separate steps of the upgrade of
secondary control-plane or worker nodes. Please note that kubeadm upgrade apply
still has to
be called on a primary control-plane node.
Use this command to invoke a single phase of the node workflow
Synopsis
Use this command to invoke a single phase of the node workflow
Options
-h, --help
help for phase
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
Run upgrade node pre-flight checks
Synopsis
Run pre-flight checks for kubeadm upgrade node.
kubeadm upgrade node phase preflight [flags]
Options
-h, --help
help for preflight

--ignore-preflight-errors strings
A list of checks whose errors will be shown as warnings. Example: 'IsPrivilegedUser,Swap'. Value 'all' ignores errors from all checks.
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
Upgrade the control plane instance deployed on this node, if any
Synopsis
Upgrade the control plane instance deployed on this node, if any
kubeadm upgrade node phase control-plane [flags]
Options
--certificate-renewal     Default: true
Perform the renewal of certificates used by components changed during upgrades.

--dry-run
Do not change any state, just output the actions that would be performed.

--etcd-upgrade     Default: true
Perform the upgrade of etcd.

-h, --help
help for control-plane

--kubeconfig string     Default: "/etc/kubernetes/admin.conf"
The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.

--patches string
Path to a directory that contains files named "target[suffix][+patchtype].extension". For example, "kube-apiserver0+merge.yaml" or just "etcd.json". "target" can be one of "kube-apiserver", "kube-controller-manager", "kube-scheduler", "etcd". "patchtype" can be one of "strategic", "merge" or "json" and they match the patch formats supported by kubectl. The default "patchtype" is "strategic". "extension" must be either "json" or "yaml". "suffix" is an optional string that can be used to determine which patches are applied first alpha-numerically.
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
Upgrade the kubelet configuration for this node
Synopsis
Download the kubelet configuration from a ConfigMap of the form "kubelet-config-1.X" in the cluster, where X is the minor version of the kubelet. kubeadm uses the KubernetesVersion field in the kubeadm-config ConfigMap to determine what the desired kubelet version is.
kubeadm upgrade node phase kubelet-config [flags]
Options
--dry-run
Do not change any state, just output the actions that would be performed.

-h, --help
help for kubelet-config

--kubeconfig string     Default: "/etc/kubernetes/admin.conf"
The kubeconfig file to use when talking to the cluster. If the flag is not set, a set of standard locations can be searched for an existing kubeconfig file.
Options inherited from parent commands
--rootfs string
[EXPERIMENTAL] The path to the 'real' host root filesystem.
What's next
- kubeadm init to bootstrap a Kubernetes control-plane node
- kubeadm join to connect a node to the cluster
- kubeadm reset to revert any changes made to this host by kubeadm init or kubeadm join
- kubeadm upgrade to upgrade a kubeadm node
- kubeadm alpha to try experimental functionality
6.9.1.16 - Implementation details
Kubernetes v1.10 [stable]
kubeadm init and kubeadm join together provide a nice user experience for creating a best-practice but bare Kubernetes cluster from scratch.
However, it might not be obvious how kubeadm does that.
This document provides additional details on what happens under the hood, with the aim of sharing knowledge on Kubernetes cluster best practices.
Core design principles
The cluster that kubeadm init and kubeadm join set up should be:
- Secure: It should adopt the latest best practices, like:
  - enforcing RBAC
  - using the Node Authorizer
  - using secure communication between the control plane components
  - using secure communication between the API server and the kubelets
  - locking down the kubelet API
  - locking down access to the API for system components like the kube-proxy and CoreDNS
  - locking down what a Bootstrap Token can access
- User-friendly: The user should not have to run anything more than a couple of commands:
kubeadm init
export KUBECONFIG=/etc/kubernetes/admin.conf
kubectl apply -f <network-of-choice.yaml>
kubeadm join --token <token> <endpoint>:<port>
- Extendable:
  - It should not favor any particular network provider. Configuring the cluster network is out-of-scope
  - It should provide the possibility to use a config file for customizing various parameters
Constants and well-known values and paths
In order to reduce complexity and to simplify development of higher level tools that build on top of kubeadm, kubeadm uses a limited set of constant values for well-known paths and file names.
The Kubernetes directory /etc/kubernetes is a constant in the application, since it is clearly the given path in a majority of cases, and the most intuitive location. Other constant paths and file names are:
- /etc/kubernetes/manifests as the path where the kubelet should look for static Pod manifests. Names of static Pod manifests are:
  - etcd.yaml
  - kube-apiserver.yaml
  - kube-controller-manager.yaml
  - kube-scheduler.yaml
- /etc/kubernetes/ as the path where kubeconfig files with identities for control plane components are stored. Names of kubeconfig files are:
  - kubelet.conf (bootstrap-kubelet.conf during TLS bootstrap)
  - controller-manager.conf
  - scheduler.conf
  - admin.conf for the cluster admin and kubeadm itself
- Names of certificate and key files:
  - ca.crt, ca.key for the Kubernetes certificate authority
  - apiserver.crt, apiserver.key for the API server certificate
  - apiserver-kubelet-client.crt, apiserver-kubelet-client.key for the client certificate used by the API server to connect to the kubelets securely
  - sa.pub, sa.key for the key used by the controller manager when signing ServiceAccount tokens
  - front-proxy-ca.crt, front-proxy-ca.key for the front proxy certificate authority
  - front-proxy-client.crt, front-proxy-client.key for the front proxy client
kubeadm init workflow internal design
The kubeadm init internal workflow consists of a sequence of atomic work tasks to perform, as described in kubeadm init.
The kubeadm init phase
command allows users to invoke each task individually, and ultimately offers a reusable and composable
API/toolbox that can be used by other Kubernetes bootstrap tools, by any IT automation tool or by an advanced user
for creating custom clusters.
Preflight checks
Kubeadm executes a set of preflight checks before starting the init, with the aim of verifying preconditions and avoiding common cluster startup problems.
The user can skip specific preflight checks or all of them with the --ignore-preflight-errors option.
- [warning] if the Kubernetes version to use (specified with the --kubernetes-version flag) is at least one minor version higher than the kubeadm CLI version
- Kubernetes system requirements:
  - if running on linux:
    - [error] if Kernel is older than the minimum required version
    - [error] if required cgroups subsystems aren't set up
  - if using docker:
    - [warning/error] if the Docker service does not exist, if it is disabled, if it is not active
    - [error] if the Docker endpoint does not exist or does not work
    - [warning] if the docker version is not in the list of validated docker versions
  - if using another CRI engine:
    - [error] if the crictl socket does not answer
- [error] if the user is not root
- [error] if the machine hostname is not a valid DNS subdomain
- [warning] if the host name cannot be reached via network lookup
- [error] if the kubelet version is lower than the minimum kubelet version supported by kubeadm (current minor -1)
- [error] if the kubelet version is at least one minor higher than the required control plane version (unsupported version skew)
- [warning] if the kubelet service does not exist or if it is disabled
- [warning] if firewalld is active
- [error] if the API server bindPort or ports 10250/10251/10252 are used
- [Error] if the /etc/kubernetes/manifests folder already exists and it is not empty
- [Error] if the /proc/sys/net/bridge/bridge-nf-call-iptables file does not exist/does not contain 1
- [Error] if the advertise address is IPv6 and /proc/sys/net/bridge/bridge-nf-call-ip6tables does not exist/does not contain 1
- [Error] if swap is on
- [Error] if conntrack, ip, iptables, mount, nsenter commands are not present in the command path
- [warning] if ebtables, ethtool, socat, tc, touch, crictl commands are not present in the command path
- [warning] if extra arg flags for the API server, controller manager, or scheduler contain some invalid options
- [warning] if the connection to https://API.AdvertiseAddress:API.BindPort goes through a proxy
- [warning] if the connection to the services subnet goes through a proxy (only first address checked)
- [warning] if the connection to the Pods subnet goes through a proxy (only first address checked)
- If external etcd is provided:
  - [Error] if the etcd version is older than the minimum required version
  - [Error] if etcd certificates or keys are specified, but not provided
- If external etcd is NOT provided (and thus local etcd will be installed):
  - [Error] if port 2379 is used
  - [Error] if the Etcd.DataDir folder already exists and it is not empty
- If authorization mode is ABAC:
  - [Error] if abac_policy.json does not exist
- If authorization mode is Webhook:
  - [Error] if webhook_authz.conf does not exist
Please note that:
- Preflight checks can be invoked individually with the kubeadm init phase preflight command (see the example below)
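For example, to run only the preflight checks while treating a failing swap check as a warning (the ignored check name is illustrative):
kubeadm init phase preflight --ignore-preflight-errors=Swap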
Generate the necessary certificates
Kubeadm generates certificate and private key pairs for different purposes:
- A self-signed certificate authority for the Kubernetes cluster saved into the ca.crt file and ca.key private key file
- A serving certificate for the API server, generated using ca.crt as the CA, and saved into the apiserver.crt file with its private key apiserver.key. This certificate should contain the following alternative names:
  - The Kubernetes service's internal clusterIP (the first address in the services CIDR, e.g. 10.96.0.1 if the service subnet is 10.96.0.0/12)
  - Kubernetes DNS names, e.g. kubernetes.default.svc.cluster.local if the --service-dns-domain flag value is cluster.local, plus default DNS names kubernetes.default.svc, kubernetes.default, kubernetes
  - The node-name
  - The --apiserver-advertise-address
  - Additional alternative names specified by the user
- A client certificate for the API server to connect to the kubelets securely, generated using ca.crt as the CA and saved into the apiserver-kubelet-client.crt file with its private key apiserver-kubelet-client.key. This certificate should be in the system:masters organization
- A private key for signing ServiceAccount Tokens saved into the sa.key file along with its public key sa.pub
- A certificate authority for the front proxy saved into the front-proxy-ca.crt file with its key front-proxy-ca.key
- A client certificate for the front proxy client, generated using front-proxy-ca.crt as the CA and saved into the front-proxy-client.crt file with its private key front-proxy-client.key
Certificates are stored by default in /etc/kubernetes/pki, but this directory is configurable using the --cert-dir flag.
Please note that:
- If a given certificate and private key pair both exist, and their content is evaluated as compliant with the above specs, the existing files will be used and the generation phase for the given certificate skipped. This means the user can, for example, copy an existing CA to /etc/kubernetes/pki/ca.{crt,key}, and then kubeadm will use those files for signing the rest of the certs (see the sketch below). See also using custom certificates
- For the CA only, it is possible to provide the ca.crt file but not the ca.key file; if all other certificates and kubeconfig files are already in place, kubeadm recognizes this condition and activates ExternalCA mode, which also implies that the csrsigner controller in controller-manager won't be started
- If kubeadm is running in external CA mode, all the certificates must be provided by the user, because kubeadm cannot generate them by itself
- If kubeadm is executed in --dry-run mode, certificate files are written in a temporary folder
- Certificate generation can be invoked individually with the kubeadm init phase certs all command
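A minimal sketch of reusing an existing CA (the source file names are illustrative):
# Copy an existing CA into the kubeadm certificate directory...
cp my-ca.crt /etc/kubernetes/pki/ca.crt
cp my-ca.key /etc/kubernetes/pki/ca.key
# ...and let kubeadm sign the remaining certificates with it.
kubeadm init phase certs all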
Generate kubeconfig files for control plane components
Kubeadm generates kubeconfig files with identities for control plane components:
- A kubeconfig file for the kubelet to use during TLS bootstrap, /etc/kubernetes/bootstrap-kubelet.conf. Inside this file there is a bootstrap token or embedded client certificates for authenticating this node with the cluster. This client certificate should:
  - Be in the system:nodes organization, as required by the Node Authorization module
  - Have the Common Name (CN) system:node:<hostname-lowercased>
- A kubeconfig file for the controller-manager, /etc/kubernetes/controller-manager.conf; inside this file is embedded a client certificate with the controller-manager identity. This client certificate should have the CN system:kube-controller-manager, as defined by the default RBAC core components roles
- A kubeconfig file for the scheduler, /etc/kubernetes/scheduler.conf; inside this file is embedded a client certificate with the scheduler identity. This client certificate should have the CN system:kube-scheduler, as defined by the default RBAC core components roles
Additionally, a kubeconfig file for kubeadm itself and the admin is generated and saved into the /etc/kubernetes/admin.conf file.
The "admin" here is defined as the actual person(s) administering the cluster who wants to have full control (root) over the cluster.
The embedded client certificate for the admin should be in the system:masters organization, as defined by default RBAC user facing role bindings. It should also include a CN. Kubeadm uses the kubernetes-admin CN.
Please note that:
- The ca.crt certificate is embedded in all the kubeconfig files
- If a given kubeconfig file exists, and its content is evaluated as compliant with the above specs, the existing file will be used and the generation phase for the given kubeconfig skipped
- If kubeadm is running in ExternalCA mode, all the required kubeconfig files must be provided by the user as well, because kubeadm cannot generate any of them by itself
- If kubeadm is executed in --dry-run mode, kubeconfig files are written in a temporary folder
- Kubeconfig file generation can be invoked individually with the kubeadm init phase kubeconfig all command
Generate static Pod manifests for control plane components
Kubeadm writes static Pod manifest files for control plane components to /etc/kubernetes/manifests. The kubelet watches this directory for Pods to create on startup.
Static Pod manifests share a set of common properties:
- All static Pods are deployed in the kube-system namespace
- All static Pods get tier:control-plane and component:{component-name} labels
- All static Pods use the system-node-critical priority class
- hostNetwork: true is set on all static Pods to allow control plane startup before a network is configured; as a consequence:
  - The address that the controller-manager and the scheduler use to refer to the API server is 127.0.0.1
  - If using a local etcd server, the etcd-servers address will be set to 127.0.0.1:2379
- Leader election is enabled for both the controller-manager and the scheduler
- The controller-manager and the scheduler will reference kubeconfig files with their respective, unique identities
- All static Pods get any extra flags specified by the user as described in passing custom arguments to control plane components
- All static Pods get any extra Volumes specified by the user (Host path)
Please note that:
- All images will be pulled from k8s.gcr.io by default. See using custom images for customizing the image repository
- If kubeadm is executed in --dry-run mode, static Pod files are written in a temporary folder
- Static Pod manifest generation for control plane components can be invoked individually with the kubeadm init phase control-plane all command (see the example below)
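For example, after this phase runs, the manifests directory contains one file per component, matching the names listed in the constants section above:
ls /etc/kubernetes/manifests
# etcd.yaml  kube-apiserver.yaml  kube-controller-manager.yaml  kube-scheduler.yaml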
API server
The static Pod manifest for the API server is affected by the following parameters provided by the users:
- The apiserver-advertise-address and apiserver-bind-port to bind to; if not provided, those values default to the IP address of the default network interface on the machine and port 6443
- The service-cluster-ip-range to use for services
- If an external etcd server is specified, the etcd-servers address and related TLS settings (etcd-cafile, etcd-certfile, etcd-keyfile); if an external etcd server is not provided, a local etcd will be used (via host network)
- If a cloud provider is specified, the corresponding --cloud-provider is configured, together with the --cloud-config path if such file exists (this is experimental, alpha and will be removed in a future version)
Other API server flags that are set unconditionally are:
- --insecure-port=0 to avoid insecure connections to the api server
- --enable-bootstrap-token-auth=true to enable the BootstrapTokenAuthenticator authentication module. See TLS Bootstrapping for more details
- --allow-privileged to true (required e.g. by kube proxy)
- --requestheader-client-ca-file to front-proxy-ca.crt
- --enable-admission-plugins to:
  - NamespaceLifecycle e.g. to avoid deletion of system reserved namespaces
  - LimitRanger and ResourceQuota to enforce limits on namespaces
  - ServiceAccount to enforce service account automation
  - PersistentVolumeLabel attaches region or zone labels to PersistentVolumes as defined by the cloud provider (This admission controller is deprecated and will be removed in a future version. It is not deployed by kubeadm by default with v1.9 onwards when not explicitly opting into using gce or aws as cloud providers)
  - DefaultStorageClass to enforce default storage class on PersistentVolumeClaim objects
  - DefaultTolerationSeconds
  - NodeRestriction to limit what a kubelet can modify (e.g. only pods on this node)
- --kubelet-preferred-address-types to InternalIP,ExternalIP,Hostname; this makes kubectl logs and other API server-kubelet communication work in environments where the hostnames of the nodes aren't resolvable
- Flags for using certificates generated in previous steps:
  - --client-ca-file to ca.crt
  - --tls-cert-file to apiserver.crt
  - --tls-private-key-file to apiserver.key
  - --kubelet-client-certificate to apiserver-kubelet-client.crt
  - --kubelet-client-key to apiserver-kubelet-client.key
  - --service-account-key-file to sa.pub
  - --requestheader-client-ca-file to front-proxy-ca.crt
  - --proxy-client-cert-file to front-proxy-client.crt
  - --proxy-client-key-file to front-proxy-client.key
- Other flags for securing the front proxy (API Aggregation) communications:
  - --requestheader-username-headers=X-Remote-User
  - --requestheader-group-headers=X-Remote-Group
  - --requestheader-extra-headers-prefix=X-Remote-Extra-
  - --requestheader-allowed-names=front-proxy-client
Controller manager
The static Pod manifest for the controller manager is affected by the following parameters provided by the users:
- If kubeadm is invoked specifying a --pod-network-cidr, the subnet manager feature required for some CNI network plugins is enabled by setting:
  - --allocate-node-cidrs=true
  - --cluster-cidr and --node-cidr-mask-size flags according to the given CIDR
- If a cloud provider is specified, the corresponding --cloud-provider is specified, together with the --cloud-config path if such configuration file exists (this is experimental, alpha and will be removed in a future version)

Other flags that are set unconditionally are:
- --controllers enabling all the default controllers plus the BootstrapSigner and TokenCleaner controllers for TLS bootstrap. See TLS Bootstrapping for more details
- --use-service-account-credentials to true
- Flags for using certificates generated in previous steps:
  - --root-ca-file to ca.crt
  - --cluster-signing-cert-file to ca.crt, if External CA mode is disabled, otherwise to ""
  - --cluster-signing-key-file to ca.key, if External CA mode is disabled, otherwise to ""
  - --service-account-private-key-file to sa.key
Scheduler
The static Pod manifest for the scheduler is not affected by parameters provided by the users.
Generate static Pod manifest for local etcd
If the user specified an external etcd this step will be skipped, otherwise kubeadm generates a static Pod manifest file for creating a local etcd instance running in a Pod with the following attributes:
- listen on localhost:2379 and use HostNetwork=true
- make a hostPath mount out from the dataDir to the host's filesystem
- Any extra flags specified by the user
Please note that:
- The etcd image will be pulled from k8s.gcr.io by default. See using custom images for customizing the image repository
- If kubeadm is executed in --dry-run mode, the etcd static Pod manifest is written in a temporary folder
- Static Pod manifest generation for local etcd can be invoked individually with the kubeadm init phase etcd local command
Wait for the control plane to come up
kubeadm waits (up to 4m0s) until localhost:6443/healthz (kube-apiserver liveness) returns ok. However, in order to detect deadlock conditions, kubeadm fails fast if localhost:10255/healthz (kubelet liveness) or localhost:10255/healthz/syncloop (kubelet readiness) don't return ok within 40s and 60s respectively.
kubeadm relies on the kubelet to pull the control plane images and run them properly as static Pods. After the control plane is up, kubeadm completes the tasks described in following paragraphs.
Save the kubeadm ClusterConfiguration in a ConfigMap for later reference
kubeadm saves the configuration passed to kubeadm init in a ConfigMap named kubeadm-config under the kube-system namespace.
This will ensure that kubeadm actions executed in the future (e.g. kubeadm upgrade) will be able to determine the actual/current cluster state and make new decisions based on that data.
Please note that:
- Before saving the ClusterConfiguration, sensitive information like the token is stripped from the configuration
- Upload of control plane node configuration can be invoked individually with the kubeadm init phase upload-config command (see the example below)
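For example, the stored configuration can be inspected afterwards with kubectl:
kubectl -n kube-system get configmap kubeadm-config -o yaml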
Mark the node as control-plane
As soon as the control plane is available, kubeadm executes the following actions:
- Labels the node as control-plane with node-role.kubernetes.io/master=""
- Taints the node with node-role.kubernetes.io/master:NoSchedule
Please note that:
- The mark-control-plane phase can be invoked individually with the kubeadm init phase mark-control-plane command (see the sketch below)
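A rough kubectl equivalent of the label and taint this phase applies (the node name is illustrative):
kubectl label node <node-name> node-role.kubernetes.io/master=
kubectl taint nodes <node-name> node-role.kubernetes.io/master:NoSchedule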
Configure TLS-Bootstrapping for node joining
Kubeadm uses Authenticating with Bootstrap Tokens for joining new nodes to an existing cluster; for more details see also design proposal.
kubeadm init ensures that everything is properly configured for this process; this includes the following steps as well as setting API server and controller flags as already described in previous paragraphs.
Please note that:
- TLS bootstrapping for nodes can be configured with the kubeadm init phase bootstrap-token command, executing all the configuration steps described in the following paragraphs; alternatively, each step can be invoked individually
Create a bootstrap token
kubeadm init creates a first bootstrap token, either generated automatically or provided by the user with the --token flag; as documented in the bootstrap token specification, the token should be saved as a secret with name bootstrap-token-<token-id> under the kube-system namespace.
Please note that:
- The default token created by kubeadm init will be used to validate temporary users during the TLS bootstrap process; those users will be members of the system:bootstrappers:kubeadm:default-node-token group
- The token has a limited validity, 24 hours by default (the interval may be changed with the --token-ttl flag)
- Additional tokens can be created with the kubeadm token command, which also provides other useful functions for token management (see the example below)
Allow joining nodes to call CSR API
Kubeadm ensures that users in the system:bootstrappers:kubeadm:default-node-token group are able to access the certificate signing API.
This is implemented by creating a ClusterRoleBinding named kubeadm:kubelet-bootstrap between the group above and the default RBAC role system:node-bootstrapper.
Setup auto approval for new bootstrap tokens
Kubeadm ensures that the Bootstrap Token will get its CSR request automatically approved by the csrapprover controller.
This is implemented by creating a ClusterRoleBinding named kubeadm:node-autoapprove-bootstrap between the system:bootstrappers:kubeadm:default-node-token group and the default role system:certificates.k8s.io:certificatesigningrequests:nodeclient.
The role system:certificates.k8s.io:certificatesigningrequests:nodeclient should be created as well, granting POST permission to /apis/certificates.k8s.io/certificatesigningrequests/nodeclient.
Setup nodes certificate rotation with auto approval
Kubeadm ensures that certificate rotation is enabled for nodes, and that new certificate requests for nodes will get their CSR requests automatically approved by the csrapprover controller.
This is implemented by creating a ClusterRoleBinding named kubeadm:node-autoapprove-certificate-rotation between the system:nodes group and the default role system:certificates.k8s.io:certificatesigningrequests:selfnodeclient.
Create the public cluster-info ConfigMap
This phase creates the cluster-info ConfigMap in the kube-public namespace.
Additionally it creates a Role and a RoleBinding granting access to the ConfigMap for unauthenticated users (i.e. users in the RBAC group system:unauthenticated).
Please note that:
- Access to the cluster-info ConfigMap is not rate-limited. This may or may not be a problem if you expose your cluster's API server to the internet; the worst-case scenario here is a DoS attack where an attacker uses all the in-flight requests the kube-apiserver can handle serving the cluster-info ConfigMap (see the example below)
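For example, with any kubeconfig the ConfigMap can be fetched via:
kubectl -n kube-public get configmap cluster-info -o yaml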
Install addons
Kubeadm installs the internal DNS server and the kube-proxy addon components via the API server. Please note that:
- This phase can be invoked individually with the kubeadm init phase addon all command.
proxy
A ServiceAccount for kube-proxy is created in the kube-system namespace; then kube-proxy is deployed as a DaemonSet:
- The credentials (ca.crt and token) to the control plane come from the ServiceAccount
- The location (URL) of the API server comes from a ConfigMap
- The kube-proxy ServiceAccount is bound to the privileges in the system:node-proxier ClusterRole
DNS
- The CoreDNS service is named kube-dns. This is done to prevent any interruption in service when the user is switching the cluster DNS from kube-dns to CoreDNS using the --config method described here.
- A ServiceAccount for CoreDNS is created in the kube-system namespace.
- The coredns ServiceAccount is bound to the privileges in the system:coredns ClusterRole
In Kubernetes version 1.21, support for using kube-dns with kubeadm was removed. You can use CoreDNS with kubeadm even when the related Service is named kube-dns.
kubeadm join phases internal design
Similarly to kubeadm init, the kubeadm join internal workflow also consists of a sequence of atomic work tasks to perform.
This is split into discovery (having the Node trust the Kubernetes Master) and TLS bootstrap (having the Kubernetes Master trust the Node).
See Authenticating with Bootstrap Tokens or the corresponding design proposal.
Preflight checks
kubeadm executes a set of preflight checks before starting the join, with the aim of verifying preconditions and avoiding common cluster startup problems.
Please note that:
- kubeadm join preflight checks are basically a subset of the kubeadm init preflight checks
- Starting from 1.9, kubeadm provides better support for CRI-generic functionality; in that case, docker specific controls are skipped or replaced by similar controls for crictl
- Starting from 1.9, kubeadm provides support for joining nodes running on Windows; in that case, linux specific controls are skipped
- In any case the user can skip specific preflight checks (or eventually all preflight checks) with the --ignore-preflight-errors option
Discovery cluster-info
There are 2 main schemes for discovery. The first is to use a shared token along with the IP address of the API server. The second is to provide a file (that is a subset of the standard kubeconfig file).
Shared token discovery
If kubeadm join is invoked with --discovery-token, token discovery is used; in this case the node basically retrieves the cluster CA certificates from the cluster-info ConfigMap in the kube-public namespace.
In order to prevent "man in the middle" attacks, several steps are taken:
- First, the CA certificate is retrieved via insecure connection (this is possible because kubeadm init granted access to cluster-info for system:unauthenticated users)
- Then the CA certificate goes through the following validation steps:
  - Basic validation: using the token ID against a JWT signature
  - Pub key validation: using the provided --discovery-token-ca-cert-hash. This value is available in the output of kubeadm init or can be calculated using standard tools (the hash is calculated over the bytes of the Subject Public Key Info (SPKI) object as in RFC7469). The --discovery-token-ca-cert-hash flag may be repeated multiple times to allow more than one public key (see the example below)
  - As an additional validation, the CA certificate is retrieved via secure connection and then compared with the CA retrieved initially
Please note that:
- Pub key validation can be skipped by passing the --discovery-token-unsafe-skip-ca-verification flag; this weakens the kubeadm security model since others can potentially impersonate the Kubernetes Master
File/https discovery
If kubeadm join is invoked with --discovery-file, file discovery is used; this file can be a local file or downloaded via an HTTPS URL; in case of HTTPS, the host installed CA bundle is used to verify the connection.
With file discovery, the cluster CA certificate is provided in the file itself; in fact, the discovery file is a kubeconfig file with only the server and certificate-authority-data attributes set, as described in the kubeadm join reference doc; when the connection with the cluster is established, kubeadm tries to access the cluster-info ConfigMap, and if available, uses it.
TLS Bootstrap
Once the cluster info is known, the file bootstrap-kubelet.conf is written, thus allowing the kubelet to do TLS Bootstrapping.
The TLS bootstrap mechanism uses the shared token to temporarily authenticate with the Kubernetes API server to submit a certificate signing request (CSR) for a locally created key pair.
The request is then automatically approved and the operation completes, saving the ca.crt file and kubelet.conf file to be used by the kubelet for joining the cluster, while bootstrap-kubelet.conf is deleted.
Please note that:
- The temporary authentication is validated against the token saved during the kubeadm init process (or with additional tokens created with kubeadm token)
- The temporary authentication resolves to a user that is a member of the system:bootstrappers:kubeadm:default-node-token group, which was granted access to the CSR API during the kubeadm init process
- The automatic CSR approval is managed by the csrapprover controller, according to the configuration done during the kubeadm init process
6.10 - Command line tool (kubectl)
Kubernetes provides a command line tool for communicating with a Kubernetes cluster's control plane, using the Kubernetes API.
This tool is named kubectl
.
For configuration, kubectl
looks for a file named config
in the $HOME/.kube
directory.
You can specify other kubeconfig
files by setting the KUBECONFIG
environment variable or by setting the
--kubeconfig
flag.
This overview covers kubectl
syntax, describes the command operations, and provides common examples.
For details about each command, including all the supported flags and subcommands, see the
kubectl reference documentation.
For installation instructions, see Installing kubectl;
for a quick guide, see the cheat sheet.
If you're used to using the docker
command-line tool, kubectl
for Docker Users explains some equivalent commands for Kubernetes.
Syntax
Use the following syntax to run kubectl
commands from your terminal window:
kubectl [command] [TYPE] [NAME] [flags]
where command, TYPE, NAME, and flags are:
- command: Specifies the operation that you want to perform on one or more resources, for example create, get, describe, delete.
- TYPE: Specifies the resource type. Resource types are case-insensitive and you can specify the singular, plural, or abbreviated forms. For example, the following commands produce the same output:
kubectl get pod pod1
kubectl get pods pod1
kubectl get po pod1
- NAME: Specifies the name of the resource. Names are case-sensitive. If the name is omitted, details for all resources are displayed, for example kubectl get pods.
  When performing an operation on multiple resources, you can specify each resource by type and name or specify one or more files:
  - To specify resources by type and name:
    - To group resources if they are all the same type: TYPE1 name1 name2 name<#>.
      Example: kubectl get pod example-pod1 example-pod2
    - To specify multiple resource types individually: TYPE1/name1 TYPE1/name2 TYPE2/name3 TYPE<#>/name<#>.
      Example: kubectl get pod/example-pod1 replicationcontroller/example-rc1
  - To specify resources with one or more files: -f file1 -f file2 -f file<#>
    Use YAML rather than JSON since YAML tends to be more user-friendly, especially for configuration files.
    Example: kubectl get -f ./pod.yaml
- flags: Specifies optional flags. For example, you can use the -s or --server flags to specify the address and port of the Kubernetes API server.
If you need help, run kubectl help
from the terminal window.
In-cluster authentication and namespace overrides
By default, kubectl first determines whether it is running within a pod, and thus in a cluster. It starts by checking for the KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT environment variables and for the existence of a service account token file at /var/run/secrets/kubernetes.io/serviceaccount/token. If all three are found, in-cluster authentication is assumed.
To maintain backwards compatibility, if the POD_NAMESPACE environment variable is set during in-cluster authentication, it overrides the default namespace from the service account token. Any manifests or tools relying on namespace defaulting are affected by this.
POD_NAMESPACE environment variable
If the POD_NAMESPACE environment variable is set, CLI operations on namespaced resources default to the variable value. For example, if the variable is set to seattle, kubectl get pods would return pods in the seattle namespace. This is because pods are a namespaced resource, and no namespace was provided in the command. Review the output of kubectl api-resources to determine if a resource is namespaced.
Explicit use of --namespace <value> overrides this behavior.
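A minimal sketch of that override, reusing the hypothetical seattle value from above:
# run in an environment where POD_NAMESPACE=seattle is set
kubectl get pods                        # defaults to the seattle namespace
kubectl get pods --namespace default    # the explicit flag wins over POD_NAMESPACE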
How kubectl handles ServiceAccount tokens
If:
- there is a Kubernetes service account token file mounted at /var/run/secrets/kubernetes.io/serviceaccount/token, and
- the KUBERNETES_SERVICE_HOST environment variable is set, and
- the KUBERNETES_SERVICE_PORT environment variable is set, and
- you don't explicitly specify a namespace on the kubectl command line,
then kubectl assumes it is running in your cluster. The kubectl tool looks up the
namespace of that ServiceAccount (this is the same as the namespace of the Pod)
and acts against that namespace. This is different from what happens outside of a
cluster; when kubectl runs outside a cluster and you don't specify a namespace,
the kubectl command acts against the default
namespace.
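To see the pieces kubectl checks for in-cluster detection, a sketch meant to be run inside a pod (the namespace file sits next to the token in the same mount):
echo "$KUBERNETES_SERVICE_HOST $KUBERNETES_SERVICE_PORT"       # injected by Kubernetes into every pod
cat /var/run/secrets/kubernetes.io/serviceaccount/namespace    # the ServiceAccount's (and Pod's) namespace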
Operations
The following table includes short descriptions and the general syntax for all of the kubectl
operations:
Operation | Syntax | Description
---|---|---
alpha | kubectl alpha SUBCOMMAND [flags] | List the available commands that correspond to alpha features, which are not enabled in Kubernetes clusters by default.
annotate | kubectl annotate (-f FILENAME | TYPE NAME | TYPE/NAME) KEY_1=VAL_1 ... KEY_N=VAL_N [--overwrite] [--all] [--resource-version=version] [flags] | Add or update the annotations of one or more resources.
api-resources | kubectl api-resources [flags] | List the API resources that are available.
api-versions | kubectl api-versions [flags] | List the API versions that are available.
apply | kubectl apply -f FILENAME [flags] | Apply a configuration change to a resource from a file or stdin.
attach | kubectl attach POD -c CONTAINER [-i] [-t] [flags] | Attach to a running container either to view the output stream or interact with the container (stdin).
auth | kubectl auth [flags] [options] | Inspect authorization.
autoscale | kubectl autoscale (-f FILENAME | TYPE NAME | TYPE/NAME) [--min=MINPODS] --max=MAXPODS [--cpu-percent=CPU] [flags] | Automatically scale the set of pods that are managed by a replication controller.
certificate | kubectl certificate SUBCOMMAND [options] | Modify certificate resources.
cluster-info | kubectl cluster-info [flags] | Display endpoint information about the master and services in the cluster.
completion | kubectl completion SHELL [options] | Output shell completion code for the specified shell (bash or zsh).
config | kubectl config SUBCOMMAND [flags] | Modifies kubeconfig files. See the individual subcommands for details.
convert | kubectl convert -f FILENAME [options] | Convert config files between different API versions. Both YAML and JSON formats are accepted. Note - requires the kubectl-convert plugin to be installed.
cordon | kubectl cordon NODE [options] | Mark node as unschedulable.
cp | kubectl cp <file-spec-src> <file-spec-dest> [options] | Copy files and directories to and from containers.
create | kubectl create -f FILENAME [flags] | Create one or more resources from a file or stdin.
delete | kubectl delete (-f FILENAME | TYPE [NAME | /NAME | -l label | --all]) [flags] | Delete resources either from a file, stdin, or specifying label selectors, names, resource selectors, or resources.
describe | kubectl describe (-f FILENAME | TYPE [NAME_PREFIX | /NAME | -l label]) [flags] | Display the detailed state of one or more resources.
diff | kubectl diff -f FILENAME [flags] | Diff file or stdin against live configuration.
drain | kubectl drain NODE [options] | Drain node in preparation for maintenance.
edit | kubectl edit (-f FILENAME | TYPE NAME | TYPE/NAME) [flags] | Edit and update the definition of one or more resources on the server by using the default editor.
exec | kubectl exec POD [-c CONTAINER] [-i] [-t] [flags] [-- COMMAND [args...]] | Execute a command against a container in a pod.
explain | kubectl explain [--recursive=false] [flags] | Get documentation of various resources. For instance pods, nodes, services, etc.
expose | kubectl expose (-f FILENAME | TYPE NAME | TYPE/NAME) [--port=port] [--protocol=TCP|UDP] [--target-port=number-or-name] [--name=name] [--external-ip=external-ip-of-service] [--type=type] [flags] | Expose a replication controller, service, or pod as a new Kubernetes service.
get | kubectl get (-f FILENAME | TYPE [NAME | /NAME | -l label]) [--watch] [--sort-by=FIELD] [[-o | --output]=OUTPUT_FORMAT] [flags] | List one or more resources.
kustomize | kubectl kustomize <dir> [flags] [options] | List a set of API resources generated from instructions in a kustomization.yaml file. The argument must be the path to the directory containing the file, or a git repository URL with a path suffix specifying same with respect to the repository root.
label | kubectl label (-f FILENAME | TYPE NAME | TYPE/NAME) KEY_1=VAL_1 ... KEY_N=VAL_N [--overwrite] [--all] [--resource-version=version] [flags] | Add or update the labels of one or more resources.
logs | kubectl logs POD [-c CONTAINER] [--follow] [flags] | Print the logs for a container in a pod.
options | kubectl options | List of global command-line options, which apply to all commands.
patch | kubectl patch (-f FILENAME | TYPE NAME | TYPE/NAME) --patch PATCH [flags] | Update one or more fields of a resource by using the strategic merge patch process.
plugin | kubectl plugin [flags] [options] | Provides utilities for interacting with plugins.
port-forward | kubectl port-forward POD [LOCAL_PORT:]REMOTE_PORT [...[LOCAL_PORT_N:]REMOTE_PORT_N] [flags] | Forward one or more local ports to a pod.
proxy | kubectl proxy [--port=PORT] [--www=static-dir] [--www-prefix=prefix] [--api-prefix=prefix] [flags] | Run a proxy to the Kubernetes API server.
replace | kubectl replace -f FILENAME | Replace a resource from a file or stdin.
rollout | kubectl rollout SUBCOMMAND [options] | Manage the rollout of a resource. Valid resource types include: deployments, daemonsets and statefulsets.
run | kubectl run NAME --image=image [--env="key=value"] [--port=port] [--dry-run=server|client|none] [--overrides=inline-json] [flags] | Run a specified image on the cluster.
scale | kubectl scale (-f FILENAME | TYPE NAME | TYPE/NAME) --replicas=COUNT [--resource-version=version] [--current-replicas=count] [flags] | Update the size of the specified replication controller.
set | kubectl set SUBCOMMAND [options] | Configure application resources.
taint | kubectl taint NODE NAME KEY_1=VAL_1:TAINT_EFFECT_1 ... KEY_N=VAL_N:TAINT_EFFECT_N [options] | Update the taints on one or more nodes.
top | kubectl top [flags] [options] | Display Resource (CPU/Memory/Storage) usage.
uncordon | kubectl uncordon NODE [options] | Mark node as schedulable.
version | kubectl version [--client] [flags] | Display the Kubernetes version running on the client and server.
wait | kubectl wait ([-f FILENAME] | resource.group/resource.name | resource.group [(-l label | --all)]) [--for=delete|--for condition=available] [options] | Experimental: Wait for a specific condition on one or many resources.
To learn more about command operations, see the kubectl reference documentation.
Resource types
The following table includes a list of all the supported resource types and their abbreviated aliases.
(This output can be retrieved from kubectl api-resources, and was accurate as of Kubernetes 1.19.1.)
NAME | SHORTNAMES | APIGROUP | NAMESPACED | KIND
---|---|---|---|---
bindings | | | true | Binding
componentstatuses | cs | | false | ComponentStatus
configmaps | cm | | true | ConfigMap
endpoints | ep | | true | Endpoints
events | ev | | true | Event
limitranges | limits | | true | LimitRange
namespaces | ns | | false | Namespace
nodes | no | | false | Node
persistentvolumeclaims | pvc | | true | PersistentVolumeClaim
persistentvolumes | pv | | false | PersistentVolume
pods | po | | true | Pod
podtemplates | | | true | PodTemplate
replicationcontrollers | rc | | true | ReplicationController
resourcequotas | quota | | true | ResourceQuota
secrets | | | true | Secret
serviceaccounts | sa | | true | ServiceAccount
services | svc | | true | Service
mutatingwebhookconfigurations | | admissionregistration.k8s.io | false | MutatingWebhookConfiguration
validatingwebhookconfigurations | | admissionregistration.k8s.io | false | ValidatingWebhookConfiguration
customresourcedefinitions | crd,crds | apiextensions.k8s.io | false | CustomResourceDefinition
apiservices | | apiregistration.k8s.io | false | APIService
controllerrevisions | | apps | true | ControllerRevision
daemonsets | ds | apps | true | DaemonSet
deployments | deploy | apps | true | Deployment
replicasets | rs | apps | true | ReplicaSet
statefulsets | sts | apps | true | StatefulSet
tokenreviews | | authentication.k8s.io | false | TokenReview
localsubjectaccessreviews | | authorization.k8s.io | true | LocalSubjectAccessReview
selfsubjectaccessreviews | | authorization.k8s.io | false | SelfSubjectAccessReview
selfsubjectrulesreviews | | authorization.k8s.io | false | SelfSubjectRulesReview
subjectaccessreviews | | authorization.k8s.io | false | SubjectAccessReview
horizontalpodautoscalers | hpa | autoscaling | true | HorizontalPodAutoscaler
cronjobs | cj | batch | true | CronJob
jobs | | batch | true | Job
certificatesigningrequests | csr | certificates.k8s.io | false | CertificateSigningRequest
leases | | coordination.k8s.io | true | Lease
endpointslices | | discovery.k8s.io | true | EndpointSlice
events | ev | events.k8s.io | true | Event
ingresses | ing | extensions | true | Ingress
flowschemas | | flowcontrol.apiserver.k8s.io | false | FlowSchema
prioritylevelconfigurations | | flowcontrol.apiserver.k8s.io | false | PriorityLevelConfiguration
ingressclasses | | networking.k8s.io | false | IngressClass
ingresses | ing | networking.k8s.io | true | Ingress
networkpolicies | netpol | networking.k8s.io | true | NetworkPolicy
runtimeclasses | | node.k8s.io | false | RuntimeClass
poddisruptionbudgets | pdb | policy | true | PodDisruptionBudget
podsecuritypolicies | psp | policy | false | PodSecurityPolicy
clusterrolebindings | | rbac.authorization.k8s.io | false | ClusterRoleBinding
clusterroles | | rbac.authorization.k8s.io | false | ClusterRole
rolebindings | | rbac.authorization.k8s.io | true | RoleBinding
roles | | rbac.authorization.k8s.io | true | Role
priorityclasses | pc | scheduling.k8s.io | false | PriorityClass
csidrivers | | storage.k8s.io | false | CSIDriver
csinodes | | storage.k8s.io | false | CSINode
storageclasses | sc | storage.k8s.io | false | StorageClass
volumeattachments | | storage.k8s.io | false | VolumeAttachment
Output options
Use the following sections for information about how you can format or sort the output of certain commands. For details about which commands support the various output options, see the kubectl reference documentation.
Formatting output
The default output format for all kubectl commands is the human-readable plain-text format. To output details to your terminal window in a specific format, you can add either the -o or --output flag to a supported kubectl command.
Syntax
kubectl [command] [TYPE] [NAME] -o <output_format>
Depending on the kubectl operation, the following output formats are supported:
Output format | Description
---|---
-o custom-columns=<spec> | Print a table using a comma-separated list of custom columns.
-o custom-columns-file=<filename> | Print a table using the custom columns template in the <filename> file.
-o json | Output a JSON formatted API object.
-o jsonpath=<template> | Print the fields defined in a jsonpath expression.
-o jsonpath-file=<filename> | Print the fields defined by the jsonpath expression in the <filename> file.
-o name | Print only the resource name and nothing else.
-o wide | Output in the plain-text format with any additional information. For pods, the node name is included.
-o yaml | Output a YAML formatted API object.
Example
In this example, the following command outputs the details for a single pod as a YAML formatted object:
kubectl get pod web-pod-13je7 -o yaml
Remember: See the kubectl reference documentation for details about which output format is supported by each command.
Custom columns
To define custom columns and output only the details that you want into a table, you can use the custom-columns option.
You can choose to define the custom columns inline or use a template file: -o custom-columns=<spec> or -o custom-columns-file=<filename>.
Examples
Inline:
kubectl get pods <pod-name> -o custom-columns=NAME:.metadata.name,RSRC:.metadata.resourceVersion
Template file:
kubectl get pods <pod-name> -o custom-columns-file=template.txt
where the template.txt file contains:
NAME RSRC
metadata.name metadata.resourceVersion
The result of running either command is similar to:
NAME RSRC
submit-queue 610995
Server-side columns
kubectl supports receiving specific column information from the server about objects.
This means that for any given resource, the server will return columns and rows relevant to that resource, for the client to print.
This allows for consistent human-readable output across clients used against the same cluster, by having the server encapsulate the details of printing.
This feature is enabled by default. To disable it, add the --server-print=false flag to the kubectl get command.
Examples
To print information about the status of a pod, use a command like the following:
kubectl get pods <pod-name> --server-print=false
The output is similar to:
NAME AGE
pod-name 1m
Sorting list objects
To output objects to a sorted list in your terminal window, you can add the --sort-by flag to a supported kubectl command. Sort your objects by specifying any numeric or string field with the --sort-by flag. To specify a field, use a jsonpath expression.
Syntax
kubectl [command] [TYPE] [NAME] --sort-by=<jsonpath_exp>
Example
To print a list of pods sorted by name, you run:
kubectl get pods --sort-by=.metadata.name
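The sort key can also be numeric; for example, the restart-count sort that reappears in the cheat sheet later in this document:
kubectl get pods --sort-by='.status.containerStatuses[0].restartCount'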
Examples: Common operations
Use the following set of examples to help you familiarize yourself with running the commonly used kubectl
operations:
kubectl apply
- Apply or Update a resource from a file or stdin.
# Create a service using the definition in example-service.yaml.
kubectl apply -f example-service.yaml
# Create a replication controller using the definition in example-controller.yaml.
kubectl apply -f example-controller.yaml
# Create the objects that are defined in any .yaml, .yml, or .json file within the <directory> directory.
kubectl apply -f <directory>
kubectl get
- List one or more resources.
# List all pods in plain-text output format.
kubectl get pods
# List all pods in plain-text output format and include additional information (such as node name).
kubectl get pods -o wide
# List the replication controller with the specified name in plain-text output format. Tip: You can shorten and replace the 'replicationcontroller' resource type with the alias 'rc'.
kubectl get replicationcontroller <rc-name>
# List all replication controllers and services together in plain-text output format.
kubectl get rc,services
# List all daemon sets in plain-text output format.
kubectl get ds
# List all pods running on node server01
kubectl get pods --field-selector=spec.nodeName=server01
kubectl describe
- Display detailed state of one or more resources, including the uninitialized ones by default.
# Display the details of the node with name <node-name>.
kubectl describe nodes <node-name>
# Display the details of the pod with name <pod-name>.
kubectl describe pods/<pod-name>
# Display the details of all the pods that are managed by the replication controller named <rc-name>.
# Remember: Any pods that are created by the replication controller get prefixed with the name of the replication controller.
kubectl describe pods <rc-name>
# Describe all pods
kubectl describe pods
The kubectl get command is usually used for retrieving one or more
resources of the same resource type. It features a rich set of flags that allows
you to customize the output format using the -o or --output flag, for example.
You can specify the -w or --watch flag to start watching updates to a particular
object. The kubectl describe command is more focused on describing the many
related aspects of a specified resource. It may invoke several API calls to the
API server to build a view for the user. For example, the kubectl describe node
command retrieves not only the information about the node, but also a summary of
the pods running on it, the events generated for the node, etc.
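A quick contrast of the two commands (my-node is a hypothetical node name):
kubectl get node my-node -o wide    # one tabular line per resource
kubectl describe node my-node       # aggregated detail: capacity, conditions, pods, events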
kubectl delete
- Delete resources either from a file, stdin, or specifying label selectors, names, resource selectors, or resources.
# Delete a pod using the type and name specified in the pod.yaml file.
kubectl delete -f pod.yaml
# Delete all the pods and services that have the label '<label-key>=<label-value>'.
kubectl delete pods,services -l <label-key>=<label-value>
# Delete all pods, including uninitialized ones.
kubectl delete pods --all
kubectl exec
- Execute a command against a container in a pod.
# Get output from running 'date' from pod <pod-name>. By default, output is from the first container.
kubectl exec <pod-name> -- date
# Get output from running 'date' in container <container-name> of pod <pod-name>.
kubectl exec <pod-name> -c <container-name> -- date
# Get an interactive TTY and run /bin/bash from pod <pod-name>. By default, output is from the first container.
kubectl exec -ti <pod-name> -- /bin/bash
kubectl logs
- Print the logs for a container in a pod.
# Return a snapshot of the logs from pod <pod-name>.
kubectl logs <pod-name>
# Start streaming the logs from pod <pod-name>. This is similar to the 'tail -f' Linux command.
kubectl logs -f <pod-name>
kubectl diff
- View a diff of the proposed updates to a cluster.
# Diff resources included in "pod.json".
kubectl diff -f pod.json
# Diff file read from stdin.
cat service.yaml | kubectl diff -f -
Examples: Creating and using plugins
Use the following set of examples to help you familiarize yourself with writing and using kubectl
plugins:
# create a simple plugin in any language and name the resulting executable file
# so that it begins with the prefix "kubectl-"
cat ./kubectl-hello
#!/bin/sh
# this plugin prints the words "hello world"
echo "hello world"
With a plugin written, let's make it executable:
chmod a+x ./kubectl-hello
# and move it to a location in our PATH
sudo mv ./kubectl-hello /usr/local/bin
sudo chown root:root /usr/local/bin/kubectl-hello
# You have now created and "installed" a kubectl plugin.
# You can begin using this plugin by invoking it from kubectl as if it were a regular command
kubectl hello
hello world
# You can "uninstall" a plugin, by removing it from the folder in your
# $PATH where you placed it
sudo rm /usr/local/bin/kubectl-hello
In order to view all of the plugins that are available to kubectl
, use
the kubectl plugin list
subcommand:
kubectl plugin list
The output is similar to:
The following kubectl-compatible plugins are available:
/usr/local/bin/kubectl-hello
/usr/local/bin/kubectl-foo
/usr/local/bin/kubectl-bar
kubectl plugin list
also warns you about plugins that are not
executable, or that are shadowed by other plugins; for example:
sudo chmod -x /usr/local/bin/kubectl-foo # remove execute permission
kubectl plugin list
The following kubectl-compatible plugins are available:
/usr/local/bin/kubectl-hello
/usr/local/bin/kubectl-foo
- warning: /usr/local/bin/kubectl-foo identified as a plugin, but it is not executable
/usr/local/bin/kubectl-bar
error: one plugin warning was found
You can think of plugins as a means to build more complex functionality on top of the existing kubectl commands:
cat ./kubectl-whoami
The next few examples assume that you have already given kubectl-whoami the following contents:
#!/bin/bash
# this plugin makes use of the `kubectl config` command in order to output
# information about the current user, based on the currently selected context
kubectl config view --template='{{ range .contexts }}{{ if eq .name "'$(kubectl config current-context)'" }}Current user: {{ printf "%s\n" .context.user }}{{ end }}{{ end }}'
Running the above command gives you an output containing the user for the current context in your KUBECONFIG file:
# make the file executable
sudo chmod +x ./kubectl-whoami
# and move it into your PATH
sudo mv ./kubectl-whoami /usr/local/bin
kubectl whoami
Current user: plugins-user
What's next
- Read the
kubectl
reference documentation:- the kubectl command reference
- the command line arguments reference
- Learn about
kubectl
usage conventions - Read about JSONPath support in kubectl
- Read about how to extend kubectl with plugins
- To find out more about plugins, take a look at the example CLI plugin.
6.10.1 - kubectl Cheat Sheet
This page contains a list of commonly used kubectl
commands and flags.
Kubectl autocomplete
BASH
source <(kubectl completion bash) # setup autocomplete in bash into the current shell, bash-completion package should be installed first.
echo "source <(kubectl completion bash)" >> ~/.bashrc # add autocomplete permanently to your bash shell.
You can also use a shorthand alias for kubectl
that also works with completion:
alias k=kubectl
complete -F __start_kubectl k
ZSH
source <(kubectl completion zsh) # setup autocomplete in zsh into the current shell
echo "[[ $commands[kubectl] ]] && source <(kubectl completion zsh)" >> ~/.zshrc # add autocomplete permanently to your zsh shell
A Note on --all-namespaces
Appending --all-namespaces happens frequently enough that you should be aware of its shorthand:
kubectl -A
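For example:
kubectl get pods -A        # same as: kubectl get pods --all-namespaces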
Kubectl context and configuration
Set which Kubernetes cluster kubectl communicates with and modify its configuration information. See the Authenticating Across Clusters with kubeconfig documentation for detailed config file information.
kubectl config view # Show Merged kubeconfig settings.
# use multiple kubeconfig files at the same time and view merged config
KUBECONFIG=~/.kube/config:~/.kube/kubeconfig2
kubectl config view
# get the password for the e2e user
kubectl config view -o jsonpath='{.users[?(@.name == "e2e")].user.password}'
kubectl config view -o jsonpath='{.users[].name}' # display the first user
kubectl config view -o jsonpath='{.users[*].name}' # get a list of users
kubectl config get-contexts # display list of contexts
kubectl config current-context # display the current-context
kubectl config use-context my-cluster-name # set the default context to my-cluster-name
# add a new user to your kubeconf that supports basic auth
kubectl config set-credentials kubeuser/foo.kubernetes.com --username=kubeuser --password=kubepassword
# permanently save the namespace for all subsequent kubectl commands in that context.
kubectl config set-context --current --namespace=ggckad-s2
# set a context utilizing a specific username and namespace.
kubectl config set-context gce --user=cluster-admin --namespace=foo \
&& kubectl config use-context gce
kubectl config unset users.foo # delete user foo
# short alias to set/show context/namespace (only works for bash and bash-compatible shells, current context to be set before using kn to set namespace)
alias kx='f() { [ "$1" ] && kubectl config use-context $1 || kubectl config current-context ; } ; f'
alias kn='f() { [ "$1" ] && kubectl config set-context --current --namespace $1 || kubectl config view --minify | grep namespace | cut -d" " -f6 ; } ; f'
Kubectl apply
apply manages applications through files defining Kubernetes resources. It creates and updates resources in a cluster by running kubectl apply. This is the recommended way of managing Kubernetes applications in production. See the Kubectl Book.
Creating objects
Kubernetes manifests can be defined in YAML or JSON. The file extensions .yaml, .yml, and .json can be used.
kubectl apply -f ./my-manifest.yaml # create resource(s)
kubectl apply -f ./my1.yaml -f ./my2.yaml # create from multiple files
kubectl apply -f ./dir # create resource(s) in all manifest files in dir
kubectl apply -f https://git.io/vPieo # create resource(s) from url
kubectl create deployment nginx --image=nginx # start a single instance of nginx
# create a Job which prints "Hello World"
kubectl create job hello --image=busybox:1.28 -- echo "Hello World"
# create a CronJob that prints "Hello World" every minute
kubectl create cronjob hello --image=busybox:1.28 --schedule="*/1 * * * *" -- echo "Hello World"
kubectl explain pods # get the documentation for pod manifests
# Create multiple YAML objects from stdin
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
name: busybox-sleep
spec:
containers:
- name: busybox
image: busybox:1.28
args:
- sleep
- "1000000"
---
apiVersion: v1
kind: Pod
metadata:
name: busybox-sleep-less
spec:
containers:
- name: busybox
image: busybox:1.28
args:
- sleep
- "1000"
EOF
# Create a secret with several keys
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Secret
metadata:
name: mysecret
type: Opaque
data:
password: $(echo -n "s33msi4" | base64 -w0)
username: $(echo -n "jane" | base64 -w0)
EOF
Viewing, finding resources
# Get commands with basic output
kubectl get services # List all services in the namespace
kubectl get pods --all-namespaces # List all pods in all namespaces
kubectl get pods -o wide # List all pods in the current namespace, with more details
kubectl get deployment my-dep # List a particular deployment
kubectl get pods # List all pods in the namespace
kubectl get pod my-pod -o yaml # Get a pod's YAML
# Describe commands with verbose output
kubectl describe nodes my-node
kubectl describe pods my-pod
# List Services Sorted by Name
kubectl get services --sort-by=.metadata.name
# List pods Sorted by Restart Count
kubectl get pods --sort-by='.status.containerStatuses[0].restartCount'
# List PersistentVolumes sorted by capacity
kubectl get pv --sort-by=.spec.capacity.storage
# Get the version label of all pods with label app=cassandra
kubectl get pods --selector=app=cassandra -o \
jsonpath='{.items[*].metadata.labels.version}'
# Retrieve the value of a key with dots, e.g. 'ca.crt'
kubectl get configmap myconfig \
-o jsonpath='{.data.ca\.crt}'
# Get all worker nodes (use a selector to exclude results that have a label
# named 'node-role.kubernetes.io/master')
kubectl get node --selector='!node-role.kubernetes.io/master'
# Get all running pods in the namespace
kubectl get pods --field-selector=status.phase=Running
# Get ExternalIPs of all nodes
kubectl get nodes -o jsonpath='{.items[*].status.addresses[?(@.type=="ExternalIP")].address}'
# List Names of Pods that belong to Particular RC
# "jq" command useful for transformations that are too complex for jsonpath, it can be found at https://stedolan.github.io/jq/
sel=${$(kubectl get rc my-rc --output=json | jq -j '.spec.selector | to_entries | .[] | "\(.key)=\(.value),"')%?}
echo $(kubectl get pods --selector=$sel --output=jsonpath={.items..metadata.name})
# Show labels for all pods (or any other Kubernetes object that supports labelling)
kubectl get pods --show-labels
# Check which nodes are ready
JSONPATH='{range .items[*]}{@.metadata.name}:{range @.status.conditions[*]}{@.type}={@.status};{end}{end}' \
&& kubectl get nodes -o jsonpath="$JSONPATH" | grep "Ready=True"
# Output decoded secrets without external tools
kubectl get secret my-secret -o go-template='{{range $k,$v := .data}}{{"### "}}{{$k}}{{"\n"}}{{$v|base64decode}}{{"\n\n"}}{{end}}'
# List all Secrets currently in use by a pod
kubectl get pods -o json | jq '.items[].spec.containers[].env[]?.valueFrom.secretKeyRef.name' | grep -v null | sort | uniq
# List all containerIDs of initContainer of all pods
# Helpful when cleaning up stopped containers, while avoiding removal of initContainers.
kubectl get pods --all-namespaces -o jsonpath='{range .items[*].status.initContainerStatuses[*]}{.containerID}{"\n"}{end}' | cut -d/ -f3
# List Events sorted by timestamp
kubectl get events --sort-by=.metadata.creationTimestamp
# Compares the current state of the cluster against the state that the cluster would be in if the manifest was applied.
kubectl diff -f ./my-manifest.yaml
# Produce a period-delimited tree of all keys returned for nodes
# Helpful when locating a key within a complex nested JSON structure
kubectl get nodes -o json | jq -c 'paths|join(".")'
# Produce a period-delimited tree of all keys returned for pods, etc
kubectl get pods -o json | jq -c 'paths|join(".")'
# Produce ENV for all pods, assuming you have a default container for the pods, default namespace and the `env` command is supported.
# Helpful when running any supported command across all pods, not just `env`
for pod in $(kubectl get po --output=jsonpath={.items..metadata.name}); do echo $pod && kubectl exec -it $pod -- env; done
Updating resources
kubectl set image deployment/frontend www=image:v2 # Rolling update "www" containers of "frontend" deployment, updating the image
kubectl rollout history deployment/frontend # Check the history of deployments including the revision
kubectl rollout undo deployment/frontend # Rollback to the previous deployment
kubectl rollout undo deployment/frontend --to-revision=2 # Rollback to a specific revision
kubectl rollout status -w deployment/frontend # Watch rolling update status of "frontend" deployment until completion
kubectl rollout restart deployment/frontend # Rolling restart of the "frontend" deployment
cat pod.json | kubectl replace -f - # Replace a pod based on the JSON passed into stdin
# Force replace, delete and then re-create the resource. Will cause a service outage.
kubectl replace --force -f ./pod.json
# Create a service for a replicated nginx, which serves on port 80 and connects to the containers on port 8000
kubectl expose rc nginx --port=80 --target-port=8000
# Update a single-container pod's image version (tag) to v4
kubectl get pod mypod -o yaml | sed 's/\(image: myimage\):.*$/\1:v4/' | kubectl replace -f -
kubectl label pods my-pod new-label=awesome # Add a Label
kubectl annotate pods my-pod icon-url=http://goo.gl/XXBTWq # Add an annotation
kubectl autoscale deployment foo --min=2 --max=10 # Auto scale a deployment "foo"
Patching resources
# Partially update a node
kubectl patch node k8s-node-1 -p '{"spec":{"unschedulable":true}}'
# Update a container's image; spec.containers[*].name is required because it's a merge key
kubectl patch pod valid-pod -p '{"spec":{"containers":[{"name":"kubernetes-serve-hostname","image":"new image"}]}}'
# Update a container's image using a json patch with positional arrays
kubectl patch pod valid-pod --type='json' -p='[{"op": "replace", "path": "/spec/containers/0/image", "value":"new image"}]'
# Disable a deployment livenessProbe using a json patch with positional arrays
kubectl patch deployment valid-deployment --type json -p='[{"op": "remove", "path": "/spec/template/spec/containers/0/livenessProbe"}]'
# Add a new element to a positional array
kubectl patch sa default --type='json' -p='[{"op": "add", "path": "/secrets/1", "value": {"name": "whatever" } }]'
Editing resources
Edit any API resource in your preferred editor.
kubectl edit svc/docker-registry # Edit the service named docker-registry
KUBE_EDITOR="nano" kubectl edit svc/docker-registry # Use an alternative editor
Scaling resources
kubectl scale --replicas=3 rs/foo # Scale a replicaset named 'foo' to 3
kubectl scale --replicas=3 -f foo.yaml # Scale a resource specified in "foo.yaml" to 3
kubectl scale --current-replicas=2 --replicas=3 deployment/mysql # If the deployment named mysql's current size is 2, scale mysql to 3
kubectl scale --replicas=5 rc/foo rc/bar rc/baz # Scale multiple replication controllers
Deleting resources
kubectl delete -f ./pod.json # Delete a pod using the type and name specified in pod.json
kubectl delete pod unwanted --now # Delete a pod with no grace period
kubectl delete pod,service baz foo # Delete pods and services with same names "baz" and "foo"
kubectl delete pods,services -l name=myLabel # Delete pods and services with label name=myLabel
kubectl -n my-ns delete pod,svc --all                        # Delete all pods and services in namespace my-ns
# Delete all pods matching the awk pattern1 or pattern2
kubectl get pods -n mynamespace --no-headers=true | awk '/pattern1|pattern2/{print $1}' | xargs kubectl delete -n mynamespace pod
Interacting with running Pods
kubectl logs my-pod # dump pod logs (stdout)
kubectl logs -l name=myLabel # dump pod logs, with label name=myLabel (stdout)
kubectl logs my-pod --previous # dump pod logs (stdout) for a previous instantiation of a container
kubectl logs my-pod -c my-container # dump pod container logs (stdout, multi-container case)
kubectl logs -l name=myLabel -c my-container # dump pod logs, with label name=myLabel (stdout)
kubectl logs my-pod -c my-container --previous # dump pod container logs (stdout, multi-container case) for a previous instantiation of a container
kubectl logs -f my-pod # stream pod logs (stdout)
kubectl logs -f my-pod -c my-container # stream pod container logs (stdout, multi-container case)
kubectl logs -f -l name=myLabel --all-containers # stream all pods logs with label name=myLabel (stdout)
kubectl run -i --tty busybox --image=busybox:1.28 -- sh # Run pod as interactive shell
kubectl run nginx --image=nginx -n mynamespace # Start a single instance of nginx pod in the namespace of mynamespace
kubectl run nginx --image=nginx --dry-run=client -o yaml > pod.yaml
# Generate the spec for running pod nginx and write it into a file called pod.yaml
kubectl attach my-pod -i # Attach to Running Container
kubectl port-forward my-pod 5000:6000 # Listen on port 5000 on the local machine and forward to port 6000 on my-pod
kubectl exec my-pod -- ls / # Run command in existing pod (1 container case)
kubectl exec --stdin --tty my-pod -- /bin/sh # Interactive shell access to a running pod (1 container case)
kubectl exec my-pod -c my-container -- ls / # Run command in existing pod (multi-container case)
kubectl top pod POD_NAME --containers # Show metrics for a given pod and its containers
kubectl top pod POD_NAME --sort-by=cpu # Show metrics for a given pod and sort it by 'cpu' or 'memory'
Copy files and directories to and from containers
kubectl cp /tmp/foo_dir my-pod:/tmp/bar_dir # Copy /tmp/foo_dir local directory to /tmp/bar_dir in a remote pod in the current namespace
kubectl cp /tmp/foo my-pod:/tmp/bar -c my-container # Copy /tmp/foo local file to /tmp/bar in a remote pod in a specific container
kubectl cp /tmp/foo my-namespace/my-pod:/tmp/bar # Copy /tmp/foo local file to /tmp/bar in a remote pod in namespace my-namespace
kubectl cp my-namespace/my-pod:/tmp/foo /tmp/bar # Copy /tmp/foo from a remote pod to /tmp/bar locally
kubectl cp requires that the 'tar' binary is present in your container image. If 'tar' is not present, kubectl cp will fail.
For advanced use cases, such as symlinks, wildcard expansion or file mode preservation, consider using kubectl exec.
tar cf - /tmp/foo | kubectl exec -i -n my-namespace my-pod -- tar xf - -C /tmp/bar # Copy /tmp/foo local file to /tmp/bar in a remote pod in namespace my-namespace
kubectl exec -n my-namespace my-pod -- tar cf - /tmp/foo | tar xf - -C /tmp/bar # Copy /tmp/foo from a remote pod to /tmp/bar locally
Interacting with Deployments and Services
kubectl logs deploy/my-deployment # dump Pod logs for a Deployment (single-container case)
kubectl logs deploy/my-deployment -c my-container # dump Pod logs for a Deployment (multi-container case)
kubectl port-forward svc/my-service 5000 # listen on local port 5000 and forward to port 5000 on Service backend
kubectl port-forward svc/my-service 5000:my-service-port # listen on local port 5000 and forward to Service target port with name <my-service-port>
kubectl port-forward deploy/my-deployment 5000:6000 # listen on local port 5000 and forward to port 6000 on a Pod created by <my-deployment>
kubectl exec deploy/my-deployment -- ls # run command in first Pod and first container in Deployment (single- or multi-container cases)
Interacting with Nodes and cluster
kubectl cordon my-node # Mark my-node as unschedulable
kubectl drain my-node # Drain my-node in preparation for maintenance
kubectl uncordon my-node # Mark my-node as schedulable
kubectl top node my-node # Show metrics for a given node
kubectl cluster-info # Display addresses of the master and services
kubectl cluster-info dump # Dump current cluster state to stdout
kubectl cluster-info dump --output-directory=/path/to/cluster-state # Dump current cluster state to /path/to/cluster-state
# If a taint with that key and effect already exists, its value is replaced as specified.
kubectl taint nodes foo dedicated=special-user:NoSchedule
Resource types
List all supported resource types along with their shortnames, API group, whether they are namespaced, and Kind:
kubectl api-resources
Other operations for exploring API resources:
kubectl api-resources --namespaced=true # All namespaced resources
kubectl api-resources --namespaced=false # All non-namespaced resources
kubectl api-resources -o name # All resources with simple output (only the resource name)
kubectl api-resources -o wide # All resources with expanded (aka "wide") output
kubectl api-resources --verbs=list,get # All resources that support the "list" and "get" request verbs
kubectl api-resources --api-group=extensions # All resources in the "extensions" API group
Formatting output
To output details to your terminal window in a specific format, add the -o (or --output) flag to a supported kubectl command.
Output format | Description
---|---
-o=custom-columns=<spec> | Print a table using a comma-separated list of custom columns
-o=custom-columns-file=<filename> | Print a table using the custom columns template in the <filename> file
-o=json | Output a JSON formatted API object
-o=jsonpath=<template> | Print the fields defined in a jsonpath expression
-o=jsonpath-file=<filename> | Print the fields defined by the jsonpath expression in the <filename> file
-o=name | Print only the resource name and nothing else
-o=wide | Output in the plain-text format with any additional information, and for pods, the node name is included
-o=yaml | Output a YAML formatted API object
Examples using -o=custom-columns:
# All images running in a cluster
kubectl get pods -A -o=custom-columns='DATA:spec.containers[*].image'
# All images running in namespace: default, grouped by Pod
kubectl get pods --namespace default --output=custom-columns="NAME:.metadata.name,IMAGE:.spec.containers[*].image"
# All images excluding "k8s.gcr.io/coredns:1.6.2"
kubectl get pods -A -o=custom-columns='DATA:spec.containers[?(@.image!="k8s.gcr.io/coredns:1.6.2")].image'
# All fields under metadata regardless of name
kubectl get pods -A -o=custom-columns='DATA:metadata.*'
More examples in the kubectl reference documentation.
Kubectl output verbosity and debugging
Kubectl verbosity is controlled with the -v or --v flags followed by an integer representing the log level. General Kubernetes logging conventions and the associated log levels are described here.
Verbosity | Description
---|---
--v=0 | Generally useful for this to always be visible to a cluster operator.
--v=1 | A reasonable default log level if you don't want verbosity.
--v=2 | Useful steady state information about the service and important log messages that may correlate to significant changes in the system. This is the recommended default log level for most systems.
--v=3 | Extended information about changes.
--v=4 | Debug level verbosity.
--v=5 | Trace level verbosity.
--v=6 | Display requested resources.
--v=7 | Display HTTP request headers.
--v=8 | Display HTTP request contents.
--v=9 | Display HTTP request contents without truncation of contents.
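As a sketch, to watch the raw HTTP traffic behind a single command (the log output goes to stderr):
kubectl get pods --v=8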
What's next
-
Read the kubectl overview and learn about JsonPath.
-
See kubectl options.
-
Also read kubectl Usage Conventions to understand how to use kubectl in reusable scripts.
-
See more community kubectl cheatsheets.
6.10.2 - kubectl Commands
6.10.3 - kubectl
Synopsis
kubectl controls the Kubernetes cluster manager.
Find more information at: https://kubernetes.io/docs/reference/kubectl/overview/
kubectl [flags]
Options
Flag | Description
---|---
--add-dir-header | If true, adds the file directory to the header of the log messages
--alsologtostderr | log to standard error as well as files
--as string | Username to impersonate for the operation
--as-group stringArray | Group to impersonate for the operation, this flag can be repeated to specify multiple groups.
--azure-container-registry-config string | Path to the file containing Azure container registry configuration information.
--cache-dir string Default: "$HOME/.kube/cache" | Default cache directory
--certificate-authority string | Path to a cert file for the certificate authority
--client-certificate string | Path to a client certificate file for TLS
--client-key string | Path to a client key file for TLS
--cloud-provider-gce-l7lb-src-cidrs cidrs Default: 130.211.0.0/22,35.191.0.0/16 | CIDRs opened in GCE firewall for L7 LB traffic proxy & health checks
--cloud-provider-gce-lb-src-cidrs cidrs Default: 130.211.0.0/22,209.85.152.0/22,209.85.204.0/22,35.191.0.0/16 | CIDRs opened in GCE firewall for L4 LB traffic proxy & health checks
--cluster string | The name of the kubeconfig cluster to use
--context string | The name of the kubeconfig context to use
--default-not-ready-toleration-seconds int Default: 300 | Indicates the tolerationSeconds of the toleration for notReady:NoExecute that is added by default to every pod that does not already have such a toleration.
--default-unreachable-toleration-seconds int Default: 300 | Indicates the tolerationSeconds of the toleration for unreachable:NoExecute that is added by default to every pod that does not already have such a toleration.
-h, --help | help for kubectl
--insecure-skip-tls-verify | If true, the server's certificate will not be checked for validity. This will make your HTTPS connections insecure
--kubeconfig string | Path to the kubeconfig file to use for CLI requests.
--log-backtrace-at traceLocation Default: :0 | when logging hits line file:N, emit a stack trace
--log-dir string | If non-empty, write log files in this directory
--log-file string | If non-empty, use this log file
--log-file-max-size uint Default: 1800 | Defines the maximum size a log file can grow to. Unit is megabytes. If the value is 0, the maximum file size is unlimited.
--log-flush-frequency duration Default: 5s | Maximum number of seconds between log flushes
--logtostderr Default: true | log to standard error instead of files
--match-server-version | Require server version to match client version
-n, --namespace string | If present, the namespace scope for this CLI request
--one-output | If true, only write logs to their native severity level (vs also writing to each lower severity level)
--password string | Password for basic authentication to the API server
--profile string Default: "none" | Name of profile to capture. One of (none|cpu|heap|goroutine|threadcreate|block|mutex)
--profile-output string Default: "profile.pprof" | Name of the file to write the profile to
--request-timeout string Default: "0" | The length of time to wait before giving up on a single server request. Non-zero values should contain a corresponding time unit (e.g. 1s, 2m, 3h). A value of zero means don't timeout requests.
-s, --server string | The address and port of the Kubernetes API server
--skip-headers | If true, avoid header prefixes in the log messages
--skip-log-headers | If true, avoid headers when opening log files
--stderrthreshold severity Default: 2 | logs at or above this threshold go to stderr
--tls-server-name string | Server name to use for server certificate validation. If it is not provided, the hostname used to contact the server is used
--token string | Bearer token for authentication to the API server
--user string | The name of the kubeconfig user to use
--username string | Username for basic authentication to the API server
-v, --v Level | number for the log level verbosity
--version version[=true] | Print version information and quit
--vmodule moduleSpec | comma-separated list of pattern=N settings for file-filtered logging
--warnings-as-errors | Treat warnings received from the server as errors and exit with a non-zero exit code
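As a sketch, several of these global flags combined on one invocation (the namespace and verbosity flags are the most commonly used):
kubectl get pods -n kube-system --request-timeout=30s -v=6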
Environment variables
Environment variable | Description
---|---
KUBECONFIG | Path to the kubectl configuration ("kubeconfig") file. Default: "$HOME/.kube/config"
KUBECTL_COMMAND_HEADERS | When set to false, turns off extra HTTP headers detailing invoked kubectl command (Kubernetes version v1.22 or later)
See Also
- kubectl annotate - Update the annotations on a resource
- kubectl api-resources - Print the supported API resources on the server
- kubectl api-versions - Print the supported API versions on the server, in the form of "group/version"
- kubectl apply - Apply a configuration to a resource by filename or stdin
- kubectl attach - Attach to a running container
- kubectl auth - Inspect authorization
- kubectl autoscale - Auto-scale a Deployment, ReplicaSet, or ReplicationController
- kubectl certificate - Modify certificate resources.
- kubectl cluster-info - Display cluster info
- kubectl completion - Output shell completion code for the specified shell (bash or zsh)
- kubectl config - Modify kubeconfig files
- kubectl cordon - Mark node as unschedulable
- kubectl cp - Copy files and directories to and from containers.
- kubectl create - Create a resource from a file or from stdin.
- kubectl debug - Create debugging sessions for troubleshooting workloads and nodes
- kubectl delete - Delete resources by filenames, stdin, resources and names, or by resources and label selector
- kubectl describe - Show details of a specific resource or group of resources
- kubectl diff - Diff live version against would-be applied version
- kubectl drain - Drain node in preparation for maintenance
- kubectl edit - Edit a resource on the server
- kubectl exec - Execute a command in a container
- kubectl explain - Documentation of resources
- kubectl expose - Take a replication controller, service, deployment or pod and expose it as a new Kubernetes Service
- kubectl get - Display one or many resources
- kubectl kustomize - Build a kustomization target from a directory or a remote url.
- kubectl label - Update the labels on a resource
- kubectl logs - Print the logs for a container in a pod
- kubectl options - Print the list of flags inherited by all commands
- kubectl patch - Update field(s) of a resource
- kubectl plugin - Provides utilities for interacting with plugins.
- kubectl port-forward - Forward one or more local ports to a pod
- kubectl proxy - Run a proxy to the Kubernetes API server
- kubectl replace - Replace a resource by filename or stdin
- kubectl rollout - Manage the rollout of a resource
- kubectl run - Run a particular image on the cluster
- kubectl scale - Set a new size for a Deployment, ReplicaSet or Replication Controller
- kubectl set - Set specific features on objects
- kubectl taint - Update the taints on one or more nodes
- kubectl top - Display Resource (CPU/Memory/Storage) usage.
- kubectl uncordon - Mark node as schedulable
- kubectl version - Print the client and server version information
- kubectl wait - Experimental: Wait for a specific condition on one or many resources.
6.10.4 - JSONPath Support
Kubectl supports JSONPath templates.
A JSONPath template is composed of JSONPath expressions enclosed by curly braces {}. Kubectl uses JSONPath expressions to filter on specific fields in the JSON object and format the output. In addition to the original JSONPath template syntax, the following functions and syntax are valid:
- Use double quotes to quote text inside JSONPath expressions.
- Use the range, end operators to iterate lists.
- Use negative slice indices to step backwards through a list. Negative indices do not "wrap around" a list and are valid as long as -index + listLength >= 0.
- The $ operator is optional since the expression always starts from the root object by default.
- The result object is printed as its String() function.
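For instance, a minimal sketch of the negative-slice rule above (run against any namespace with at least one pod):
kubectl get pods -o=jsonpath='{.items[-1:].metadata.name}'   # name of the last pod in the list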
Given the JSON input:
{
"kind": "List",
"items":[
{
"kind":"None",
"metadata":{"name":"127.0.0.1"},
"status":{
"capacity":{"cpu":"4"},
"addresses":[{"type": "LegacyHostIP", "address":"127.0.0.1"}]
}
},
{
"kind":"None",
"metadata":{"name":"127.0.0.2"},
"status":{
"capacity":{"cpu":"8"},
"addresses":[
{"type": "LegacyHostIP", "address":"127.0.0.2"},
{"type": "another", "address":"127.0.0.3"}
]
}
}
],
"users":[
{
"name": "myself",
"user": {}
},
{
"name": "e2e",
"user": {"username": "admin", "password": "secret"}
}
]
}
Function | Description | Example | Result
---|---|---|---
text | the plain text | kind is {.kind} | kind is List
@ | the current object | {@} | the same as input
. or [] | child operator | {.kind}, {['kind']} or {['name\.type']} | List
.. | recursive descent | {..name} | 127.0.0.1 127.0.0.2 myself e2e
* | wildcard. Get all objects | {.items[*].metadata.name} | [127.0.0.1 127.0.0.2]
[start:end:step] | subscript operator | {.users[0].name} | myself
[,] | union operator | {.items[*]['metadata.name', 'status.capacity']} | 127.0.0.1 127.0.0.2 map[cpu:4] map[cpu:8]
?() | filter | {.users[?(@.name=="e2e")].user.password} | secret
range, end | iterate list | {range .items[*]}[{.metadata.name}, {.status.capacity}] {end} | [127.0.0.1, map[cpu:4]] [127.0.0.2, map[cpu:8]]
'' | quote interpreted string | {range .items[*]}{.metadata.name}{'\t'}{end} | 127.0.0.1 127.0.0.2
Examples using kubectl and JSONPath expressions:
kubectl get pods -o json
kubectl get pods -o=jsonpath='{@}'
kubectl get pods -o=jsonpath='{.items[0]}'
kubectl get pods -o=jsonpath='{.items[0].metadata.name}'
kubectl get pods -o=jsonpath="{.items[*]['metadata.name', 'status.capacity']}"
kubectl get pods -o=jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.startTime}{"\n"}{end}'
On Windows, you must double quote any JSONPath template that contains spaces (not single quote as shown above for bash). This in turn means that you must use a single quote or escaped double quote around any literals in the template. For example:
kubectl get pods -o=jsonpath="{range .items[*]}{.metadata.name}{'\t'}{.status.startTime}{'\n'}{end}"
kubectl get pods -o=jsonpath="{range .items[*]}{.metadata.name}{\"\t\"}{.status.startTime}{\"\n\"}{end}"
JSONPath regular expressions are not supported. If you want to match using regular expressions, you can use a tool such as jq.
# kubectl does not support regular expressions for JSONpath output
# The following command does not work
kubectl get pods -o jsonpath='{.items[?(@.metadata.name=~/^test$/)].metadata.name}'
# The following command achieves the desired result
kubectl get pods -o json | jq -r '.items[] | select(.metadata.name | test("test-")).spec.containers[].image'
6.10.5 - kubectl for Docker Users
You can use the Kubernetes command line tool kubectl to interact with the API Server. Using kubectl is straightforward if you are familiar with the Docker command line tool. However, there are a few differences between the Docker commands and the kubectl commands. The following sections show a Docker sub-command and describe the equivalent kubectl command.
docker run
To run an nginx Deployment and expose the Deployment, see kubectl create deployment. docker:
docker run -d --restart=always -e DOMAIN=cluster --name nginx-app -p 80:80 nginx
55c103fa129692154a7652490236fee9be47d70a8dd562281ae7d2f9a339a6db
docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
55c103fa1296 nginx "nginx -g 'daemon of…" 9 seconds ago Up 9 seconds 0.0.0.0:80->80/tcp nginx-app
kubectl:
# start the pod running nginx
kubectl create deployment --image=nginx nginx-app
deployment.apps/nginx-app created
# add env to nginx-app
kubectl set env deployment/nginx-app DOMAIN=cluster
deployment.apps/nginx-app env updated
kubectl commands print the type and name of the resource created or mutated, which can then be used in subsequent commands. You can expose a new Service after a Deployment is created.
# expose a port through a service
kubectl expose deployment nginx-app --port=80 --name=nginx-http
service "nginx-http" exposed
By using kubectl, you can create a Deployment to ensure that N pods are running nginx, where N is the number of replicas stated in the spec and defaults to 1. You can also create a service with a selector that matches the pod labels. For more information, see Use a Service to Access an Application in a Cluster.
By default images run in the background, similar to docker run -d .... To run things in the foreground, use kubectl run to create a pod:
kubectl run [-i] [--tty] --attach <name> --image=<image>
Unlike docker run ..., if you specify --attach, then you attach stdin, stdout and stderr. You cannot control which streams are attached (docker -a ...).
To detach from the container, you can type the escape sequence Ctrl+P followed by Ctrl+Q.
docker ps
To list what is currently running, see kubectl get.
docker:
docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
14636241935f ubuntu:16.04 "echo test" 5 seconds ago Exited (0) 5 seconds ago cocky_fermi
55c103fa1296 nginx "nginx -g 'daemon of…" About a minute ago Up About a minute 0.0.0.0:80->80/tcp nginx-app
kubectl:
kubectl get po
NAME READY STATUS RESTARTS AGE
nginx-app-8df569cb7-4gd89 1/1 Running 0 3m
ubuntu 0/1 Completed 0 20s
docker attach
To attach a process that is already running in a container, see kubectl attach.
docker:
docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
55c103fa1296 nginx "nginx -g 'daemon of…" 5 minutes ago Up 5 minutes 0.0.0.0:80->80/tcp nginx-app
docker attach 55c103fa1296
...
kubectl:
kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx-app-5jyvm 1/1 Running 0 10m
kubectl attach -it nginx-app-5jyvm
...
To detach from the container, you can type the escape sequence Ctrl+P followed by Ctrl+Q.
docker exec
To execute a command in a container, see kubectl exec.
docker:
docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
55c103fa1296 nginx "nginx -g 'daemon of…" 6 minutes ago Up 6 minutes 0.0.0.0:80->80/tcp nginx-app
docker exec 55c103fa1296 cat /etc/hostname
55c103fa1296
kubectl:
kubectl get po
NAME READY STATUS RESTARTS AGE
nginx-app-5jyvm 1/1 Running 0 10m
kubectl exec nginx-app-5jyvm -- cat /etc/hostname
nginx-app-5jyvm
To use interactive commands:
docker:
docker exec -ti 55c103fa1296 /bin/sh
# exit
kubectl:
kubectl exec -ti nginx-app-5jyvm -- /bin/sh
# exit
For more information, see Get a Shell to a Running Container.
docker logs
To follow stdout/stderr of a process that is running, see kubectl logs.
docker:
docker logs -f a9e
192.168.9.1 - - [14/Jul/2015:01:04:02 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.35.0" "-"
192.168.9.1 - - [14/Jul/2015:01:04:03 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.35.0" "-"
kubectl:
kubectl logs -f nginx-app-zibvs
10.240.63.110 - - [14/Jul/2015:01:09:01 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.26.0" "-"
10.240.63.110 - - [14/Jul/2015:01:09:02 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.26.0" "-"
There is a slight difference between pods and containers; by default pods do not terminate if their processes exit. Instead, the pod restarts the process. This is similar to the docker run option --restart=always, with one major difference: in Docker, the output for each invocation of the process is concatenated, but in Kubernetes, each invocation is separate. To see the output from a previous run in Kubernetes, do this:
kubectl logs --previous nginx-app-zibvs
10.240.63.110 - - [14/Jul/2015:01:09:01 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.26.0" "-"
10.240.63.110 - - [14/Jul/2015:01:09:02 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.26.0" "-"
For more information, see Logging Architecture.
docker stop and docker rm
To stop and delete a running process, see kubectl delete.
docker:
docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
a9ec34d98787 nginx "nginx -g 'daemon of" 22 hours ago Up 22 hours 0.0.0.0:80->80/tcp, 443/tcp nginx-app
docker stop a9ec34d98787
a9ec34d98787
docker rm a9ec34d98787
a9ec34d98787
kubectl:
kubectl get deployment nginx-app
NAME READY UP-TO-DATE AVAILABLE AGE
nginx-app 1/1 1 1 2m
kubectl get po -l app=nginx-app
NAME READY STATUS RESTARTS AGE
nginx-app-2883164633-aklf7 1/1 Running 0 2m
kubectl delete deployment nginx-app
deployment "nginx-app" deleted
kubectl get po -l app=nginx-app
# Return nothing
docker login
There is no direct analog of docker login in kubectl. If you are interested in using Kubernetes with a private registry, see Using a Private Registry.
docker version
To get the version of client and server, see kubectl version.
docker:
docker version
Client version: 1.7.0
Client API version: 1.19
Go version (client): go1.4.2
Git commit (client): 0baf609
OS/Arch (client): linux/amd64
Server version: 1.7.0
Server API version: 1.19
Go version (server): go1.4.2
Git commit (server): 0baf609
OS/Arch (server): linux/amd64
kubectl:
kubectl version
Client Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.9+a3d1dfa6f4335", GitCommit:"9b77fed11a9843ce3780f70dd251e92901c43072", GitTreeState:"dirty", BuildDate:"2017-08-29T20:32:58Z", OpenPaasKubernetesVersion:"v1.03.02", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.9+a3d1dfa6f4335", GitCommit:"9b77fed11a9843ce3780f70dd251e92901c43072", GitTreeState:"dirty", BuildDate:"2017-08-29T20:32:58Z", OpenPaasKubernetesVersion:"v1.03.02", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}
docker info
To get miscellaneous information about the environment and configuration, see kubectl cluster-info.
docker:
docker info
Containers: 40
Images: 168
Storage Driver: aufs
Root Dir: /usr/local/google/docker/aufs
Backing Filesystem: extfs
Dirs: 248
Dirperm1 Supported: false
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 3.13.0-53-generic
Operating System: Ubuntu 14.04.2 LTS
CPUs: 12
Total Memory: 31.32 GiB
Name: k8s-is-fun.mtv.corp.google.com
ID: ADUV:GCYR:B3VJ:HMPO:LNPQ:KD5S:YKFQ:76VN:IANZ:7TFV:ZBF4:BYJO
WARNING: No swap limit support
kubectl:
kubectl cluster-info
Kubernetes master is running at https://203.0.113.141
KubeDNS is running at https://203.0.113.141/api/v1/namespaces/kube-system/services/kube-dns/proxy
kubernetes-dashboard is running at https://203.0.113.141/api/v1/namespaces/kube-system/services/kubernetes-dashboard/proxy
Grafana is running at https://203.0.113.141/api/v1/namespaces/kube-system/services/monitoring-grafana/proxy
Heapster is running at https://203.0.113.141/api/v1/namespaces/kube-system/services/monitoring-heapster/proxy
InfluxDB is running at https://203.0.113.141/api/v1/namespaces/kube-system/services/monitoring-influxdb/proxy
6.10.6 - kubectl Usage Conventions
Recommended usage conventions for kubectl.
Using kubectl in Reusable Scripts
For a stable output in a script:
- Request one of the machine-oriented output forms, such as -o name, -o json, -o yaml, -o go-template, or -o jsonpath (see the sketch after this list).
- Fully-qualify the version. For example, jobs.v1.batch/myjob. This will ensure that kubectl does not use its default version that can change over time.
- Don't rely on context, preferences, or other implicit states.
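A minimal sketch of these conventions in a script, reusing the fully-qualified jobs.v1.batch/myjob example above:
# machine-oriented output form with a fully-qualified resource
kubectl get jobs.v1.batch/myjob -o name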
Best Practices
kubectl run
For kubectl run to satisfy infrastructure as code:
- Tag the image with a version-specific tag and don't move that tag to a new version. For example, use :v1234, v1.2.3, r03062016-1-4, rather than :latest (for more information, see Best Practices for Configuration).
- Check in the script for an image that is heavily parameterized.
- Switch to configuration files checked into source control for features that are needed, but not expressible via kubectl run flags.
You can use the --dry-run=client flag to preview the object that would be sent to your cluster, without really submitting it.
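For example (the image tag is illustrative):
# print the Deployment manifest that would be created, without submitting it
kubectl create deployment nginx-app --image=nginx:1.21 --dry-run=client -o yaml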
kubectl apply
- You can use kubectl apply to create or update resources. For more information about using kubectl apply to update resources, see Kubectl Book.
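For example, assuming a manifest file checked into source control (the path is illustrative):
# creates the resources on the first run; updates them in place on later runs
kubectl apply -f ./nginx-deployment.yaml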
6.11 - Component tools
6.11.1 - Feature Gates
This page contains an overview of the various feature gates an administrator can specify on different Kubernetes components.
See feature stages for an explanation of the stages for a feature.
Overview
Feature gates are a set of key=value pairs that describe Kubernetes features.
You can turn these features on or off using the --feature-gates command line flag on each Kubernetes component.
Each Kubernetes component lets you enable or disable a set of feature gates that are relevant to that component.
Use the -h flag to see the full set of feature gates for all components.
To set feature gates for a component, such as the kubelet, use the --feature-gates flag assigned to a list of feature pairs:
--feature-gates=...,GracefulNodeShutdown=true
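For the kubelet, the same gates can instead be set in the kubelet configuration file; a minimal sketch (the chosen gates are illustrative):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  GracefulNodeShutdown: true
  GracefulNodeShutdownBasedOnPodPriority: true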
The following tables are a summary of the feature gates that you can set on different Kubernetes components.
- The "Since" column contains the Kubernetes release when a feature is introduced or its release stage is changed.
- The "Until" column, if not empty, contains the last Kubernetes release in which you can still use a feature gate.
- If a feature is in the Alpha or Beta state, you can find the feature listed in the Alpha/Beta feature gate table.
- If a feature is stable you can find all stages for that feature listed in the Graduated/Deprecated feature gate table.
- The Graduated/Deprecated feature gate table also lists deprecated and withdrawn features.
Feature gates for Alpha or Beta features
Feature | Default | Stage | Since | Until |
---|---|---|---|---|
APIListChunking | false | Alpha | 1.8 | 1.8 |
APIListChunking | true | Beta | 1.9 | |
APIPriorityAndFairness | false | Alpha | 1.18 | 1.19 |
APIPriorityAndFairness | true | Beta | 1.20 | |
APIResponseCompression | false | Alpha | 1.7 | 1.15 |
APIResponseCompression | true | Beta | 1.16 | |
APIServerIdentity | false | Alpha | 1.20 | |
APIServerTracing | false | Alpha | 1.22 | |
AllowInsecureBackendProxy | true | Beta | 1.17 | |
AnyVolumeDataSource | false | Alpha | 1.18 | |
AppArmor | true | Beta | 1.4 | |
ControllerManagerLeaderMigration | false | Alpha | 1.21 | 1.21 |
ControllerManagerLeaderMigration | true | Beta | 1.22 | |
CPUManager | false | Alpha | 1.8 | 1.9 |
CPUManager | true | Beta | 1.10 | |
CPUManagerPolicyAlphaOptions | false | Alpha | 1.23 | |
CPUManagerPolicyBetaOptions | true | Beta | 1.23 | |
CPUManagerPolicyOptions | false | Alpha | 1.22 | 1.22 |
CPUManagerPolicyOptions | true | Beta | 1.23 | |
CSIInlineVolume | false | Alpha | 1.15 | 1.15 |
CSIInlineVolume | true | Beta | 1.16 | - |
CSIMigration | false | Alpha | 1.14 | 1.16 |
CSIMigration | true | Beta | 1.17 | |
CSIMigrationAWS | false | Alpha | 1.14 | 1.16 |
CSIMigrationAWS | false | Beta | 1.17 | 1.22 |
CSIMigrationAWS | true | Beta | 1.23 | |
CSIMigrationAzureDisk | false | Alpha | 1.15 | 1.18 |
CSIMigrationAzureDisk | false | Beta | 1.19 | 1.22 |
CSIMigrationAzureDisk | true | Beta | 1.23 | |
CSIMigrationAzureFile | false | Alpha | 1.15 | 1.19 |
CSIMigrationAzureFile | false | Beta | 1.21 | |
CSIMigrationGCE | false | Alpha | 1.14 | 1.16 |
CSIMigrationGCE | false | Beta | 1.17 | 1.22 |
CSIMigrationGCE | true | Beta | 1.23 | |
CSIMigrationOpenStack | false | Alpha | 1.14 | 1.17 |
CSIMigrationOpenStack | true | Beta | 1.18 | |
CSIMigrationvSphere | false | Beta | 1.19 | |
CSIMigrationPortworx | false | Alpha | 1.23 | |
csiMigrationRBD | false | Alpha | 1.23 | |
CSIStorageCapacity | false | Alpha | 1.19 | 1.20 |
CSIStorageCapacity | true | Beta | 1.21 | |
CSIVolumeHealth | false | Alpha | 1.21 | |
CSRDuration | true | Beta | 1.22 | |
CustomCPUCFSQuotaPeriod | false | Alpha | 1.12 | |
CustomResourceValidationExpressions | false | Alpha | 1.23 | |
DaemonSetUpdateSurge | false | Alpha | 1.21 | 1.21 |
DaemonSetUpdateSurge | true | Beta | 1.22 | |
DefaultPodTopologySpread | false | Alpha | 1.19 | 1.19 |
DefaultPodTopologySpread | true | Beta | 1.20 | |
DelegateFSGroupToCSIDriver | false | Alpha | 1.22 | 1.22 |
DelegateFSGroupToCSIDriver | true | Beta | 1.23 | |
DevicePlugins | false | Alpha | 1.8 | 1.9 |
DevicePlugins | true | Beta | 1.10 | |
DisableAcceleratorUsageMetrics | false | Alpha | 1.19 | 1.19 |
DisableAcceleratorUsageMetrics | true | Beta | 1.20 | |
DisableCloudProviders | false | Alpha | 1.22 | |
DisableKubeletCloudCredentialProviders | false | Alpha | 1.23 | |
DownwardAPIHugePages | false | Alpha | 1.20 | 1.20 |
DownwardAPIHugePages | false | Beta | 1.21 | |
EfficientWatchResumption | false | Alpha | 1.20 | 1.20 |
EfficientWatchResumption | true | Beta | 1.21 | |
EndpointSliceTerminatingCondition | false | Alpha | 1.20 | 1.21 |
EndpointSliceTerminatingCondition | true | Beta | 1.22 | |
EphemeralContainers | false | Alpha | 1.16 | 1.22 |
EphemeralContainers | true | Beta | 1.23 | |
ExpandCSIVolumes | false | Alpha | 1.14 | 1.15 |
ExpandCSIVolumes | true | Beta | 1.16 | |
ExpandedDNSConfig | false | Alpha | 1.22 | |
ExpandInUsePersistentVolumes | false | Alpha | 1.11 | 1.14 |
ExpandInUsePersistentVolumes | true | Beta | 1.15 | |
ExpandPersistentVolumes | false | Alpha | 1.8 | 1.10 |
ExpandPersistentVolumes | true | Beta | 1.11 | |
ExperimentalHostUserNamespaceDefaulting | false | Beta | 1.5 | |
GracefulNodeShutdown | false | Alpha | 1.20 | 1.20 |
GracefulNodeShutdown | true | Beta | 1.21 | |
GracefulNodeShutdownBasedOnPodPriority | false | Alpha | 1.23 | |
GRPCContainerProbe | false | Alpha | 1.23 | |
HonorPVReclaimPolicy | false | Alpha | 1.23 | |
HPAContainerMetrics | false | Alpha | 1.20 | |
HPAScaleToZero | false | Alpha | 1.16 | |
IdentifyPodOS | false | Alpha | 1.23 | |
IndexedJob | false | Alpha | 1.21 | 1.21 |
IndexedJob | true | Beta | 1.22 | |
InTreePluginAWSUnregister | false | Alpha | 1.21 | |
InTreePluginAzureDiskUnregister | false | Alpha | 1.21 | |
InTreePluginAzureFileUnregister | false | Alpha | 1.21 | |
InTreePluginGCEUnregister | false | Alpha | 1.21 | |
InTreePluginOpenStackUnregister | false | Alpha | 1.21 | |
InTreePluginPortworxUnregister | false | Alpha | 1.23 | |
InTreePluginRBDUnregister | false | Alpha | 1.23 | |
InTreePluginvSphereUnregister | false | Alpha | 1.21 | |
JobMutableNodeSchedulingDirectives | true | Beta | 1.23 | |
JobReadyPods | false | Alpha | 1.23 | |
JobTrackingWithFinalizers | false | Alpha | 1.22 | 1.22 |
JobTrackingWithFinalizers | true | Beta | 1.23 | |
KubeletCredentialProviders | false | Alpha | 1.20 | |
KubeletInUserNamespace | false | Alpha | 1.22 | |
KubeletPodResources | false | Alpha | 1.13 | 1.14 |
KubeletPodResources | true | Beta | 1.15 | |
KubeletPodResourcesGetAllocatable | false | Alpha | 1.21 | 1.22 |
KubeletPodResourcesGetAllocatable | false | Beta | 1.23 | |
LocalStorageCapacityIsolation | false | Alpha | 1.7 | 1.9 |
LocalStorageCapacityIsolation | true | Beta | 1.10 | |
LocalStorageCapacityIsolationFSQuotaMonitoring | false | Alpha | 1.15 | |
LogarithmicScaleDown | false | Alpha | 1.21 | 1.21 |
LogarithmicScaleDown | true | Beta | 1.22 | |
MemoryManager | false | Alpha | 1.21 | 1.21 |
MemoryManager | true | Beta | 1.22 | |
MemoryQoS | false | Alpha | 1.22 | |
MixedProtocolLBService | false | Alpha | 1.20 | |
NetworkPolicyEndPort | false | Alpha | 1.21 | 1.21 |
NetworkPolicyEndPort | true | Beta | 1.22 | |
NodeSwap | false | Alpha | 1.22 | |
NonPreemptingPriority | false | Alpha | 1.15 | 1.18 |
NonPreemptingPriority | true | Beta | 1.19 | |
OpenAPIEnums | false | Alpha | 1.23 | |
OpenAPIV3 | false | Alpha | 1.23 | |
PodAndContainerStatsFromCRI | false | Alpha | 1.23 | |
PodAffinityNamespaceSelector | false | Alpha | 1.21 | 1.21 |
PodAffinityNamespaceSelector | true | Beta | 1.22 | |
PodDeletionCost | false | Alpha | 1.21 | 1.21 |
PodDeletionCost | true | Beta | 1.22 | |
PodOverhead | false | Alpha | 1.16 | 1.17 |
PodOverhead | true | Beta | 1.18 | |
PodSecurity | false | Alpha | 1.22 | 1.22 |
PodSecurity | true | Beta | 1.23 | |
PreferNominatedNode | false | Alpha | 1.21 | 1.21 |
PreferNominatedNode | true | Beta | 1.22 | |
ProbeTerminationGracePeriod | false | Alpha | 1.21 | 1.21 |
ProbeTerminationGracePeriod | false | Beta | 1.22 | |
ProcMountType | false | Alpha | 1.12 | |
ProxyTerminatingEndpoints | false | Alpha | 1.22 | |
QOSReserved | false | Alpha | 1.11 | |
ReadWriteOncePod | false | Alpha | 1.22 | |
RecoverVolumeExpansionFailure | false | Alpha | 1.23 | |
RemainingItemCount | false | Alpha | 1.15 | 1.15 |
RemainingItemCount | true | Beta | 1.16 | |
RemoveSelfLink | false | Alpha | 1.16 | 1.19 |
RemoveSelfLink | true | Beta | 1.20 | |
RotateKubeletServerCertificate | false | Alpha | 1.7 | 1.11 |
RotateKubeletServerCertificate | true | Beta | 1.12 | |
SeccompDefault | false | Alpha | 1.22 | |
ServiceInternalTrafficPolicy | false | Alpha | 1.21 | 1.21 |
ServiceInternalTrafficPolicy | true | Beta | 1.22 | |
ServiceLBNodePortControl | false | Alpha | 1.20 | 1.21 |
ServiceLBNodePortControl | true | Beta | 1.22 | |
ServiceLoadBalancerClass | false | Alpha | 1.21 | 1.21 |
ServiceLoadBalancerClass | true | Beta | 1.22 | |
SizeMemoryBackedVolumes | false | Alpha | 1.20 | 1.21 |
SizeMemoryBackedVolumes | true | Beta | 1.22 | |
StatefulSetAutoDeletePVC | false | Alpha | 1.22 | |
StatefulSetMinReadySeconds | false | Alpha | 1.22 | 1.22 |
StatefulSetMinReadySeconds | true | Beta | 1.23 | |
StorageVersionAPI | false | Alpha | 1.20 | |
StorageVersionHash | false | Alpha | 1.14 | 1.14 |
StorageVersionHash | true | Beta | 1.15 | |
SuspendJob | false | Alpha | 1.21 | 1.21 |
SuspendJob | true | Beta | 1.22 | |
TopologyAwareHints | false | Alpha | 1.21 | 1.22 |
TopologyAwareHints | false | Beta | 1.23 | |
TopologyManager | false | Alpha | 1.16 | 1.17 |
TopologyManager | true | Beta | 1.18 | |
VolumeCapacityPriority | false | Alpha | 1.21 | - |
WinDSR | false | Alpha | 1.14 | |
WinOverlay | false | Alpha | 1.14 | 1.19 |
WinOverlay | true | Beta | 1.20 | |
WindowsHostProcessContainers | false | Alpha | 1.22 | 1.22 |
WindowsHostProcessContainers | false | Beta | 1.23 | |
Feature gates for graduated or deprecated features
Feature | Default | Stage | Since | Until |
---|---|---|---|---|
Accelerators | false | Alpha | 1.6 | 1.10 |
Accelerators | - | Deprecated | 1.11 | - |
AdvancedAuditing | false | Alpha | 1.7 | 1.7 |
AdvancedAuditing | true | Beta | 1.8 | 1.11 |
AdvancedAuditing | true | GA | 1.12 | - |
AffinityInAnnotations | false | Alpha | 1.6 | 1.7 |
AffinityInAnnotations | - | Deprecated | 1.8 | - |
AllowExtTrafficLocalEndpoints | false | Beta | 1.4 | 1.6 |
AllowExtTrafficLocalEndpoints | true | GA | 1.7 | - |
AttachVolumeLimit | false | Alpha | 1.11 | 1.11 |
AttachVolumeLimit | true | Beta | 1.12 | 1.16 |
AttachVolumeLimit | true | GA | 1.17 | - |
BalanceAttachedNodeVolumes | false | Alpha | 1.11 | 1.21 |
BalanceAttachedNodeVolumes | false | Deprecated | 1.22 | |
BlockVolume | false | Alpha | 1.9 | 1.12 |
BlockVolume | true | Beta | 1.13 | 1.17 |
BlockVolume | true | GA | 1.18 | - |
BoundServiceAccountTokenVolume | false | Alpha | 1.13 | 1.20 |
BoundServiceAccountTokenVolume | true | Beta | 1.21 | 1.21 |
BoundServiceAccountTokenVolume | true | GA | 1.22 | - |
ConfigurableFSGroupPolicy | false | Alpha | 1.18 | 1.19 |
ConfigurableFSGroupPolicy | true | Beta | 1.20 | 1.22 |
ConfigurableFSGroupPolicy | true | GA | 1.23 | |
CRIContainerLogRotation | false | Alpha | 1.10 | 1.10 |
CRIContainerLogRotation | true | Beta | 1.11 | 1.20 |
CRIContainerLogRotation | true | GA | 1.21 | - |
CSIBlockVolume | false | Alpha | 1.11 | 1.13 |
CSIBlockVolume | true | Beta | 1.14 | 1.17 |
CSIBlockVolume | true | GA | 1.18 | - |
CSIDriverRegistry | false | Alpha | 1.12 | 1.13 |
CSIDriverRegistry | true | Beta | 1.14 | 1.17 |
CSIDriverRegistry | true | GA | 1.18 | |
CSIMigrationAWSComplete | false | Alpha | 1.17 | 1.20 |
CSIMigrationAWSComplete | - | Deprecated | 1.21 | - |
CSIMigrationAzureDiskComplete | false | Alpha | 1.17 | 1.20 |
CSIMigrationAzureDiskComplete | - | Deprecated | 1.21 | - |
CSIMigrationAzureFileComplete | false | Alpha | 1.17 | 1.20 |
CSIMigrationAzureFileComplete | - | Deprecated | 1.21 | - |
CSIMigrationGCEComplete | false | Alpha | 1.17 | 1.20 |
CSIMigrationGCEComplete | - | Deprecated | 1.21 | - |
CSIMigrationOpenStackComplete | false | Alpha | 1.17 | 1.20 |
CSIMigrationOpenStackComplete | - | Deprecated | 1.21 | - |
CSIMigrationvSphereComplete | false | Beta | 1.19 | 1.21 |
CSIMigrationvSphereComplete | - | Deprecated | 1.22 | - |
CSINodeInfo | false | Alpha | 1.12 | 1.13 |
CSINodeInfo | true | Beta | 1.14 | 1.16 |
CSINodeInfo | true | GA | 1.17 | |
CSIPersistentVolume | false | Alpha | 1.9 | 1.9 |
CSIPersistentVolume | true | Beta | 1.10 | 1.12 |
CSIPersistentVolume | true | GA | 1.13 | - |
CSIServiceAccountToken | false | Alpha | 1.20 | 1.20 |
CSIServiceAccountToken | true | Beta | 1.21 | 1.21 |
CSIServiceAccountToken | true | GA | 1.22 | |
CSIVolumeFSGroupPolicy | false | Alpha | 1.19 | 1.19 |
CSIVolumeFSGroupPolicy | true | Beta | 1.20 | 1.22 |
CSIVolumeFSGroupPolicy | true | GA | 1.23 | |
CronJobControllerV2 | false | Alpha | 1.20 | 1.20 |
CronJobControllerV2 | true | Beta | 1.21 | 1.21 |
CronJobControllerV2 | true | GA | 1.22 | - |
CustomPodDNS | false | Alpha | 1.9 | 1.9 |
CustomPodDNS | true | Beta | 1.10 | 1.13 |
CustomPodDNS | true | GA | 1.14 | - |
CustomResourceDefaulting | false | Alpha | 1.15 | 1.15 |
CustomResourceDefaulting | true | Beta | 1.16 | 1.16 |
CustomResourceDefaulting | true | GA | 1.17 | - |
CustomResourcePublishOpenAPI | false | Alpha | 1.14 | 1.14 |
CustomResourcePublishOpenAPI | true | Beta | 1.15 | 1.15 |
CustomResourcePublishOpenAPI | true | GA | 1.16 | - |
CustomResourceSubresources | false | Alpha | 1.10 | 1.10 |
CustomResourceSubresources | true | Beta | 1.11 | 1.15 |
CustomResourceSubresources | true | GA | 1.16 | - |
CustomResourceValidation | false | Alpha | 1.8 | 1.8 |
CustomResourceValidation | true | Beta | 1.9 | 1.15 |
CustomResourceValidation | true | GA | 1.16 | - |
CustomResourceWebhookConversion | false | Alpha | 1.13 | 1.14 |
CustomResourceWebhookConversion | true | Beta | 1.15 | 1.15 |
CustomResourceWebhookConversion | true | GA | 1.16 | - |
DryRun | false | Alpha | 1.12 | 1.12 |
DryRun | true | Beta | 1.13 | 1.18 |
DryRun | true | GA | 1.19 | - |
DynamicAuditing | false | Alpha | 1.13 | 1.18 |
DynamicAuditing | - | Deprecated | 1.19 | - |
DynamicKubeletConfig | false | Alpha | 1.4 | 1.10 |
DynamicKubeletConfig | true | Beta | 1.11 | 1.21 |
DynamicKubeletConfig | false | Deprecated | 1.22 | - |
DynamicProvisioningScheduling | false | Alpha | 1.11 | 1.11 |
DynamicProvisioningScheduling | - | Deprecated | 1.12 | - |
DynamicVolumeProvisioning | true | Alpha | 1.3 | 1.7 |
DynamicVolumeProvisioning | true | GA | 1.8 | - |
EnableAggregatedDiscoveryTimeout | true | Deprecated | 1.16 | - |
EnableEquivalenceClassCache | false | Alpha | 1.8 | 1.14 |
EnableEquivalenceClassCache | - | Deprecated | 1.15 | - |
EndpointSlice | false | Alpha | 1.16 | 1.16 |
EndpointSlice | false | Beta | 1.17 | 1.17 |
EndpointSlice | true | Beta | 1.18 | 1.20 |
EndpointSlice | true | GA | 1.21 | - |
EndpointSliceNodeName | false | Alpha | 1.20 | 1.20 |
EndpointSliceNodeName | true | GA | 1.21 | - |
EndpointSliceProxying | false | Alpha | 1.18 | 1.18 |
EndpointSliceProxying | true | Beta | 1.19 | 1.21 |
EndpointSliceProxying | true | GA | 1.22 | - |
EvenPodsSpread | false | Alpha | 1.16 | 1.17 |
EvenPodsSpread | true | Beta | 1.18 | 1.18 |
EvenPodsSpread | true | GA | 1.19 | - |
ExecProbeTimeout | true | GA | 1.20 | - |
ExperimentalCriticalPodAnnotation | false | Alpha | 1.5 | 1.12 |
ExperimentalCriticalPodAnnotation | false | Deprecated | 1.13 | - |
ExternalPolicyForExternalIP | true | GA | 1.18 | - |
GCERegionalPersistentDisk | true | Beta | 1.10 | 1.12 |
GCERegionalPersistentDisk | true | GA | 1.13 | - |
GenericEphemeralVolume | false | Alpha | 1.19 | 1.20 |
GenericEphemeralVolume | true | Beta | 1.21 | 1.22 |
GenericEphemeralVolume | true | GA | 1.23 | - |
HugePageStorageMediumSize | false | Alpha | 1.18 | 1.18 |
HugePageStorageMediumSize | true | Beta | 1.19 | 1.21 |
HugePageStorageMediumSize | true | GA | 1.22 | - |
HugePages | false | Alpha | 1.8 | 1.9 |
HugePages | true | Beta | 1.10 | 1.13 |
HugePages | true | GA | 1.14 | - |
HyperVContainer | false | Alpha | 1.10 | 1.19 |
HyperVContainer | false | Deprecated | 1.20 | - |
ImmutableEphemeralVolumes | false | Alpha | 1.18 | 1.18 |
ImmutableEphemeralVolumes | true | Beta | 1.19 | 1.20 |
ImmutableEphemeralVolumes | true | GA | 1.21 | |
IngressClassNamespacedParams | false | Alpha | 1.21 | 1.21 |
IngressClassNamespacedParams | true | Beta | 1.22 | 1.22 |
IngressClassNamespacedParams | true | GA | 1.23 | - |
Initializers | false | Alpha | 1.7 | 1.13 |
Initializers | - | Deprecated | 1.14 | - |
IPv6DualStack | false | Alpha | 1.15 | 1.20 |
IPv6DualStack | true | Beta | 1.21 | 1.22 |
IPv6DualStack | true | GA | 1.23 | - |
KubeletConfigFile | false | Alpha | 1.8 | 1.9 |
KubeletConfigFile | - | Deprecated | 1.10 | - |
KubeletPluginsWatcher | false | Alpha | 1.11 | 1.11 |
KubeletPluginsWatcher | true | Beta | 1.12 | 1.12 |
KubeletPluginsWatcher | true | GA | 1.13 | - |
LegacyNodeRoleBehavior | false | Alpha | 1.16 | 1.18 |
LegacyNodeRoleBehavior | true | Beta | 1.19 | 1.20 |
LegacyNodeRoleBehavior | false | GA | 1.21 | - |
MountContainers | false | Alpha | 1.9 | 1.16 |
MountContainers | false | Deprecated | 1.17 | - |
MountPropagation | false | Alpha | 1.8 | 1.9 |
MountPropagation | true | Beta | 1.10 | 1.11 |
MountPropagation | true | GA | 1.12 | - |
NodeDisruptionExclusion | false | Alpha | 1.16 | 1.18 |
NodeDisruptionExclusion | true | Beta | 1.19 | 1.20 |
NodeDisruptionExclusion | true | GA | 1.21 | - |
NodeLease | false | Alpha | 1.12 | 1.13 |
NodeLease | true | Beta | 1.14 | 1.16 |
NodeLease | true | GA | 1.17 | - |
NamespaceDefaultLabelName | true | Beta | 1.21 | 1.21 |
NamespaceDefaultLabelName | true | GA | 1.22 | - |
PVCProtection | false | Alpha | 1.9 | 1.9 |
PVCProtection | - | Deprecated | 1.10 | - |
PersistentLocalVolumes | false | Alpha | 1.7 | 1.9 |
PersistentLocalVolumes | true | Beta | 1.10 | 1.13 |
PersistentLocalVolumes | true | GA | 1.14 | - |
PodDisruptionBudget | false | Alpha | 1.3 | 1.4 |
PodDisruptionBudget | true | Beta | 1.5 | 1.20 |
PodDisruptionBudget | true | GA | 1.21 | - |
PodPriority | false | Alpha | 1.8 | 1.10 |
PodPriority | true | Beta | 1.11 | 1.13 |
PodPriority | true | GA | 1.14 | - |
PodReadinessGates | false | Alpha | 1.11 | 1.11 |
PodReadinessGates | true | Beta | 1.12 | 1.13 |
PodReadinessGates | true | GA | 1.14 | - |
PodShareProcessNamespace | false | Alpha | 1.10 | 1.11 |
PodShareProcessNamespace | true | Beta | 1.12 | 1.16 |
PodShareProcessNamespace | true | GA | 1.17 | - |
RequestManagement | false | Alpha | 1.15 | 1.16 |
RequestManagement | - | Deprecated | 1.17 | - |
ResourceLimitsPriorityFunction | false | Alpha | 1.9 | 1.18 |
ResourceLimitsPriorityFunction | - | Deprecated | 1.19 | - |
ResourceQuotaScopeSelectors | false | Alpha | 1.11 | 1.11 |
ResourceQuotaScopeSelectors | true | Beta | 1.12 | 1.16 |
ResourceQuotaScopeSelectors | true | GA | 1.17 | - |
RootCAConfigMap | false | Alpha | 1.13 | 1.19 |
RootCAConfigMap | true | Beta | 1.20 | 1.20 |
RootCAConfigMap | true | GA | 1.21 | - |
RotateKubeletClientCertificate | true | Beta | 1.8 | 1.18 |
RotateKubeletClientCertificate | true | GA | 1.19 | - |
RunAsGroup | true | Beta | 1.14 | 1.20 |
RunAsGroup | true | GA | 1.21 | - |
RuntimeClass | false | Alpha | 1.12 | 1.13 |
RuntimeClass | true | Beta | 1.14 | 1.19 |
RuntimeClass | true | GA | 1.20 | - |
SCTPSupport | false | Alpha | 1.12 | 1.18 |
SCTPSupport | true | Beta | 1.19 | 1.19 |
SCTPSupport | true | GA | 1.20 | - |
ScheduleDaemonSetPods | false | Alpha | 1.11 | 1.11 |
ScheduleDaemonSetPods | true | Beta | 1.12 | 1.16 |
ScheduleDaemonSetPods | true | GA | 1.17 | - |
SelectorIndex | false | Alpha | 1.18 | 1.18 |
SelectorIndex | true | Beta | 1.19 | 1.19 |
SelectorIndex | true | GA | 1.20 | - |
ServerSideApply | false | Alpha | 1.14 | 1.15 |
ServerSideApply | true | Beta | 1.16 | 1.21 |
ServerSideApply | true | GA | 1.22 | - |
ServiceAccountIssuerDiscovery | false | Alpha | 1.18 | 1.19 |
ServiceAccountIssuerDiscovery | true | Beta | 1.20 | 1.20 |
ServiceAccountIssuerDiscovery | true | GA | 1.21 | - |
ServiceAppProtocol | false | Alpha | 1.18 | 1.18 |
ServiceAppProtocol | true | Beta | 1.19 | 1.19 |
ServiceAppProtocol | true | GA | 1.20 | - |
ServiceLoadBalancerFinalizer | false | Alpha | 1.15 | 1.15 |
ServiceLoadBalancerFinalizer | true | Beta | 1.16 | 1.16 |
ServiceLoadBalancerFinalizer | true | GA | 1.17 | - |
ServiceNodeExclusion | false | Alpha | 1.8 | 1.18 |
ServiceNodeExclusion | true | Beta | 1.19 | 1.20 |
ServiceNodeExclusion | true | GA | 1.21 | - |
ServiceTopology | false | Alpha | 1.17 | 1.19 |
ServiceTopology | false | Deprecated | 1.20 | - |
SetHostnameAsFQDN | false | Alpha | 1.19 | 1.19 |
SetHostnameAsFQDN | true | Beta | 1.20 | 1.21 |
SetHostnameAsFQDN | true | GA | 1.22 | - |
StartupProbe | false | Alpha | 1.16 | 1.17 |
StartupProbe | true | Beta | 1.18 | 1.19 |
StartupProbe | true | GA | 1.20 | - |
StorageObjectInUseProtection | true | Beta | 1.10 | 1.10 |
StorageObjectInUseProtection | true | GA | 1.11 | - |
StreamingProxyRedirects | false | Beta | 1.5 | 1.5 |
StreamingProxyRedirects | true | Beta | 1.6 | 1.17 |
StreamingProxyRedirects | true | Deprecated | 1.18 | 1.21 |
StreamingProxyRedirects | false | Deprecated | 1.22 | - |
SupportIPVSProxyMode | false | Alpha | 1.8 | 1.8 |
SupportIPVSProxyMode | false | Beta | 1.9 | 1.9 |
SupportIPVSProxyMode | true | Beta | 1.10 | 1.10 |
SupportIPVSProxyMode | true | GA | 1.11 | - |
SupportNodePidsLimit | false | Alpha | 1.14 | 1.14 |
SupportNodePidsLimit | true | Beta | 1.15 | 1.19 |
SupportNodePidsLimit | true | GA | 1.20 | - |
SupportPodPidsLimit | false | Alpha | 1.10 | 1.13 |
SupportPodPidsLimit | true | Beta | 1.14 | 1.19 |
SupportPodPidsLimit | true | GA | 1.20 | - |
Sysctls | true | Beta | 1.11 | 1.20 |
Sysctls | true | GA | 1.21 | |
TTLAfterFinished | false | Alpha | 1.12 | 1.20 |
TTLAfterFinished | true | Beta | 1.21 | 1.22 |
TTLAfterFinished | true | GA | 1.23 | - |
TaintBasedEvictions | false | Alpha | 1.6 | 1.12 |
TaintBasedEvictions | true | Beta | 1.13 | 1.17 |
TaintBasedEvictions | true | GA | 1.18 | - |
TaintNodesByCondition | false | Alpha | 1.8 | 1.11 |
TaintNodesByCondition | true | Beta | 1.12 | 1.16 |
TaintNodesByCondition | true | GA | 1.17 | - |
TokenRequest | false | Alpha | 1.10 | 1.11 |
TokenRequest | true | Beta | 1.12 | 1.19 |
TokenRequest | true | GA | 1.20 | - |
TokenRequestProjection | false | Alpha | 1.11 | 1.11 |
TokenRequestProjection | true | Beta | 1.12 | 1.19 |
TokenRequestProjection | true | GA | 1.20 | - |
ValidateProxyRedirects | false | Alpha | 1.12 | 1.13 |
ValidateProxyRedirects | true | Beta | 1.14 | 1.21 |
ValidateProxyRedirects | true | Deprecated | 1.22 | - |
VolumePVCDataSource | false | Alpha | 1.15 | 1.15 |
VolumePVCDataSource | true | Beta | 1.16 | 1.17 |
VolumePVCDataSource | true | GA | 1.18 | - |
VolumeScheduling | false | Alpha | 1.9 | 1.9 |
VolumeScheduling | true | Beta | 1.10 | 1.12 |
VolumeScheduling | true | GA | 1.13 | - |
VolumeSnapshotDataSource | false | Alpha | 1.12 | 1.16 |
VolumeSnapshotDataSource | true | Beta | 1.17 | 1.19 |
VolumeSnapshotDataSource | true | GA | 1.20 | - |
VolumeSubpath | true | GA | 1.10 | - |
VolumeSubpathEnvExpansion | false | Alpha | 1.14 | 1.14 |
VolumeSubpathEnvExpansion | true | Beta | 1.15 | 1.16 |
VolumeSubpathEnvExpansion | true | GA | 1.17 | - |
WarningHeaders | true | Beta | 1.19 | 1.21 |
WarningHeaders | true | GA | 1.22 | - |
WatchBookmark | false | Alpha | 1.15 | 1.15 |
WatchBookmark | true | Beta | 1.16 | 1.16 |
WatchBookmark | true | GA | 1.17 | - |
WindowsEndpointSliceProxying | false | Alpha | 1.19 | 1.20 |
WindowsEndpointSliceProxying | true | Beta | 1.21 | 1.21 |
WindowsEndpointSliceProxying | true | GA | 1.22 | - |
WindowsGMSA | false | Alpha | 1.14 | 1.15 |
WindowsGMSA | true | Beta | 1.16 | 1.17 |
WindowsGMSA | true | GA | 1.18 | - |
WindowsRunAsUserName | false | Alpha | 1.16 | 1.16 |
WindowsRunAsUserName | true | Beta | 1.17 | 1.17 |
WindowsRunAsUserName | true | GA | 1.18 | - |
Using a feature
Feature stages
A feature can be in Alpha, Beta or GA stage. An Alpha feature means:
- Disabled by default.
- Might be buggy. Enabling the feature may expose bugs.
- Support for the feature may be dropped at any time without notice.
- The API may change in incompatible ways in a later software release without notice.
- Recommended for use only in short-lived testing clusters, due to increased risk of bugs and lack of long-term support.
A Beta feature means:
- Enabled by default.
- The feature is well tested. Enabling the feature is considered safe.
- Support for the overall feature will not be dropped, though details may change.
- The schema and/or semantics of objects may change in incompatible ways in a subsequent beta or stable release. When this happens, we will provide instructions for migrating to the next version. This may require deleting, editing, and re-creating API objects. The editing process may require some thought. This may require downtime for applications that rely on the feature.
- Recommended for only non-business-critical uses because of potential for incompatible changes in subsequent releases. If you have multiple clusters that can be upgraded independently, you may be able to relax this restriction.
A General Availability (GA) feature is also referred to as a stable feature. It means:
- The feature is always enabled; you cannot disable it.
- The corresponding feature gate is no longer needed.
- Stable versions of features will appear in released software for many subsequent versions.
List of feature gates
Each feature gate is designed for enabling/disabling a specific feature:
- APIListChunking: Enable the API clients to retrieve (LIST or GET) resources from API server in chunks.
- APIPriorityAndFairness: Enable managing request concurrency with prioritization and fairness at each server. (Renamed from RequestManagement)
- APIResponseCompression: Compress the API responses for LIST or GET requests.
- APIServerIdentity: Assign each API server an ID in a cluster.
- APIServerTracing: Add support for distributed tracing in the API server.
- Accelerators: Provided an early form of plugin to enable Nvidia GPU support when using Docker Engine; no longer available. See Device Plugins for an alternative.
- AdvancedAuditing: Enable advanced auditing.
- AffinityInAnnotations: Enable setting Pod affinity or anti-affinity.
- AllowExtTrafficLocalEndpoints: Enable a service to route external requests to node local endpoints.
- AllowInsecureBackendProxy: Enable the users to skip TLS verification of kubelets on Pod log requests.
- AnyVolumeDataSource: Enable use of any custom resource as the DataSource of a PVC.
- AppArmor: Enable use of AppArmor mandatory access control for Pods running on Linux nodes. See AppArmor Tutorial for more details.
- AttachVolumeLimit: Enable volume plugins to report limits on the number of volumes that can be attached to a node. See dynamic volume limits for more details.
- BalanceAttachedNodeVolumes: Include volume count on node to be considered for balanced resource allocation while scheduling. A node which has closer CPU, memory utilization, and volume count is favored by the scheduler while making decisions.
- BlockVolume: Enable the definition and consumption of raw block devices in Pods. See Raw Block Volume Support for more details.
- BoundServiceAccountTokenVolume: Migrate ServiceAccount volumes to use a projected volume consisting of a ServiceAccountTokenVolumeProjection. Cluster admins can use the metric serviceaccount_stale_tokens_total to monitor workloads that are depending on the extended tokens. If there are no such workloads, turn off extended tokens by starting kube-apiserver with the flag --service-account-extend-token-expiration=false. Check Bound Service Account Tokens for more details.
- ControllerManagerLeaderMigration: Enables Leader Migration for kube-controller-manager and cloud-controller-manager, which allows a cluster operator to live migrate controllers from the kube-controller-manager into an external controller-manager (e.g. the cloud-controller-manager) in an HA cluster without downtime.
- CPUManager: Enable container level CPU affinity support; see CPU Management Policies.
- CPUManagerPolicyAlphaOptions: Allows fine-tuning of experimental, Alpha-quality CPUManager policy options. This feature gate guards a group of CPUManager options whose quality level is alpha. This feature gate will never graduate to beta or stable.
- CPUManagerPolicyBetaOptions: Allows fine-tuning of experimental, Beta-quality CPUManager policy options. This feature gate guards a group of CPUManager options whose quality level is beta. This feature gate will never graduate to stable.
- CPUManagerPolicyOptions: Allow fine-tuning of CPUManager policies.
- CRIContainerLogRotation: Enable container log rotation for the CRI container runtime. The default max size of a log file is 10MB and the default max number of log files allowed for a container is 5. These values can be configured in the kubelet config. See logging at node level for more details.
- CSIBlockVolume: Enable external CSI volume drivers to support block storage. See csi raw block volume support for more details.
- CSIDriverRegistry: Enable all logic related to the CSIDriver API object in csi.storage.k8s.io.
- CSIInlineVolume: Enable CSI Inline volumes support for pods.
- CSIMigration: Enables shims and translation logic to route volume operations from in-tree plugins to corresponding pre-installed CSI plugins.
- CSIMigrationAWS: Enables shims and translation logic to route volume operations from the AWS-EBS in-tree plugin to the EBS CSI plugin. Supports falling back to the in-tree EBS plugin for mount operations on nodes that have the feature disabled or that do not have the EBS CSI plugin installed and configured. Does not support falling back for provision operations; for those, the CSI plugin must be installed and configured.
- CSIMigrationAWSComplete: Stops registering the EBS in-tree plugin in kubelet and volume controllers and enables shims and translation logic to route volume operations from the AWS-EBS in-tree plugin to the EBS CSI plugin. Requires the CSIMigration and CSIMigrationAWS feature flags enabled and the EBS CSI plugin installed and configured on all nodes in the cluster. This flag has been deprecated in favor of the InTreePluginAWSUnregister feature flag, which prevents the registration of the in-tree EBS plugin.
- CSIMigrationAzureDisk: Enables shims and translation logic to route volume operations from the Azure-Disk in-tree plugin to the AzureDisk CSI plugin. Supports falling back to the in-tree AzureDisk plugin for mount operations on nodes that have the feature disabled or that do not have the AzureDisk CSI plugin installed and configured. Does not support falling back for provision operations; for those, the CSI plugin must be installed and configured. Requires the CSIMigration feature flag enabled.
- CSIMigrationAzureDiskComplete: Stops registering the Azure-Disk in-tree plugin in kubelet and volume controllers and enables shims and translation logic to route volume operations from the Azure-Disk in-tree plugin to the AzureDisk CSI plugin. Requires the CSIMigration and CSIMigrationAzureDisk feature flags enabled and the AzureDisk CSI plugin installed and configured on all nodes in the cluster. This flag has been deprecated in favor of the InTreePluginAzureDiskUnregister feature flag, which prevents the registration of the in-tree AzureDisk plugin.
- CSIMigrationAzureFile: Enables shims and translation logic to route volume operations from the Azure-File in-tree plugin to the AzureFile CSI plugin. Supports falling back to the in-tree AzureFile plugin for mount operations on nodes that have the feature disabled or that do not have the AzureFile CSI plugin installed and configured. Does not support falling back for provision operations; for those, the CSI plugin must be installed and configured. Requires the CSIMigration feature flag enabled.
- CSIMigrationAzureFileComplete: Stops registering the Azure-File in-tree plugin in kubelet and volume controllers and enables shims and translation logic to route volume operations from the Azure-File in-tree plugin to the AzureFile CSI plugin. Requires the CSIMigration and CSIMigrationAzureFile feature flags enabled and the AzureFile CSI plugin installed and configured on all nodes in the cluster. This flag has been deprecated in favor of the InTreePluginAzureFileUnregister feature flag, which prevents the registration of the in-tree AzureFile plugin.
- CSIMigrationGCE: Enables shims and translation logic to route volume operations from the GCE-PD in-tree plugin to the PD CSI plugin. Supports falling back to the in-tree GCE plugin for mount operations on nodes that have the feature disabled or that do not have the PD CSI plugin installed and configured. Does not support falling back for provision operations; for those, the CSI plugin must be installed and configured. Requires the CSIMigration feature flag enabled.
- CSIMigrationGCEComplete: Stops registering the GCE-PD in-tree plugin in kubelet and volume controllers and enables shims and translation logic to route volume operations from the GCE-PD in-tree plugin to the PD CSI plugin. Requires the CSIMigration and CSIMigrationGCE feature flags enabled and the PD CSI plugin installed and configured on all nodes in the cluster. This flag has been deprecated in favor of the InTreePluginGCEUnregister feature flag, which prevents the registration of the in-tree GCE PD plugin.
- CSIMigrationOpenStack: Enables shims and translation logic to route volume operations from the Cinder in-tree plugin to the Cinder CSI plugin. Supports falling back to the in-tree Cinder plugin for mount operations on nodes that have the feature disabled or that do not have the Cinder CSI plugin installed and configured. Does not support falling back for provision operations; for those, the CSI plugin must be installed and configured. Requires the CSIMigration feature flag enabled.
- CSIMigrationOpenStackComplete: Stops registering the Cinder in-tree plugin in kubelet and volume controllers and enables shims and translation logic to route volume operations from the Cinder in-tree plugin to the Cinder CSI plugin. Requires the CSIMigration and CSIMigrationOpenStack feature flags enabled and the Cinder CSI plugin installed and configured on all nodes in the cluster. This flag has been deprecated in favor of the InTreePluginOpenStackUnregister feature flag, which prevents the registration of the in-tree OpenStack cinder plugin.
- csiMigrationRBD: Enables shims and translation logic to route volume operations from the RBD in-tree plugin to the Ceph RBD CSI plugin. Requires the CSIMigration and csiMigrationRBD feature flags enabled and the Ceph CSI plugin installed and configured in the cluster. This flag has been deprecated in favor of the InTreePluginRBDUnregister feature flag, which prevents the registration of the in-tree RBD plugin.
- CSIMigrationvSphere: Enables shims and translation logic to route volume operations from the vSphere in-tree plugin to the vSphere CSI plugin. Supports falling back to the in-tree vSphere plugin for mount operations on nodes that have the feature disabled or that do not have the vSphere CSI plugin installed and configured. Does not support falling back for provision operations; for those, the CSI plugin must be installed and configured. Requires the CSIMigration feature flag enabled.
- CSIMigrationvSphereComplete: Stops registering the vSphere in-tree plugin in kubelet and volume controllers and enables shims and translation logic to route volume operations from the vSphere in-tree plugin to the vSphere CSI plugin. Requires the CSIMigration and CSIMigrationvSphere feature flags enabled and the vSphere CSI plugin installed and configured on all nodes in the cluster. This flag has been deprecated in favor of the InTreePluginvSphereUnregister feature flag, which prevents the registration of the in-tree vSphere plugin.
- CSIMigrationPortworx: Enables shims and translation logic to route volume operations from the Portworx in-tree plugin to the Portworx CSI plugin. Requires the Portworx CSI driver to be installed and configured in the cluster.
- CSINodeInfo: Enable all logic related to the CSINodeInfo API object in csi.storage.k8s.io.
- CSIPersistentVolume: Enable discovering and mounting volumes provisioned through a CSI (Container Storage Interface) compatible volume plugin.
- CSIServiceAccountToken: Enable CSI drivers to receive the pods' service account token that they mount volumes for. See Token Requests.
- CSIStorageCapacity: Enables CSI drivers to publish storage capacity information and the Kubernetes scheduler to use that information when scheduling pods. See Storage Capacity. Check the csi volume type documentation for more details.
- CSIVolumeFSGroupPolicy: Allows CSIDrivers to use the fsGroupPolicy field. This field controls whether volumes created by a CSIDriver support volume ownership and permission modifications when these volumes are mounted.
- CSIVolumeHealth: Enable support for CSI volume health monitoring on node.
- CSRDuration: Allows clients to request a duration for certificates issued via the Kubernetes CSR API.
- ConfigurableFSGroupPolicy: Allows user to configure the volume permission change policy for fsGroups when mounting a volume in a Pod. See Configure volume permission and ownership change policy for Pods for more details.
- ControllerManagerLeaderMigration: Enables leader migration for kube-controller-manager and cloud-controller-manager.
- CronJobControllerV2: Use an alternative implementation of the CronJob controller. Otherwise, version 1 of the same controller is selected.
- CustomCPUCFSQuotaPeriod: Enable nodes to change cpuCFSQuotaPeriod in kubelet config.
- CustomResourceValidationExpressions: Enable expression language validation in CRD which will validate customer resource based on validation rules written in the x-kubernetes-validations extension.
- CustomPodDNS: Enable customizing the DNS settings for a Pod using its dnsConfig property. Check Pod's DNS Config for more details.
- CustomResourceDefaulting: Enable CRD support for default values in OpenAPI v3 validation schemas.
- CustomResourcePublishOpenAPI: Enables publishing of CRD OpenAPI specs.
- CustomResourceSubresources: Enable /status and /scale subresources on resources created from CustomResourceDefinition.
- CustomResourceValidation: Enable schema based validation on resources created from CustomResourceDefinition.
- CustomResourceWebhookConversion: Enable webhook-based conversion on resources created from CustomResourceDefinition.
- DaemonSetUpdateSurge: Enables the DaemonSet workloads to maintain availability during update per node. See Perform a Rolling Update on a DaemonSet.
- DefaultPodTopologySpread: Enables the use of the PodTopologySpread scheduling plugin to do default spreading.
- DelegateFSGroupToCSIDriver: If supported by the CSI driver, delegates the role of applying fsGroup from a Pod's securityContext to the driver by passing fsGroup through the NodeStageVolume and NodePublishVolume CSI calls.
- DevicePlugins: Enable the device-plugins based resource provisioning on nodes.
- DisableAcceleratorUsageMetrics: Disable accelerator metrics collected by the kubelet.
- DisableCloudProviders: Disables any functionality in kube-apiserver, kube-controller-manager and kubelet related to the --cloud-provider component flag.
- DisableKubeletCloudCredentialProviders: Disable the in-tree functionality in kubelet to authenticate to a cloud provider container registry for image pull credentials.
- DownwardAPIHugePages: Enables usage of hugepages in the downward API.
- DryRun: Enable server-side dry run requests so that validation, merging, and mutation can be tested without committing.
- DynamicAuditing: Used to enable dynamic auditing before v1.19.
- DynamicKubeletConfig: Enable the dynamic configuration of kubelet. See Reconfigure kubelet.
- DynamicProvisioningScheduling: Extend the default scheduler to be aware of volume topology and handle PV provisioning. This feature is superseded completely by the VolumeScheduling feature in v1.12.
- DynamicVolumeProvisioning: Enable the dynamic provisioning of persistent volumes to Pods.
- EfficientWatchResumption: Allows for storage-originated bookmark (progress notify) events to be delivered to the users. This is only applied to watch operations.
- EnableAggregatedDiscoveryTimeout: Enable the five second timeout on aggregated discovery calls.
- EnableEquivalenceClassCache: Enable the scheduler to cache equivalence of nodes when scheduling Pods.
- EndpointSlice: Enables EndpointSlices for more scalable and extensible network endpoints. See Enabling EndpointSlices.
- EndpointSliceNodeName: Enables the EndpointSlice nodeName field.
- EndpointSliceProxying: When enabled, kube-proxy running on Linux will use EndpointSlices as the primary data source instead of Endpoints, enabling scalability and performance improvements. See Enabling Endpoint Slices.
- EndpointSliceTerminatingCondition: Enables the EndpointSlice terminating and serving condition fields.
- EphemeralContainers: Enable the ability to add ephemeral containers to running pods.
- EvenPodsSpread: Enable pods to be scheduled evenly across topology domains. See Pod Topology Spread Constraints.
- ExecProbeTimeout: Ensure kubelet respects exec probe timeouts. This feature gate exists in case any of your existing workloads depend on a now-corrected fault where Kubernetes ignored exec probe timeouts. See readiness probes.
- ExpandCSIVolumes: Enable the expanding of CSI volumes.
- ExpandedDNSConfig: Enable kubelet and kube-apiserver to allow more DNS search paths and a longer list of DNS search paths. This feature requires container runtime support (Containerd: v1.5.6 or higher, CRI-O: v1.22 or higher). See Expanded DNS Configuration.
- ExpandInUsePersistentVolumes: Enable expanding in-use PVCs. See Resizing an in-use PersistentVolumeClaim.
- ExpandPersistentVolumes: Enable the expanding of persistent volumes. See Expanding Persistent Volumes Claims.
- ExperimentalCriticalPodAnnotation: Enable annotating specific pods as critical so that their scheduling is guaranteed. This feature is deprecated by Pod Priority and Preemption as of v1.13.
- ExperimentalHostUserNamespaceDefaulting: Enabling the defaulting user namespace to host. This is for containers that are using other host namespaces, host mounts, or containers that are privileged or using specific non-namespaced capabilities (e.g. MKNODE, SYS_MODULE etc.). This should only be enabled if user namespace remapping is enabled in the Docker daemon.
- ExternalPolicyForExternalIP: Fix a bug where ExternalTrafficPolicy is not applied to Service ExternalIPs.
- GCERegionalPersistentDisk: Enable the regional PD feature on GCE.
- GenericEphemeralVolume: Enables ephemeral, inline volumes that support all features of normal volumes (can be provided by third-party storage vendors, storage capacity tracking, restore from snapshot, etc.). See Ephemeral Volumes.
- GracefulNodeShutdown: Enables support for graceful shutdown in kubelet. During a system shutdown, kubelet will attempt to detect the shutdown event and gracefully terminate pods running on the node. See Graceful Node Shutdown for more details.
- GracefulNodeShutdownBasedOnPodPriority: Enables the kubelet to check Pod priorities when shutting down a node gracefully.
- GRPCContainerProbe: Enables the gRPC probe method for {Liveness,Readiness,Startup}Probe. See Configure Liveness, Readiness and Startup Probes.
- HonorPVReclaimPolicy: Honor the persistent volume reclaim policy when it is Delete, irrespective of PV-PVC deletion ordering.
- HPAContainerMetrics: Enable the HorizontalPodAutoscaler to scale based on metrics from individual containers in target pods.
- HPAScaleToZero: Enables setting minReplicas to 0 for HorizontalPodAutoscaler resources when using custom or external metrics.
- HugePages: Enable the allocation and consumption of pre-allocated huge pages.
- HugePageStorageMediumSize: Enable support for multiple sizes of pre-allocated huge pages.
- HyperVContainer: Enable Hyper-V isolation for Windows containers.
- IdentifyPodOS: Allows the Pod OS field to be specified. This helps in identifying the OS of the pod authoritatively during the API server admission time. In Kubernetes 1.23, the allowed values for pod.spec.os.name are windows and linux.
- ImmutableEphemeralVolumes: Allows for marking individual Secrets and ConfigMaps as immutable for better safety and performance.
- IndexedJob: Allows the Job controller to manage Pod completions per completion index.
- IngressClassNamespacedParams: Allow namespace-scoped parameters reference in the IngressClass resource. This feature adds two fields, Scope and Namespace, to IngressClass.spec.parameters.
- Initializers: Allow asynchronous coordination of object creation using the Initializers admission plugin.
- InTreePluginAWSUnregister: Stops registering the aws-ebs in-tree plugin in kubelet and volume controllers.
- InTreePluginAzureDiskUnregister: Stops registering the azuredisk in-tree plugin in kubelet and volume controllers.
- InTreePluginAzureFileUnregister: Stops registering the azurefile in-tree plugin in kubelet and volume controllers.
- InTreePluginGCEUnregister: Stops registering the gce-pd in-tree plugin in kubelet and volume controllers.
- InTreePluginOpenStackUnregister: Stops registering the OpenStack cinder in-tree plugin in kubelet and volume controllers.
- InTreePluginPortworxUnregister: Stops registering the Portworx in-tree plugin in kubelet and volume controllers.
- InTreePluginRBDUnregister: Stops registering the RBD in-tree plugin in kubelet and volume controllers.
- InTreePluginvSphereUnregister: Stops registering the vSphere in-tree plugin in kubelet and volume controllers.
- IPv6DualStack: Enable dual stack support for IPv6.
- JobMutableNodeSchedulingDirectives: Allows updating node scheduling directives in the pod template of a Job.
- JobReadyPods: Enables tracking the number of Pods that have a Ready condition. The count of Ready pods is recorded in the status of a Job.
- JobTrackingWithFinalizers: Enables tracking Job completions without relying on Pods remaining in the cluster indefinitely. The Job controller uses Pod finalizers and a field in the Job status to keep track of the finished Pods to count towards completion.
- KubeletConfigFile: Enable loading kubelet configuration from a file specified using a config file. See setting kubelet parameters via a config file for more details.
- KubeletCredentialProviders: Enable kubelet exec credential providers for image pull credentials.
- KubeletInUserNamespace: Enables support for running kubelet in a user namespace. See Running Kubernetes Node Components as a Non-root User.
- KubeletPluginsWatcher: Enable probe-based plugin watcher utility to enable kubelet to discover plugins such as CSI volume drivers.
- KubeletPodResources: Enable the kubelet's pod resources gRPC endpoint. See Support Device Monitoring for more details.
- KubeletPodResourcesGetAllocatable: Enable the kubelet's pod resources GetAllocatableResources functionality. This API augments the resource allocation reporting with information about the allocatable resources, enabling clients to properly track the free compute resources on a node.
- LegacyNodeRoleBehavior: When disabled, legacy behavior in service load balancers and node disruption will ignore the node-role.kubernetes.io/master label in favor of the feature-specific labels provided by NodeDisruptionExclusion and ServiceNodeExclusion.
- LocalStorageCapacityIsolation: Enable the consumption of local ephemeral storage and also the sizeLimit property of an emptyDir volume.
- LocalStorageCapacityIsolationFSQuotaMonitoring: When LocalStorageCapacityIsolation is enabled for local ephemeral storage and the backing filesystem for emptyDir volumes supports project quotas and they are enabled, use project quotas to monitor emptyDir volume storage consumption rather than filesystem walk for better performance and accuracy.
- LogarithmicScaleDown: Enable semi-random selection of pods to evict on controller scaledown based on logarithmic bucketing of pod timestamps.
- MemoryManager: Allows setting memory affinity for a container based on NUMA topology.
- MemoryQoS: Enable memory protection and usage throttle on pod / container using the cgroup v2 memory controller.
- MixedProtocolLBService: Enable using different protocols in the same LoadBalancer type Service instance.
- MountContainers: Enable using utility containers on host as the volume mounter.
- MountPropagation: Enable sharing a volume mounted by one container to other containers or pods. For more details, please see mount propagation.
- NamespaceDefaultLabelName: Configure the API Server to set an immutable label kubernetes.io/metadata.name on all namespaces, containing the namespace name.
- NetworkPolicyEndPort: Enable use of the field endPort in NetworkPolicy objects, allowing the selection of a port range instead of a single port.
- NodeDisruptionExclusion: Enable use of the Node label node.kubernetes.io/exclude-disruption which prevents nodes from being evacuated during zone failures.
- NodeLease: Enable the new Lease API to report node heartbeats, which could be used as a node health signal.
- NodeSwap: Enable the kubelet to allocate swap memory for Kubernetes workloads on a node. Must be used with KubeletConfiguration.failSwapOn set to false. For more details, please see swap memory.
- NonPreemptingPriority: Enable the preemptionPolicy field for PriorityClass and Pod.
- OpenAPIEnums: Enables populating "enum" fields of OpenAPI schemas in the spec returned from the API server.
- OpenAPIV3: Enables the API server to publish OpenAPI v3.
- PodDeletionCost: Enable the Pod Deletion Cost feature which allows users to influence ReplicaSet downscaling order.
- PersistentLocalVolumes: Enable the usage of the local volume type in Pods. Pod affinity has to be specified if requesting a local volume.
- PodAndContainerStatsFromCRI: Configure the kubelet to gather container and pod stats from the CRI container runtime rather than gathering them from cAdvisor.
- PodDisruptionBudget: Enable the PodDisruptionBudget feature.
- PodAffinityNamespaceSelector: Enable the Pod Affinity Namespace Selector and CrossNamespacePodAffinity quota scope features.
- PodOverhead: Enable the PodOverhead feature to account for pod overheads.
- PodPriority: Enable the descheduling and preemption of Pods based on their priorities.
- PodReadinessGates: Enable the setting of the PodReadinessGate field for extending Pod readiness evaluation. See Pod readiness gate for more details.
- PodSecurity: Enables the PodSecurity admission plugin.
- PodShareProcessNamespace: Enable the setting of shareProcessNamespace in a Pod for sharing a single process namespace between containers running in a pod. More details can be found in Share Process Namespace between Containers in a Pod.
- PreferNominatedNode: This flag tells the scheduler whether the nominated nodes will be checked first before looping through all the other nodes in the cluster.
- ProbeTerminationGracePeriod: Enable setting probe-level terminationGracePeriodSeconds on pods. See the enhancement proposal for more details.
- ProcMountType: Enables control over the type of proc mounts for containers by setting the procMount field of a SecurityContext.
- ProxyTerminatingEndpoints: Enable the kube-proxy to handle terminating endpoints when ExternalTrafficPolicy=Local.
- PVCProtection: Enable the prevention of a PersistentVolumeClaim (PVC) from being deleted when it is still used by any Pod.
- QOSReserved: Allows resource reservations at the QoS level, preventing pods at lower QoS levels from bursting into resources requested at higher QoS levels (memory only for now).
- ReadWriteOncePod: Enables the usage of the ReadWriteOncePod PersistentVolume access mode.
- RecoverVolumeExpansionFailure: Enables users to edit their PVCs to smaller sizes so that they can recover from previously issued volume expansion failures. See Recovering from Failure when Expanding Volumes for more details.
- RemainingItemCount: Allow the API servers to show a count of remaining items in the response to a chunking list request.
- RemoveSelfLink: Deprecates and removes selfLink from ObjectMeta and ListMeta.
- RequestManagement: Enables managing request concurrency with prioritization and fairness at each API server. Deprecated by APIPriorityAndFairness since 1.17.
- ResourceLimitsPriorityFunction: Enable a scheduler priority function that assigns a lowest possible score of 1 to a node that satisfies at least one of the input Pod's cpu and memory limits. The intent is to break ties between nodes with same scores.
- ResourceQuotaScopeSelectors: Enable resource quota scope selectors.
- RootCAConfigMap: Configure the kube-controller-manager to publish a ConfigMap named kube-root-ca.crt to every namespace. This ConfigMap contains a CA bundle used for verifying connections to the kube-apiserver. See Bound Service Account Tokens for more details.
- RotateKubeletClientCertificate: Enable the rotation of the client TLS certificate on the kubelet. See kubelet configuration for more details.
- RotateKubeletServerCertificate: Enable the rotation of the server TLS certificate on the kubelet. See kubelet configuration for more details.
- RunAsGroup: Enable control over the primary group ID set on the init processes of containers.
- RuntimeClass: Enable the RuntimeClass feature for selecting container runtime configurations.
- ScheduleDaemonSetPods: Enable DaemonSet Pods to be scheduled by the default scheduler instead of the DaemonSet controller.
- SCTPSupport: Enables the SCTP protocol value in Pod, Service, Endpoints, EndpointSlice, and NetworkPolicy definitions.
- SeccompDefault: Enables the use of RuntimeDefault as the default seccomp profile for all workloads. The seccomp profile is specified in the securityContext of a Pod and/or a Container.
- SelectorIndex: Allows label and field based indexes in the API server watch cache to accelerate list operations.
- ServerSideApply: Enables the Server Side Apply (SSA) feature on the API Server.
- ServiceAccountIssuerDiscovery: Enable OIDC discovery endpoints (issuer and JWKS URLs) for the service account issuer in the API server. See Configure Service Accounts for Pods for more details.
- ServiceAppProtocol: Enables the appProtocol field on Services and Endpoints.
- ServiceInternalTrafficPolicy: Enables the internalTrafficPolicy field on Services.
- ServiceLBNodePortControl: Enables the allocateLoadBalancerNodePorts field on Services.
- ServiceLoadBalancerClass: Enables the loadBalancerClass field on Services. See Specifying class of load balancer implementation for more details.
- ServiceLoadBalancerFinalizer: Enable finalizer protection for Service load balancers.
- ServiceNodeExclusion: Enable the exclusion of nodes from load balancers created by a cloud provider. A node is eligible for exclusion if labelled with "node.kubernetes.io/exclude-from-external-load-balancers".
- ServiceTopology: Enable service to route traffic based upon the Node topology of the cluster. See ServiceTopology for more details.
- SetHostnameAsFQDN: Enable the ability of setting the Fully Qualified Domain Name (FQDN) as the hostname of a pod. See Pod's setHostnameAsFQDN field.
- SizeMemoryBackedVolumes: Enable kubelets to determine the size limit for memory-backed volumes (mainly emptyDir
volumes).StartupProbe
: Enable the startup probe in the kubelet.StatefulSetMinReadySeconds
: AllowsminReadySeconds
to be respected by the StatefulSet controller.StorageObjectInUseProtection
: Postpone the deletion of PersistentVolume or PersistentVolumeClaim objects if they are still being used.StorageVersionAPI
: Enable the storage version API.StorageVersionHash
: Allow API servers to expose the storage version hash in the discovery.StreamingProxyRedirects
: Instructs the API server to intercept (and follow) redirects from the backend (kubelet) for streaming requests. Examples of streaming requests include theexec
,attach
andport-forward
requests.SupportIPVSProxyMode
: Enable providing in-cluster service load balancing using IPVS. See service proxies for more details.SupportNodePidsLimit
: Enable the support to limiting PIDs on the Node. The parameterpid=<number>
in the--system-reserved
and--kube-reserved
options can be specified to ensure that the specified number of process IDs will be reserved for the system as a whole and for Kubernetes system daemons respectively.SupportPodPidsLimit
: Enable the support to limiting PIDs in Pods.SuspendJob
: Enable support to suspend and resume Jobs. See the Jobs docs for more details.Sysctls
: Enable support for namespaced kernel parameters (sysctls) that can be set for each pod. See sysctls for more details.TTLAfterFinished
: Allow a TTL controller to clean up resources after they finish execution.TaintBasedEvictions
: Enable evicting pods from nodes based on taints on Nodes and tolerations on Pods. See taints and tolerations for more details.TaintNodesByCondition
: Enable automatic tainting nodes based on node conditions.TokenRequest
: Enable theTokenRequest
endpoint on service account resources.TokenRequestProjection
: Enable the injection of service account tokens into a Pod through aprojected
volume.TopologyAwareHints
: Enables topology aware routing based on topology hints in EndpointSlices. See Topology Aware Hints for more details.TopologyManager
: Enable a mechanism to coordinate fine-grained hardware resource assignments for different components in Kubernetes. See Control Topology Management Policies on a node.ValidateProxyRedirects
: This flag controls whether the API server should validate that redirects are only followed to the same host. Only used if theStreamingProxyRedirects
flag is enabled.VolumeCapacityPriority
: Enable support for prioritizing nodes in different topologies based on available PV capacity.VolumePVCDataSource
: Enable support for specifying an existing PVC as a DataSource.VolumeScheduling
: Enable volume topology aware scheduling and make the PersistentVolumeClaim (PVC) binding aware of scheduling decisions. It also enables the usage oflocal
volume type when used together with thePersistentLocalVolumes
feature gate.VolumeSnapshotDataSource
: Enable volume snapshot data source support.VolumeSubpath
: Allow mounting a subpath of a volume in a container.VolumeSubpathEnvExpansion
: EnablesubPathExpr
field for expanding environment variables into asubPath
.WarningHeaders
: Allow sending warning headers in API responses.WatchBookmark
: Enable support for watch bookmark events.WinDSR
: Allows kube-proxy to create DSR loadbalancers for Windows.WinOverlay
: Allows kube-proxy to run in overlay mode for Windows.WindowsEndpointSliceProxying
: When enabled, kube-proxy running on Windows will use EndpointSlices as the primary data source instead of Endpoints, enabling scalability and performance improvements. See Enabling Endpoint Slices.WindowsGMSA
: Enables passing of GMSA credential specs from pods to container runtimes.WindowsHostProcessContainers
: Enables support for Windows HostProcess containers.WindowsRunAsUserName
: Enable support for running applications in Windows containers with as a non-default user. See Configuring RunAsUserName for more details.
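Feature gates are set as comma-separated key=value pairs, either with a component's --feature-gates flag or, for the kubelet, through the featureGates field of its configuration file. A minimal sketch using gates from the list above (the exact gates worth setting depend on your cluster version; the pairing of NodeSwap with failSwapOn follows the NodeSwap entry above):

kube-apiserver --feature-gates=RemoveSelfLink=false,ServerSideApply=true

# Kubelet configuration file equivalent (passed with --config):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  NodeSwap: true
failSwapOn: false   # NodeSwap requires failSwapOn set to false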
What's next
- The deprecation policy for Kubernetes explains the project's approach to removing features and components.
6.11.2 - kubelet
Synopsis
The kubelet is the primary "node agent" that runs on each node. It can register the node with the apiserver using one of: the hostname; a flag to override the hostname; or specific logic for a cloud provider.
The kubelet works in terms of a PodSpec. A PodSpec is a YAML or JSON object that describes a pod. The kubelet takes a set of PodSpecs that are provided through various mechanisms (primarily through the apiserver) and ensures that the containers described in those PodSpecs are running and healthy. The kubelet doesn't manage containers which were not created by Kubernetes.
Apart from the PodSpecs provided by the apiserver, there are three ways that a container manifest can be provided to the Kubelet.
- File: Path passed as a flag on the command line. Files under this path will be monitored periodically for updates. The monitoring period is 20s by default and is configurable via a flag. (A static Pod example follows this list.)
- HTTP endpoint: HTTP endpoint passed as a parameter on the command line. This endpoint is checked every 20 seconds (also configurable with a flag).
- HTTP server: The kubelet can also listen for HTTP and respond to a simple API (underspec'd currently) to submit a new manifest.
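For example, the file mechanism is how static Pods are usually run. A minimal sketch, assuming an illustrative manifest directory of /etc/kubernetes/manifests:

# /etc/kubernetes/manifests/static-web.yaml -- a Pod the kubelet runs directly, without the scheduler
apiVersion: v1
kind: Pod
metadata:
  name: static-web
spec:
  containers:
  - name: web
    image: nginx
    ports:
    - containerPort: 80

# Point the kubelet at that directory; it rescans it every 20 seconds by default:
kubelet --pod-manifest-path=/etc/kubernetes/manifests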
kubelet [flags]
Options
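Most of the flags below are deprecated in favor of the configuration file passed with --config; the mapping from flag to configuration field is usually one-to-one. A minimal sketch of such a file (the values are illustrative; field names follow the kubelet.config.k8s.io/v1beta1 API):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
address: 0.0.0.0              # replaces --address
cgroupDriver: systemd         # replaces --cgroup-driver
clusterDomain: cluster.local  # replaces --cluster-domain
maxPods: 110                  # replaces --max-pods
evictionHard:                 # replaces --eviction-hard
  memory.available: "100Mi"
  nodefs.available: "10%"

# Then start the kubelet with:
kubelet --config=/var/lib/kubelet/config.yaml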
--add-dir-header
If true, adds the file directory to the header of the log messages. (DEPRECATED: will be removed in a future release, see https://github.com/kubernetes/enhancements/tree/master/keps/sig-instrumentation/2845-deprecate-klog-specific-flags-in-k8s-components)
--address string     Default: 0.0.0.0
The IP address for the Kubelet to serve on (set to 0.0.0.0 or :: for listening on all interfaces and IP families). (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--allowed-unsafe-sysctls strings
Comma-separated whitelist of unsafe sysctls or unsafe sysctl patterns (ending in *). Use these at your own risk. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--alsologtostderr
Log to standard error as well as files. (DEPRECATED: will be removed in a future release, see https://github.com/kubernetes/enhancements/tree/master/keps/sig-instrumentation/2845-deprecate-klog-specific-flags-in-k8s-components)
--anonymous-auth     Default: true
Enables anonymous requests to the Kubelet server. Requests that are not rejected by another authentication method are treated as anonymous requests. Anonymous requests have a username of system:anonymous, and a group name of system:unauthenticated. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--authentication-token-webhook
Use the TokenReview API to determine authentication for bearer tokens. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--authentication-token-webhook-cache-ttl duration     Default: 2m0s
The duration to cache responses from the webhook token authenticator. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--authorization-mode string     Default: AlwaysAllow
Authorization mode for the Kubelet server. Valid options are AlwaysAllow or Webhook. Webhook mode uses the SubjectAccessReview API to determine authorization. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--authorization-webhook-cache-authorized-ttl duration     Default: 5m0s
The duration to cache 'authorized' responses from the webhook authorizer. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--authorization-webhook-cache-unauthorized-ttl duration     Default: 30s
The duration to cache 'unauthorized' responses from the webhook authorizer. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--azure-container-registry-config string
Path to the file containing Azure container registry configuration information.
--bootstrap-kubeconfig string
Path to a kubeconfig file that will be used to get the client certificate for the kubelet. If the file specified by --kubeconfig does not exist, the bootstrap kubeconfig is used to request a client certificate from the API server. On success, a kubeconfig file referencing the generated client certificate and key is written to the path specified by --kubeconfig. The client certificate and key file will be stored in the directory pointed to by --cert-dir.
--cert-dir string     Default: /var/lib/kubelet/pki
The directory where the TLS certs are located. If --tls-cert-file and --tls-private-key-file are provided, this flag will be ignored.
--cgroup-driver string     Default: cgroupfs
Driver that the kubelet uses to manipulate cgroups on the host. Possible values: cgroupfs, systemd. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--cgroup-root string     Default: ''
Optional root cgroup to use for pods. This is handled by the container runtime on a best-effort basis. Default: '', which means use the container runtime default. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--cgroups-per-qos     Default: true
Enable creation of the QoS cgroup hierarchy. If true, top-level QoS and pod cgroups are created. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--client-ca-file string
If set, any request presenting a client certificate signed by one of the authorities in the client-ca-file is authenticated with an identity corresponding to the CommonName of the client certificate. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--cloud-config string
The path to the cloud provider configuration file. Empty string for no configuration file. (DEPRECATED: will be removed in 1.24 or later, in favor of removing cloud provider code from the kubelet.)
--cloud-provider string
The provider for cloud services. Set to empty string for running with no cloud provider. If set, the cloud provider determines the name of the node (consult cloud provider documentation to determine if and how the hostname is used). (DEPRECATED: will be removed in 1.24 or later, in favor of removing cloud provider code from the Kubelet.)
--cluster-dns strings
Comma-separated list of DNS server IP addresses. This value is used for the DNS servers of containers in Pods with "dnsPolicy=ClusterFirst". Note: all DNS servers appearing in the list MUST serve the same set of records, otherwise name resolution within the cluster may not work correctly. There is no guarantee as to which DNS server may be contacted for name resolution. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--cluster-domain string
Domain for this cluster. If set, kubelet will configure all containers to search this domain in addition to the host's search domains. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--cni-bin-dir string     Default: /opt/cni/bin
A comma-separated list of full paths of directories in which to search for CNI plugin binaries. This docker-specific flag only works when container-runtime is set to docker. (DEPRECATED: will be removed along with dockershim.)
--cni-cache-dir string     Default: /var/lib/cni/cache
The full path of the directory in which CNI should store cache files. This docker-specific flag only works when container-runtime is set to docker. (DEPRECATED: will be removed along with dockershim.)
--cni-conf-dir string     Default: /etc/cni/net.d
<Warning: Alpha feature> The full path of the directory in which to search for CNI config files. This docker-specific flag only works when container-runtime is set to docker. (DEPRECATED: will be removed along with dockershim.)
--config string
The Kubelet will load its initial configuration from this file. The path may be absolute or relative; relative paths start at the Kubelet's current working directory. Omit this flag to use the built-in default configuration values. Command-line flags override configuration from this file.
--container-log-max-files int32     Default: 5
<Warning: Beta feature> Set the maximum number of container log files that can be present for a container. The number must be >= 2. This flag can only be used with --container-runtime=remote. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--container-log-max-size string     Default: 10Mi
<Warning: Beta feature> Set the maximum size (e.g. 10Mi) of a container log file before it is rotated. This flag can only be used with --container-runtime=remote. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--container-runtime string     Default: docker
The container runtime to use. Possible values: docker, remote.
--container-runtime-endpoint string     Default: unix:///var/run/dockershim.sock
[Experimental] The endpoint of the remote runtime service. Currently unix socket endpoints are supported on Linux, while npipe and tcp endpoints are supported on Windows. Examples: unix:///var/run/dockershim.sock, npipe:////./pipe/dockershim.
--contention-profiling
Enable lock contention profiling, if profiling is enabled. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--cpu-cfs-quota     Default: true
Enable CPU CFS quota enforcement for containers that specify CPU limits. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--cpu-cfs-quota-period duration     Default: 100ms
Sets the CPU CFS quota period value, cpu.cfs_period_us; defaults to the Linux kernel default. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--cpu-manager-policy string     Default: none
CPU Manager policy to use. Possible values: none, static. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--cpu-manager-policy-options mapStringString
Comma-separated list of options to fine-tune the behavior of the selected CPU Manager policy. If not supplied, keep the default behaviour. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--cpu-manager-reconcile-period duration     Default: 10s
<Warning: Alpha feature> CPU Manager reconciliation period. Examples: 10s, or 1m. If not supplied, defaults to the node status update frequency. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--docker-endpoint string     Default: unix:///var/run/docker.sock
Use this for the docker endpoint to communicate with. This docker-specific flag only works when container-runtime is set to docker. (DEPRECATED: will be removed along with dockershim.)
--dynamic-config-dir string
The Kubelet will use this directory for checkpointing downloaded configurations and tracking configuration health. The Kubelet will create this directory if it does not already exist. The path may be absolute or relative; relative paths start at the Kubelet's current working directory. Providing this flag enables dynamic Kubelet configuration. The DynamicKubeletConfig feature gate must be enabled to pass this flag. (DEPRECATED: Feature DynamicKubeletConfig is deprecated in 1.22 and will not move to GA. It is planned to be removed from Kubernetes in version 1.24 or later. Please use alternative ways to update kubelet configuration.)
--enable-controller-attach-detach     Default: true
Enables the Attach/Detach controller to manage attachment/detachment of volumes scheduled to this node, and disables kubelet from executing any attach/detach operations. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--enable-debugging-handlers     Default: true
Enables server endpoints for log collection and local running of containers and commands. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--enable-server     Default: true
Enable the Kubelet's server. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--enforce-node-allocatable strings     Default: pods
A comma-separated list of levels of node allocatable enforcement to be enforced by kubelet. Acceptable options are none, pods, system-reserved, and kube-reserved. If the latter two options are specified, --system-reserved-cgroup and --kube-reserved-cgroup must also be set, respectively. If none is specified, no additional options should be set. See https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/ for more details. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--event-burst int32     Default: 10
Maximum size of a burst of event records; temporarily allows event records to burst to this number, while still not exceeding --event-qps. The number must be >= 0. If 0, the default burst (10) is used. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--event-qps int32     Default: 5
QPS to limit event creations. The number must be >= 0. If 0, the default QPS (5) is used. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--eviction-hard mapStringString     Default: imagefs.available<15%,memory.available<100Mi,nodefs.available<10%
A set of eviction thresholds (e.g. memory.available<1Gi) that, if met, would trigger a pod eviction. On a Linux node, the default value also includes nodefs.inodesFree<5%. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--eviction-max-pod-grace-period int32
Maximum allowed grace period (in seconds) to use when terminating pods in response to a soft eviction threshold being met. If negative, defer to the pod-specified value. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--eviction-minimum-reclaim mapStringString
A set of minimum reclaims (e.g. imagefs.available=2Gi) that describes the minimum amount of resource the kubelet will reclaim when performing a pod eviction if that resource is under pressure. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--eviction-pressure-transition-period duration     Default: 5m0s
Duration for which the kubelet has to wait before transitioning out of an eviction pressure condition. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--eviction-soft mapStringString
A set of eviction thresholds (e.g. memory.available<1.5Gi) that, if met over a corresponding grace period, would trigger a pod eviction. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--eviction-soft-grace-period mapStringString
A set of eviction grace periods (e.g. memory.available=1m30s) that correspond to how long a soft eviction threshold must hold before triggering a pod eviction. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--exit-on-lock-contention
Whether kubelet should exit upon lock-file contention.
--experimental-allocatable-ignore-eviction     Default: false
When set to true, hard eviction thresholds will be ignored while calculating node allocatable. See https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/ for more details. (DEPRECATED: will be removed in 1.24 or later)
--experimental-check-node-capabilities-before-mount
[Experimental] If set to true, the kubelet will check the underlying node for required components (binaries, etc.) before performing the mount. (DEPRECATED: will be removed in 1.24 or later, in favor of using CSI.)
--experimental-kernel-memcg-notification
Use kernelMemcgNotification configuration; this flag will be removed in 1.24 or later. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--experimental-log-sanitization bool
[Experimental] When enabled, prevents logging of fields tagged as sensitive (passwords, keys, tokens). Runtime log sanitization may introduce significant computation overhead and therefore should not be enabled in production. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--experimental-mounter-path string     Default: mount
[Experimental] Path of mounter binary. Leave empty to use the default mount. (DEPRECATED: will be removed in 1.24 or later, in favor of using CSI.)
--fail-swap-on     Default: true
Makes the Kubelet fail to start if swap is enabled on the node. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--feature-gates <A list of 'key=true/false' pairs>
A set of key=value pairs that describe feature gates for alpha/experimental features. Options are: APIListChunking=true|false (BETA - default=true) APIPriorityAndFairness=true|false (BETA - default=true) APIResponseCompression=true|false (BETA - default=true) APIServerIdentity=true|false (ALPHA - default=false) APIServerTracing=true|false (ALPHA - default=false) AllAlpha=true|false (ALPHA - default=false) AllBeta=true|false (BETA - default=false) AnyVolumeDataSource=true|false (ALPHA - default=false) AppArmor=true|false (BETA - default=true) CPUManager=true|false (BETA - default=true) CPUManagerPolicyAlphaOptions=true|false (ALPHA - default=false) CPUManagerPolicyBetaOptions=true|false (BETA - default=true) CPUManagerPolicyOptions=true|false (ALPHA - default=false) CSIInlineVolume=true|false (BETA - default=true) CSIMigration=true|false (BETA - default=true) CSIMigrationAWS=true|false (BETA - default=false) CSIMigrationAzureDisk=true|false (BETA - default=true) CSIMigrationAzureFile=true|false (BETA - default=false) CSIMigrationGCE=true|false (BETA - default=true) CSIMigrationOpenStack=true|false (BETA - default=true) CSIMigrationPortworx=true|false (ALPHA - default=false) CSIMigrationvSphere=true|false (BETA - default=false) CSIStorageCapacity=true|false (BETA - default=true) CSIVolumeHealth=true|false (ALPHA - default=false) CSRDuration=true|false (BETA - default=true) ControllerManagerLeaderMigration=true|false (BETA - default=true) CustomCPUCFSQuotaPeriod=true|false (ALPHA - default=false) CustomResourceValidationExpressions=true|false (ALPHA - default=false) DaemonSetUpdateSurge=true|false (BETA - default=true) DefaultPodTopologySpread=true|false (BETA - default=true) DelegateFSGroupToCSIDriver=true|false (BETA - default=true) DevicePlugins=true|false (BETA - default=true) DisableAcceleratorUsageMetrics=true|false (BETA - default=true) DisableCloudProviders=true|false (ALPHA - default=false) DisableKubeletCloudCredentialProviders=true|false (ALPHA - default=false) DownwardAPIHugePages=true|false (BETA - default=true) EfficientWatchResumption=true|false (BETA - default=true) EndpointSliceTerminatingCondition=true|false (BETA - default=true) EphemeralContainers=true|false (BETA - default=true) ExpandCSIVolumes=true|false (BETA - default=true) ExpandInUsePersistentVolumes=true|false (BETA - default=true) ExpandPersistentVolumes=true|false (BETA - default=true) ExpandedDNSConfig=true|false (ALPHA - default=false) ExperimentalHostUserNamespaceDefaulting=true|false (BETA - default=false) GRPCContainerProbe=true|false (ALPHA - default=false) GracefulNodeShutdown=true|false (BETA - default=true) GracefulNodeShutdownBasedOnPodPriority=true|false (ALPHA - default=false) HPAContainerMetrics=true|false (ALPHA - default=false) HPAScaleToZero=true|false (ALPHA - default=false) HonorPVReclaimPolicy=true|false (ALPHA - default=false) IdentifyPodOS=true|false (ALPHA - default=false) InTreePluginAWSUnregister=true|false (ALPHA - default=false) InTreePluginAzureDiskUnregister=true|false (ALPHA - default=false) InTreePluginAzureFileUnregister=true|false (ALPHA - default=false) InTreePluginGCEUnregister=true|false (ALPHA - default=false) InTreePluginOpenStackUnregister=true|false (ALPHA - default=false) InTreePluginPortworxUnregister=true|false (ALPHA - default=false) InTreePluginRBDUnregister=true|false (ALPHA - default=false) InTreePluginvSphereUnregister=true|false (ALPHA - default=false) IndexedJob=true|false (BETA - default=true) JobMutableNodeSchedulingDirectives=true|false (BETA - default=true) JobReadyPods=true|false (ALPHA - default=false) JobTrackingWithFinalizers=true|false (BETA - default=true) KubeletCredentialProviders=true|false (ALPHA - default=false) KubeletInUserNamespace=true|false (ALPHA - default=false) KubeletPodResources=true|false (BETA - default=true) KubeletPodResourcesGetAllocatable=true|false (BETA - default=true) LocalStorageCapacityIsolation=true|false (BETA - default=true) LocalStorageCapacityIsolationFSQuotaMonitoring=true|false (ALPHA - default=false) LogarithmicScaleDown=true|false (BETA - default=true) MemoryManager=true|false (BETA - default=true) MemoryQoS=true|false (ALPHA - default=false) MixedProtocolLBService=true|false (ALPHA - default=false) NetworkPolicyEndPort=true|false (BETA - default=true) NodeSwap=true|false (ALPHA - default=false) NonPreemptingPriority=true|false (BETA - default=true) OpenAPIEnums=true|false (ALPHA - default=false) OpenAPIV3=true|false (ALPHA - default=false) PodAffinityNamespaceSelector=true|false (BETA - default=true) PodAndContainerStatsFromCRI=true|false (ALPHA - default=false) PodDeletionCost=true|false (BETA - default=true) PodOverhead=true|false (BETA - default=true) PodSecurity=true|false (BETA - default=true) PreferNominatedNode=true|false (BETA - default=true) ProbeTerminationGracePeriod=true|false (BETA - default=false) ProcMountType=true|false (ALPHA - default=false) ProxyTerminatingEndpoints=true|false (ALPHA - default=false) QOSReserved=true|false (ALPHA - default=false) ReadWriteOncePod=true|false (ALPHA - default=false) RecoverVolumeExpansionFailure=true|false (ALPHA - default=false) RemainingItemCount=true|false (BETA - default=true) RemoveSelfLink=true|false (BETA - default=true) RotateKubeletServerCertificate=true|false (BETA - default=true) SeccompDefault=true|false (ALPHA - default=false) ServiceInternalTrafficPolicy=true|false (BETA - default=true) ServiceLBNodePortControl=true|false (BETA - default=true) ServiceLoadBalancerClass=true|false (BETA - default=true) SizeMemoryBackedVolumes=true|false (BETA - default=true) StatefulSetAutoDeletePVC=true|false (ALPHA - default=false) StatefulSetMinReadySeconds=true|false (BETA - default=true) StorageVersionAPI=true|false (ALPHA - default=false) StorageVersionHash=true|false (BETA - default=true) SuspendJob=true|false (BETA - default=true) TopologyAwareHints=true|false (BETA - default=true) TopologyManager=true|false (BETA - default=true) VolumeCapacityPriority=true|false (ALPHA - default=false) WinDSR=true|false (ALPHA - default=false) WinOverlay=true|false (BETA - default=true) WindowsHostProcessContainers=true|false (BETA - default=true) csiMigrationRBD=true|false (ALPHA - default=false) (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--file-check-frequency duration     Default: 20s
Duration between checking config files for new data. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--hairpin-mode string     Default: promiscuous-bridge
How the kubelet should set up hairpin NAT. This allows endpoints of a Service to load balance back to themselves if they should try to access their own Service. Valid values are promiscuous-bridge, hairpin-veth and none. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--healthz-bind-address string     Default: 127.0.0.1
The IP address for the healthz server to serve on (set to 0.0.0.0 or :: for listening on all interfaces and IP families). (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--healthz-port int32     Default: 10248
The port of the localhost healthz endpoint (set to 0 to disable). (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
-h, --help
help for kubelet
--hostname-override string
If non-empty, will use this string as identification instead of the actual hostname. If --cloud-provider is set, the cloud provider determines the name of the node (consult cloud provider documentation to determine if and how the hostname is used).
--http-check-frequency duration     Default: 20s
Duration between checking HTTP for new data. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--image-credential-provider-bin-dir string
The path to the directory where credential provider plugin binaries are located.
--image-credential-provider-config string
The path to the credential provider plugin config file.
--image-gc-high-threshold int32     Default: 85
The percent of disk usage after which image garbage collection is always run. Values must be within the range [0, 100]. To disable image garbage collection, set to 100. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--image-gc-low-threshold int32     Default: 80
The percent of disk usage before which image garbage collection is never run. Lowest disk usage to garbage collect to. Values must be within the range [0, 100] and should not be larger than that of --image-gc-high-threshold. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--image-pull-progress-deadline duration     Default: 1m0s
If no pulling progress is made before this deadline, the image pulling will be cancelled. This docker-specific flag only works when container-runtime is set to docker. (DEPRECATED: will be removed along with dockershim.)
--image-service-endpoint string
[Experimental] The endpoint of the remote image service. If not specified, it will be the same as --container-runtime-endpoint by default. Currently UNIX socket endpoints are supported on Linux, while npipe and TCP endpoints are supported on Windows. Examples: unix:///var/run/dockershim.sock, npipe:////./pipe/dockershim
--iptables-drop-bit int32     Default: 15
The bit of the fwmark space to mark packets for dropping. Must be within the range [0, 31]. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--iptables-masquerade-bit int32     Default: 14
The bit of the fwmark space to mark packets for SNAT. Must be within the range [0, 31]. Please match this parameter with the corresponding parameter in kube-proxy. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--keep-terminated-pod-volumes
Keep terminated pod volumes mounted to the node after the pod terminates. Can be useful for debugging volume-related issues. (DEPRECATED: will be removed in a future version)
--kernel-memcg-notification
If enabled, the kubelet will integrate with the kernel memcg notification to determine if memory eviction thresholds are crossed rather than polling. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--kube-api-burst int32     Default: 10
Burst to use while talking with the kubernetes API server. The number must be >= 0. If 0, the default burst (10) is used. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--kube-api-content-type string     Default: application/vnd.kubernetes.protobuf
Content type of requests sent to the apiserver. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--kube-api-qps int32     Default: 5
QPS to use while talking with the kubernetes API server. The number must be >= 0. If 0, the default QPS (5) is used. Doesn't cover the events and node heartbeat APIs, whose rate limiting is controlled by a different set of flags. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--kube-reserved mapStringString     Default: <None>
A set of <resource name>=<resource quantity> (e.g. cpu=200m,memory=500Mi,ephemeral-storage=1Gi,pid='100') pairs that describe resources reserved for kubernetes system components. Currently cpu, memory and local ephemeral-storage for the root file system are supported. See http://kubernetes.io/docs/user-guide/compute-resources for more detail. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--kube-reserved-cgroup string     Default: ''
Absolute name of the top level cgroup that is used to manage kubernetes components for which compute resources were reserved via the --kube-reserved flag. Ex. /kube-reserved. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--kubeconfig string
Path to a kubeconfig file, specifying how to connect to the API server. Providing --kubeconfig enables API server mode; omitting --kubeconfig enables standalone mode.
--kubelet-cgroups string
Optional absolute name of cgroups to create and run the Kubelet in. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--lock-file string
<Warning: Alpha feature> The path to the file kubelet should use as a lock file.
--log-backtrace-at <A string of format 'file:line'>     Default: ":0"
When logging hits the specified file:line, emit a stack trace. (DEPRECATED: will be removed in a future release, see https://github.com/kubernetes/enhancements/tree/master/keps/sig-instrumentation/2845-deprecate-klog-specific-flags-in-k8s-components)
--log-dir string
If non-empty, write log files in this directory. (DEPRECATED: will be removed in a future release, see https://github.com/kubernetes/enhancements/tree/master/keps/sig-instrumentation/2845-deprecate-klog-specific-flags-in-k8s-components)
--log-file string
If non-empty, use this log file. (DEPRECATED: will be removed in a future release, see https://github.com/kubernetes/enhancements/tree/master/keps/sig-instrumentation/2845-deprecate-klog-specific-flags-in-k8s-components)
--log-file-max-size uint     Default: 1800
Defines the maximum size a log file can grow to. Unit is megabytes. If the value is 0, the maximum file size is unlimited. (DEPRECATED: will be removed in a future release, see https://github.com/kubernetes/enhancements/tree/master/keps/sig-instrumentation/2845-deprecate-klog-specific-flags-in-k8s-components)
--log-flush-frequency duration     Default: 5s
Maximum number of seconds between log flushes.
--log-json-info-buffer-size string     Default: '0'
[Experimental] In JSON format with split output streams, the info messages can be buffered for a while to increase performance. The default value of zero bytes disables buffering. The size can be specified as number of bytes (512), multiples of 1000 (1K), multiples of 1024 (2Ki), or powers of those (3M, 4G, 5Mi, 6Gi). (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--log-json-split-stream
[Experimental] In JSON format, write error messages to stderr and info messages to stdout. The default is to write a single stream to stdout. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--logging-format string     Default: text
Sets the log format. Permitted formats: text, json. Non-default formats don't honor these flags: --add-dir-header, --alsologtostderr, --log-backtrace-at, --log-dir, --log-file, --log-file-max-size, --logtostderr, --skip_headers, --skip_log_headers, --stderrthreshold, --log-flush-frequency. Non-default choices are currently alpha and subject to change without warning. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--logtostderr     Default: true
Log to standard error instead of files. (DEPRECATED: will be removed in a future release, see https://github.com/kubernetes/enhancements/tree/master/keps/sig-instrumentation/2845-deprecate-klog-specific-flags-in-k8s-components)
--make-iptables-util-chains     Default: true
If true, kubelet will ensure iptables utility rules are present on the host. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--manifest-url string
URL for accessing additional Pod specifications to run. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--manifest-url-header string
Comma-separated list of HTTP headers to use when accessing the URL provided to --manifest-url. Multiple headers with the same name will be added in the same order provided. This flag can be repeatedly invoked. For example: --manifest-url-header 'a:hello,b:again,c:world' --manifest-url-header 'b:beautiful' (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--master-service-namespace string     Default: default
The namespace from which the kubernetes master services should be injected into pods. (DEPRECATED: This flag will be removed in a future version.)
--max-open-files int     Default: 1000000
Number of files that can be opened by the Kubelet process. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--max-pods int32     Default: 110
Number of Pods that can run on this Kubelet. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--maximum-dead-containers int32     Default: -1
Maximum number of old instances of containers to retain globally. Each container takes up some disk space. To disable, set to a negative number. (DEPRECATED: Use --eviction-hard or --eviction-soft instead. Will be removed in a future version.)
--maximum-dead-containers-per-container int32     Default: 1
Maximum number of old instances to retain per container. Each container takes up some disk space. (DEPRECATED: Use --eviction-hard or --eviction-soft instead. Will be removed in a future version.)
--memory-manager-policy string     Default: None
Memory Manager policy to use. Possible values: 'None', 'Static'. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--minimum-container-ttl-duration duration
Minimum age for a finished container before it is garbage collected. Examples: '300ms', '10s' or '2h45m'. (DEPRECATED: Use --eviction-hard or --eviction-soft instead. Will be removed in a future version.)
--minimum-image-ttl-duration duration     Default: 2m0s
Minimum age for an unused image before it is garbage collected. Examples: '300ms', '10s' or '2h45m'. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--network-plugin string
The name of the network plugin to be invoked for various events in the kubelet/pod lifecycle. This docker-specific flag only works when container-runtime is set to docker. (DEPRECATED: will be removed along with dockershim.)
--network-plugin-mtu int32
The MTU to be passed to the network plugin, to override the default. Set to 0 to use the default 1460 MTU. This docker-specific flag only works when container-runtime is set to docker. (DEPRECATED: will be removed along with dockershim.)
--node-ip string
IP address (or comma-separated dual-stack IP addresses) of the node. If unset, kubelet will use the node's default IPv4 address, if any, or its default IPv6 address if it has no IPv4 addresses. You can pass '::' to make it prefer the default IPv6 address rather than the default IPv4 address.
--node-labels mapStringString
<Warning: Alpha feature> Labels to add when registering the node in the cluster. Labels must be key=value pairs separated by ','. Labels in the 'kubernetes.io' namespace must begin with an allowed prefix ('kubelet.kubernetes.io', 'node.kubernetes.io') or be in the specifically allowed set ('beta.kubernetes.io/arch', 'beta.kubernetes.io/instance-type', 'beta.kubernetes.io/os', 'failure-domain.beta.kubernetes.io/region', 'failure-domain.beta.kubernetes.io/zone', 'kubernetes.io/arch', 'kubernetes.io/hostname', 'kubernetes.io/os', 'node.kubernetes.io/instance-type', 'topology.kubernetes.io/region', 'topology.kubernetes.io/zone')
--node-status-max-images int32     Default: 50
The maximum number of images to report in node.status.images. If -1 is specified, no cap will be applied. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--node-status-update-frequency duration     Default: 10s
Specifies how often kubelet posts node status to the master. Note: be cautious when changing the constant; it must work with nodeMonitorGracePeriod in the Node controller. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--non-masquerade-cidr string     Default: 10.0.0.0/8
Traffic to IPs outside this range will use IP masquerade. Set to '0.0.0.0/0' to never masquerade. (DEPRECATED: will be removed in a future version)
--one-output
If true, only write logs to their native severity level (vs also writing to each lower severity level). (DEPRECATED: will be removed in a future release, see https://github.com/kubernetes/enhancements/tree/master/keps/sig-instrumentation/2845-deprecate-klog-specific-flags-in-k8s-components)
--oom-score-adj int32     Default: -999
The oom-score-adj value for the kubelet process. Values must be within the range [-1000, 1000]. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
--pod-cidr string | |
The CIDR to use for pod IP addresses, only used in standalone mode. In cluster mode, this is obtained from the master. For IPv6, the maximum number of IP's allocated is 65536 (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.) |
|
--pod-infra-container-image string Default: k8s.gcr.io/pause:3.6 |
|
Specified image will not be pruned by the image garbage collector. When container-runtime is set to docker , all containers in each pod will use the network/IPC namespaces from this image. Other CRI implementations have their own configuration to set this image. |
|
--pod-manifest-path string | |
Path to the directory containing static pod files to run, or the path to a single static pod file. Files starting with dots will be ignored. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.) |
|
--pod-max-pids int Default: -1 | |
Set the maximum number of processes per pod. If -1 , the kubelet defaults to the node allocatable PID capacity. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.) |
|
--pods-per-core int32 | |
Number of Pods per core that can run on this kubelet. The total number of pods on this kubelet cannot exceed --max-pods , so --max-pods will be used if this calculation results in a larger number of pods allowed on the kubelet. A value of 0 disables this limit. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.) |
|
--port int32 Default: 10250 | |
The port for the kubelet to serve on. (DEPRECATED: This parameter should be set via the config file specified by the kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.) |
|
--protect-kernel-defaults | |
Default kubelet behaviour for kernel tuning. If set, kubelet errors if any of kernel tunables is different than kubelet defaults. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.) |
|
--provider-id string | |
Unique identifier for identifying the node in a machine database, i.e cloud provider. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.) |
|
--qos-reserved mapStringString | |
<Warning: Alpha feature> A set of <resource name>=<percentage> (e.g. memory=50% ) pairs that describe how pod resource requests are reserved at the QoS level. Currently only memory is supported. Requires the QOSReserved feature gate to be enabled. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.) |
|
--read-only-port int32 Default: 10255 | |
The read-only port for the kubelet to serve on with no authentication/authorization (set to 0 to disable). (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.) |
|
--really-crash-for-testing | |
If true, crash when panics occur. Intended for testing. (DEPRECATED: will be removed in a future version.) |

--register-node Default: true |
|
Register the node with the API server. If --kubeconfig is not provided, this flag is irrelevant, as the Kubelet won't have an API server to register with. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.) |
|
--register-schedulable Default: true |
|
Register the node as schedulable. Won't have any effect if --register-node is false . (DEPRECATED: will be removed in a future version) |
|
--register-with-taints mapStringString | |
Register the node with the given list of taints (comma separated <key>=<value>:<effect> ). No-op if --register-node is false . (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.) |
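
As an illustration of the comma-separated <key>=<value>:<effect> format, a hypothetical invocation might look like the following (the key/value pairs are invented; NoSchedule and NoExecute are standard taint effects):

kubelet --register-node=true \
  --register-with-taints=dedicated=gpu:NoSchedule,env=staging:NoExecute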
|
--registry-burst int32 Default: 10 | |
Maximum size of bursty pulls; temporarily allows pulls to burst to this number, while still not exceeding --registry-qps. Only used if --registry-qps is greater than 0. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.) |
|
--registry-qps int32 Default: 5 | |
If > 0, limit registry pull QPS to this value. If 0 , unlimited. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.) |
|
--reserved-cpus string | |
A comma-separated list of CPUs or CPU ranges that are reserved for system and kubernetes usage. This specific list will supersede cpu counts in --system-reserved and --kube-reserved . (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.) |
|
--reserved-memory string | |
A comma-separated list of memory reservations for NUMA nodes. (e.g. --reserved-memory 0:memory=1Gi,hugepages-1M=2Gi --reserved-memory 1:memory=2Gi ). The total sum for each memory type should be equal to the sum of --kube-reserved , --system-reserved and --eviction-threshold . See https://kubernetes.io/docs/tasks/administer-cluster/memory-manager/#reserved-memory-flag for more details. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.) |
|
--resolv-conf string Default: /etc/resolv.conf |
|
Resolver configuration file used as the basis for the container DNS resolution configuration. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.) |
|
--root-dir string Default: /var/lib/kubelet |
|
Directory path for managing kubelet files (volume mounts, etc.). |

--rotate-certificates | |
<Warning: Beta feature> Auto rotate the kubelet client certificates by requesting new certificates from the kube-apiserver when the certificate expiration approaches. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.) |
|
--rotate-server-certificates | |
Auto-request and rotate the kubelet serving certificates by requesting new certificates from the kube-apiserver when the certificate expiration approaches. Requires the RotateKubeletServerCertificate feature gate to be enabled, and approval of the submitted CertificateSigningRequest objects. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.) |
|
--runonce | |
If true, exit after spawning pods from local manifests or remote URLs. Exclusive with --enable-server. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.) |
|
--runtime-cgroups string | |
Optional absolute name of cgroups to create and run the runtime in. |

--runtime-request-timeout duration Default: 2m0s |
|
Timeout for all runtime requests except long-running requests (pull, logs, exec, and attach). When the timeout is exceeded, the kubelet will cancel the request, throw an error, and retry later. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.) |
|
--seccomp-default RuntimeDefault | |
<Warning: Alpha feature> Enable the use of RuntimeDefault as the default seccomp profile for all workloads. The SeccompDefault feature gate must be enabled to allow this flag, which is disabled by default. |
|
--serialize-image-pulls Default: true |
|
Pull images one at a time. We recommend *not* changing the default value on nodes that run docker daemon with version < 1.9 or an aufs storage backend. Issue #10959 has more details. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.) |
|
--skip-headers | |
If true , avoid header prefixes in the log messages. (DEPRECATED: will be removed in a future release, see https://github.com/kubernetes/enhancements/tree/master/keps/sig-instrumentation/2845-deprecate-klog-specific-flags-in-k8s-components) |
|
--skip-log-headers | |
If true , avoid headers when opening log files. (DEPRECATED: will be removed in a future release, see https://github.com/kubernetes/enhancements/tree/master/keps/sig-instrumentation/2845-deprecate-klog-specific-flags-in-k8s-components) |
|
--stderrthreshold int Default: 2 | |
Logs at or above this threshold go to stderr. (DEPRECATED: will be removed in a future release, see https://github.com/kubernetes/enhancements/tree/master/keps/sig-instrumentation/2845-deprecate-klog-specific-flags-in-k8s-components) |

--streaming-connection-idle-timeout duration Default: 4h0m0s |
|
Maximum time a streaming connection can be idle before the connection is automatically closed. 0 indicates no timeout. Example: 5m . Note: All connections to the kubelet server have a maximum duration of 4 hours. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.) |
|
--sync-frequency duration Default: 1m0s |
|
Max period between synchronizing running containers and config. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.) |
|
--system-cgroups string | |
Optional absolute name of cgroups in which to place all non-kernel processes that are not already inside a cgroup under '/' . Empty for no container. Rolling back the flag requires a reboot. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.) |
|
--system-reserved mapStringString Default: <none> | |
A set of <resource name>=<resource quantity> (e.g. cpu=200m,memory=500Mi,ephemeral-storage=1Gi,pid='100' ) pairs that describe resources reserved for non-kubernetes components. Currently only cpu and memory are supported. See http://kubernetes.io/docs/user-guide/compute-resources for more detail. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.) |
|
--system-reserved-cgroup string Default: '' |
|
Absolute name of the top level cgroup that is used to manage non-kubernetes components for which compute resources were reserved via --system-reserved flag. Ex. /system-reserved . (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.) |
|
--tls-cert-file string | |
File containing x509 Certificate used for serving HTTPS (with intermediate certs, if any, concatenated after server cert). If --tls-cert-file and --tls-private-key-file are not provided, a self-signed certificate and key are generated for the public address and saved to the directory passed to --cert-dir . (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.) |
|
--tls-cipher-suites strings | |
Comma-separated list of cipher suites for the server. If omitted, the default Go cipher suites will be used. Preferred values: TLS_AES_128_GCM_SHA256, TLS_AES_256_GCM_SHA384, TLS_CHACHA20_POLY1305_SHA256, TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA, TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256, TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA, TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384, TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305, TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256, TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA, TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256, TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA, TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384, TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305, TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256, TLS_RSA_WITH_AES_128_CBC_SHA, TLS_RSA_WITH_AES_128_GCM_SHA256, TLS_RSA_WITH_AES_256_CBC_SHA, TLS_RSA_WITH_AES_256_GCM_SHA384. Insecure values: TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256, TLS_ECDHE_ECDSA_WITH_RC4_128_SHA, TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256, TLS_ECDHE_RSA_WITH_RC4_128_SHA, TLS_RSA_WITH_AES_128_CBC_SHA256, TLS_RSA_WITH_RC4_128_SHA. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.) |

--tls-min-version string | |
Minimum TLS version supported. Possible values: VersionTLS10 , VersionTLS11 , VersionTLS12 , VersionTLS13 . (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.) |
|
--tls-private-key-file string | |
File containing x509 private key matching --tls-cert-file . (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.) |
|
--topology-manager-policy string Default: 'none' |
|
Topology Manager policy to use. Possible values: 'none' , 'best-effort' , 'restricted' , 'single-numa-node' . (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.) |
|
--topology-manager-scope string Default: container |
|
Scope to which topology hints are applied. The Topology Manager collects hints from Hint Providers and applies them to the defined scope to ensure pod admission. Possible values: 'container' , 'pod' . (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.) |
|
-v, --v Level | |
Number for the log level verbosity |

--version version[=true] | |
Print version information and quit |

--vmodule <A list of 'pattern=N' string> | |
Comma-separated list of pattern=N settings for file-filtered logging |
|
--volume-plugin-dir string Default: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ |
|
The full path of the directory in which to search for additional third party volume plugins. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.) |
|
--volume-stats-agg-period duration Default: 1m0s |
|
Specifies interval for kubelet to calculate and cache the volume disk usage for all pods and volumes. To disable volume calculations, set to 0 . (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.) |
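
Because most of the flags above are deprecated in favor of the kubelet config file, here is a minimal sketch of an equivalent KubeletConfiguration; the field names follow the kubelet.config.k8s.io/v1beta1 API and the values simply mirror the flag defaults documented above:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
port: 10250                # replaces --port
readOnlyPort: 0            # replaces --read-only-port (0 disables the port)
registryPullQPS: 5         # replaces --registry-qps
registryBurst: 10          # replaces --registry-burst
serializeImagePulls: true  # replaces --serialize-image-pulls
volumeStatsAggPeriod: 1m   # replaces --volume-stats-agg-period

Such a file would then be passed to the kubelet with --config <path>, as described in the kubelet-config-file task linked from the deprecation notices.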
6.11.3 - kube-apiserver
Synopsis
The Kubernetes API server validates and configures data for the API objects, which include pods, services, replicationcontrollers, and others. The API Server services REST operations and provides the frontend to the cluster's shared state through which all other components interact.
kube-apiserver [flags]
Options
--admission-control-config-file string | |
File with admission control configuration. |
|
--advertise-address string | |
The IP address on which to advertise the apiserver to members of the cluster. This address must be reachable by the rest of the cluster. If blank, the --bind-address will be used. If --bind-address is unspecified, the host's default interface will be used. |
|
--allow-metric-labels stringToString Default: [] | |
The map from metric-label to value allow-list of this label. The key's format is <MetricName>,<LabelName>. The value's format is <allowed_value>,<allowed_value>... e.g. metric1,label1='v1,v2,v3', metric1,label2='v1,v2,v3', metric2,label1='v1,v2,v3'. |
|
--allow-privileged | |
If true, allow privileged containers. [default=false] |
|
--anonymous-auth Default: true | |
Enables anonymous requests to the secure port of the API server. Requests that are not rejected by another authentication method are treated as anonymous requests. Anonymous requests have a username of system:anonymous, and a group name of system:unauthenticated. |
|
--api-audiences strings | |
Identifiers of the API. The service account token authenticator will validate that tokens used against the API are bound to at least one of these audiences. If the --service-account-issuer flag is configured and this flag is not, this field defaults to a single element list containing the issuer URL. |
|
--apiserver-count int Default: 1 | |
The number of apiservers running in the cluster, must be a positive number. (In use when --endpoint-reconciler-type=master-count is enabled.) |
|
--audit-log-batch-buffer-size int Default: 10000 | |
The size of the buffer to store events before batching and writing. Only used in batch mode. |
|
--audit-log-batch-max-size int Default: 1 | |
The maximum size of a batch. Only used in batch mode. |
|
--audit-log-batch-max-wait duration | |
The amount of time to wait before force writing the batch that hadn't reached the max size. Only used in batch mode. |
|
--audit-log-batch-throttle-burst int | |
Maximum number of requests sent at the same moment if ThrottleQPS was not utilized before. Only used in batch mode. |
|
--audit-log-batch-throttle-enable | |
Whether batching throttling is enabled. Only used in batch mode. |
|
--audit-log-batch-throttle-qps float | |
Maximum average number of batches per second. Only used in batch mode. |
|
--audit-log-compress | |
If set, the rotated log files will be compressed using gzip. |
|
--audit-log-format string Default: "json" | |
Format of saved audits. "legacy" indicates 1-line text format for each event. "json" indicates structured json format. Known formats are legacy,json. |
|
--audit-log-maxage int | |
The maximum number of days to retain old audit log files based on the timestamp encoded in their filename. |
|
--audit-log-maxbackup int | |
The maximum number of old audit log files to retain. Setting a value of 0 will mean there's no restriction on the number of files. |
|
--audit-log-maxsize int | |
The maximum size in megabytes of the audit log file before it gets rotated. |
|
--audit-log-mode string Default: "blocking" | |
Strategy for sending audit events. Blocking indicates sending events should block server responses. Batch causes the backend to buffer and write events asynchronously. Known modes are batch,blocking,blocking-strict. |
|
--audit-log-path string | |
If set, all requests coming to the apiserver will be logged to this file. '-' means standard out. |
|
--audit-log-truncate-enabled | |
Whether event and batch truncating is enabled. |
|
--audit-log-truncate-max-batch-size int Default: 10485760 | |
Maximum size of the batch sent to the underlying backend. Actual serialized size can be several hundreds of bytes greater. If a batch exceeds this limit, it is split into several batches of smaller size. |
|
--audit-log-truncate-max-event-size int Default: 102400 | |
Maximum size of the audit event sent to the underlying backend. If the size of an event is greater than this number, first request and response are removed, and if this doesn't reduce the size enough, event is discarded. |
|
--audit-log-version string Default: "audit.k8s.io/v1" | |
API group and version used for serializing audit events written to log. |
|
--audit-policy-file string | |
Path to the file that defines the audit policy configuration. |
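
As a minimal sketch of what such a policy file can contain, the following audit.k8s.io/v1 Policy records every request at the Metadata level:

apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Log request metadata (requesting user, timestamp, resource, verb)
  # for all requests; request and response bodies are not recorded.
  - level: Metadata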
|
--audit-webhook-batch-buffer-size int Default: 10000 | |
The size of the buffer to store events before batching and writing. Only used in batch mode. |
|
--audit-webhook-batch-max-size int Default: 400 | |
The maximum size of a batch. Only used in batch mode. |
|
--audit-webhook-batch-max-wait duration Default: 30s | |
The amount of time to wait before force writing the batch that hadn't reached the max size. Only used in batch mode. |
|
--audit-webhook-batch-throttle-burst int Default: 15 | |
Maximum number of requests sent at the same moment if ThrottleQPS was not utilized before. Only used in batch mode. |
|
--audit-webhook-batch-throttle-enable Default: true | |
Whether batching throttling is enabled. Only used in batch mode. |
|
--audit-webhook-batch-throttle-qps float Default: 10 | |
Maximum average number of batches per second. Only used in batch mode. |
|
--audit-webhook-config-file string | |
Path to a kubeconfig formatted file that defines the audit webhook configuration. |
|
--audit-webhook-initial-backoff duration Default: 10s | |
The amount of time to wait before retrying the first failed request. |
|
--audit-webhook-mode string Default: "batch" | |
Strategy for sending audit events. Blocking indicates sending events should block server responses. Batch causes the backend to buffer and write events asynchronously. Known modes are batch,blocking,blocking-strict. |
|
--audit-webhook-truncate-enabled | |
Whether event and batch truncating is enabled. |
|
--audit-webhook-truncate-max-batch-size int Default: 10485760 | |
Maximum size of the batch sent to the underlying backend. Actual serialized size can be several hundreds of bytes greater. If a batch exceeds this limit, it is split into several batches of smaller size. |
|
--audit-webhook-truncate-max-event-size int Default: 102400 | |
Maximum size of the audit event sent to the underlying backend. If the size of an event is greater than this number, first request and response are removed, and if this doesn't reduce the size enough, event is discarded. |
|
--audit-webhook-version string Default: "audit.k8s.io/v1" | |
API group and version used for serializing audit events written to webhook. |
|
--authentication-token-webhook-cache-ttl duration Default: 2m0s | |
The duration to cache responses from the webhook token authenticator. |
|
--authentication-token-webhook-config-file string | |
File with webhook configuration for token authentication in kubeconfig format. The API server will query the remote service to determine authentication for bearer tokens. |
|
--authentication-token-webhook-version string Default: "v1beta1" | |
The API version of the authentication.k8s.io TokenReview to send to and expect from the webhook. |
|
--authorization-mode strings Default: "AlwaysAllow" | |
Ordered list of plug-ins to do authorization on secure port. Comma-delimited list of: AlwaysAllow,AlwaysDeny,ABAC,Webhook,RBAC,Node. |
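
For example, instead of the permissive AlwaysAllow default, one frequently used combination of the plug-ins listed above is Node together with RBAC:

kube-apiserver --authorization-mode=Node,RBAC ...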
|
--authorization-policy-file string | |
File with authorization policy in json line by line format, used with --authorization-mode=ABAC, on the secure port. |
|
--authorization-webhook-cache-authorized-ttl duration Default: 5m0s | |
The duration to cache 'authorized' responses from the webhook authorizer. |
|
--authorization-webhook-cache-unauthorized-ttl duration Default: 30s | |
The duration to cache 'unauthorized' responses from the webhook authorizer. |
|
--authorization-webhook-config-file string | |
File with webhook configuration in kubeconfig format, used with --authorization-mode=Webhook. The API server will query the remote service to determine access on the API server's secure port. |
|
--authorization-webhook-version string Default: "v1beta1" | |
The API version of the authorization.k8s.io SubjectAccessReview to send to and expect from the webhook. |
|
--azure-container-registry-config string | |
Path to the file containing Azure container registry configuration information. |
|
--bind-address string Default: 0.0.0.0 | |
The IP address on which to listen for the --secure-port port. The associated interface(s) must be reachable by the rest of the cluster, and by CLI/web clients. If blank or an unspecified address (0.0.0.0 or ::), all interfaces will be used. |
|
--cert-dir string Default: "/var/run/kubernetes" | |
The directory where the TLS certs are located. If --tls-cert-file and --tls-private-key-file are provided, this flag will be ignored. |
|
--client-ca-file string | |
If set, any request presenting a client certificate signed by one of the authorities in the client-ca-file is authenticated with an identity corresponding to the CommonName of the client certificate. |
|
--cloud-config string | |
The path to the cloud provider configuration file. Empty string for no configuration file. |
|
--cloud-provider string | |
The provider for cloud services. Empty string for no provider. |
|
--cloud-provider-gce-l7lb-src-cidrs cidrs Default: 130.211.0.0/22,35.191.0.0/16 | |
CIDRs opened in GCE firewall for L7 LB traffic proxy & health checks |
|
--contention-profiling | |
Enable lock contention profiling, if profiling is enabled |
|
--cors-allowed-origins strings | |
List of allowed origins for CORS, comma separated. An allowed origin can be a regular expression to support subdomain matching. If this list is empty CORS will not be enabled. |
|
--default-not-ready-toleration-seconds int Default: 300 | |
Indicates the tolerationSeconds of the toleration for notReady:NoExecute that is added by default to every pod that does not already have such a toleration. |
|
--default-unreachable-toleration-seconds int Default: 300 | |
Indicates the tolerationSeconds of the toleration for unreachable:NoExecute that is added by default to every pod that does not already have such a toleration. |
|
--default-watch-cache-size int Default: 100 | |
Default watch cache size. If zero, watch cache will be disabled for resources that do not have a default watch size set. |
|
--delete-collection-workers int Default: 1 | |
Number of workers spawned for DeleteCollection call. These are used to speed up namespace cleanup. |
|
--disable-admission-plugins strings | |
admission plugins that should be disabled although they are in the default enabled plugins list (NamespaceLifecycle, LimitRanger, ServiceAccount, TaintNodesByCondition, PodSecurity, Priority, DefaultTolerationSeconds, DefaultStorageClass, StorageObjectInUseProtection, PersistentVolumeClaimResize, RuntimeClass, CertificateApproval, CertificateSigning, CertificateSubjectRestriction, DefaultIngressClass, MutatingAdmissionWebhook, ValidatingAdmissionWebhook, ResourceQuota). Comma-delimited list of admission plugins: AlwaysAdmit, AlwaysDeny, AlwaysPullImages, CertificateApproval, CertificateSigning, CertificateSubjectRestriction, DefaultIngressClass, DefaultStorageClass, DefaultTolerationSeconds, DenyServiceExternalIPs, EventRateLimit, ExtendedResourceToleration, ImagePolicyWebhook, LimitPodHardAntiAffinityTopology, LimitRanger, MutatingAdmissionWebhook, NamespaceAutoProvision, NamespaceExists, NamespaceLifecycle, NodeRestriction, OwnerReferencesPermissionEnforcement, PersistentVolumeClaimResize, PersistentVolumeLabel, PodNodeSelector, PodSecurity, PodSecurityPolicy, PodTolerationRestriction, Priority, ResourceQuota, RuntimeClass, SecurityContextDeny, ServiceAccount, StorageObjectInUseProtection, TaintNodesByCondition, ValidatingAdmissionWebhook. The order of plugins in this flag does not matter. |
|
--disabled-metrics strings | |
This flag provides an escape hatch for misbehaving metrics. You must provide the fully qualified metric name in order to disable it. Disclaimer: disabling metrics is higher in precedence than showing hidden metrics. |
|
--egress-selector-config-file string | |
File with apiserver egress selector configuration. |
|
--enable-admission-plugins strings | |
admission plugins that should be enabled in addition to default enabled ones (NamespaceLifecycle, LimitRanger, ServiceAccount, TaintNodesByCondition, PodSecurity, Priority, DefaultTolerationSeconds, DefaultStorageClass, StorageObjectInUseProtection, PersistentVolumeClaimResize, RuntimeClass, CertificateApproval, CertificateSigning, CertificateSubjectRestriction, DefaultIngressClass, MutatingAdmissionWebhook, ValidatingAdmissionWebhook, ResourceQuota). Comma-delimited list of admission plugins: AlwaysAdmit, AlwaysDeny, AlwaysPullImages, CertificateApproval, CertificateSigning, CertificateSubjectRestriction, DefaultIngressClass, DefaultStorageClass, DefaultTolerationSeconds, DenyServiceExternalIPs, EventRateLimit, ExtendedResourceToleration, ImagePolicyWebhook, LimitPodHardAntiAffinityTopology, LimitRanger, MutatingAdmissionWebhook, NamespaceAutoProvision, NamespaceExists, NamespaceLifecycle, NodeRestriction, OwnerReferencesPermissionEnforcement, PersistentVolumeClaimResize, PersistentVolumeLabel, PodNodeSelector, PodSecurity, PodSecurityPolicy, PodTolerationRestriction, Priority, ResourceQuota, RuntimeClass, SecurityContextDeny, ServiceAccount, StorageObjectInUseProtection, TaintNodesByCondition, ValidatingAdmissionWebhook. The order of plugins in this flag does not matter. |
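
As an illustrative invocation using names from the list above, an extra plugin can be enabled and a default one disabled at the same time (the specific choice of plugins here is arbitrary):

kube-apiserver \
  --enable-admission-plugins=NodeRestriction \
  --disable-admission-plugins=DefaultStorageClass \
  ...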
|
--enable-aggregator-routing | |
Turns on aggregator routing requests to endpoints IP rather than cluster IP. |
|
--enable-bootstrap-token-auth | |
Enable to allow secrets of type 'bootstrap.kubernetes.io/token' in the 'kube-system' namespace to be used for TLS bootstrapping authentication. |
|
--enable-garbage-collector Default: true | |
Enables the generic garbage collector. MUST be synced with the corresponding flag of the kube-controller-manager. |
|
--enable-priority-and-fairness Default: true | |
If true and the APIPriorityAndFairness feature gate is enabled, replace the max-in-flight handler with an enhanced one that queues and dispatches with priority and fairness |
|
--encryption-provider-config string | |
The file containing configuration for encryption providers to be used for storing secrets in etcd |
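
A minimal sketch of such an encryption configuration, encrypting Secrets at rest with AES-CBC while keeping an identity (no-op) provider so that existing unencrypted data can still be read, might look like this (the key name and the base64 secret are placeholders you must generate yourself):

apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - aescbc:
          keys:
            - name: key1  # placeholder key name
              secret: <BASE64-ENCODED-32-BYTE-KEY>
      - identity: {}      # fallback that reads and writes data as-is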
|
--endpoint-reconciler-type string Default: "lease" | |
Use an endpoint reconciler (master-count, lease, none) |
|
--etcd-cafile string | |
SSL Certificate Authority file used to secure etcd communication. |
|
--etcd-certfile string | |
SSL certification file used to secure etcd communication. |
|
--etcd-compaction-interval duration Default: 5m0s | |
The interval of compaction requests. If 0, the compaction request from apiserver is disabled. |
|
--etcd-count-metric-poll-period duration Default: 1m0s | |
Frequency of polling etcd for number of resources per type. 0 disables the metric collection. |
|
--etcd-db-metric-poll-interval duration Default: 30s | |
The interval of requests to poll etcd and update metric. 0 disables the metric collection |
|
--etcd-healthcheck-timeout duration Default: 2s | |
The timeout to use when checking etcd health. |
|
--etcd-keyfile string | |
SSL key file used to secure etcd communication. |
|
--etcd-prefix string Default: "/registry" | |
The prefix to prepend to all resource paths in etcd. |
|
--etcd-servers strings | |
List of etcd servers to connect with (scheme://ip:port), comma separated. |
|
--etcd-servers-overrides strings | |
Per-resource etcd servers overrides, comma separated. The individual override format: group/resource#servers, where servers are URLs, semicolon separated. Note that this applies only to resources compiled into this server binary. |
|
--event-ttl duration Default: 1h0m0s | |
Amount of time to retain events. |
|
--experimental-logging-sanitization | |
[Experimental] When enabled prevents logging of fields tagged as sensitive (passwords, keys, tokens). |
|
--external-hostname string | |
The hostname to use when generating externalized URLs for this master (e.g. Swagger API Docs or OpenID Discovery). |
|
--feature-gates <comma-separated 'key=True|False' pairs> | |
A set of key=value pairs that describe feature gates for alpha/experimental features. Options are: |
|
--goaway-chance float | |
To prevent HTTP/2 clients from getting stuck on a single apiserver, randomly close a connection (GOAWAY). The client's other in-flight requests won't be affected, and the client will reconnect, likely landing on a different apiserver after going through the load balancer again. This argument sets the fraction of requests that will be sent a GOAWAY. Clusters with single apiservers, or which don't use a load balancer, should NOT enable this. Min is 0 (off), Max is .02 (1/50 requests); .001 (1/1000) is a recommended starting point. |
|
-h, --help | |
help for kube-apiserver |
|
--http2-max-streams-per-connection int | |
The limit that the server gives to clients for the maximum number of streams in an HTTP/2 connection. Zero means to use golang's default. |
|
--identity-lease-duration-seconds int Default: 3600 | |
The duration of kube-apiserver lease in seconds, must be a positive number. (In use when the APIServerIdentity feature gate is enabled.) |
|
--identity-lease-renew-interval-seconds int Default: 10 | |
The interval of kube-apiserver renewing its lease in seconds, must be a positive number. (In use when the APIServerIdentity feature gate is enabled.) |
|
--kubelet-certificate-authority string | |
Path to a cert file for the certificate authority. |
|
--kubelet-client-certificate string | |
Path to a client cert file for TLS. |
|
--kubelet-client-key string | |
Path to a client key file for TLS. |
|
--kubelet-preferred-address-types strings Default: "Hostname,InternalDNS,InternalIP,ExternalDNS,ExternalIP" | |
List of the preferred NodeAddressTypes to use for kubelet connections. |
|
--kubelet-timeout duration Default: 5s | |
Timeout for kubelet operations. |
|
--kubernetes-service-node-port int | |
If non-zero, the Kubernetes master service (which apiserver creates/maintains) will be of type NodePort, using this as the value of the port. If zero, the Kubernetes master service will be of type ClusterIP. |
|
--lease-reuse-duration-seconds int Default: 60 | |
The time in seconds that each lease is reused. A lower value could avoid a large number of objects reusing the same lease. Note that a value that is too small may cause performance problems at the storage layer. |
|
--livez-grace-period duration | |
This option represents the maximum amount of time it should take for apiserver to complete its startup sequence and become live. From apiserver's start time to when this amount of time has elapsed, /livez will assume that unfinished post-start hooks will complete successfully and therefore return true. |
|
--log-flush-frequency duration Default: 5s | |
Maximum number of seconds between log flushes |
|
--logging-format string Default: "text" | |
Sets the log format. Permitted formats: "text". |
|
--master-service-namespace string Default: "default" | |
DEPRECATED: the namespace from which the Kubernetes master services should be injected into pods. |
|
--max-connection-bytes-per-sec int | |
If non-zero, throttle each user connection to this number of bytes/sec. Currently only applies to long-running requests. |
|
--max-mutating-requests-inflight int Default: 200 | |
This and --max-requests-inflight are summed to determine the server's total concurrency limit (which must be positive) if --enable-priority-and-fairness is true. Otherwise, this flag limits the maximum number of mutating requests in flight, or a zero value disables the limit completely. |
|
--max-requests-inflight int Default: 400 | |
This and --max-mutating-requests-inflight are summed to determine the server's total concurrency limit (which must be positive) if --enable-priority-and-fairness is true. Otherwise, this flag limits the maximum number of non-mutating requests in flight, or a zero value disables the limit completely. |
|
--min-request-timeout int Default: 1800 | |
An optional field indicating the minimum number of seconds a handler must keep a request open before timing it out. Currently only honored by the watch request handler, which picks a randomized value above this number as the connection timeout, to spread out load. |
|
--oidc-ca-file string | |
If set, the OpenID server's certificate will be verified by one of the authorities in the oidc-ca-file, otherwise the host's root CA set will be used. |
|
--oidc-client-id string | |
The client ID for the OpenID Connect client, must be set if oidc-issuer-url is set. |
|
--oidc-groups-claim string | |
If provided, the name of a custom OpenID Connect claim for specifying user groups. The claim value is expected to be a string or array of strings. This flag is experimental, please see the authentication documentation for further details. |
|
--oidc-groups-prefix string | |
If provided, all groups will be prefixed with this value to prevent conflicts with other authentication strategies. |
|
--oidc-issuer-url string | |
The URL of the OpenID issuer, only HTTPS scheme will be accepted. If set, it will be used to verify the OIDC JSON Web Token (JWT). |
|
--oidc-required-claim <comma-separated 'key=value' pairs> | |
A key=value pair that describes a required claim in the ID Token. If set, the claim is verified to be present in the ID Token with a matching value. Repeat this flag to specify multiple claims. |
|
--oidc-signing-algs strings Default: "RS256" | |
Comma-separated list of allowed JOSE asymmetric signing algorithms. JWTs with a supported 'alg' header value are accepted: RS256, RS384, RS512, ES256, ES384, ES512, PS256, PS384, PS512. Values are defined by RFC 7518 https://tools.ietf.org/html/rfc7518#section-3.1. |
|
--oidc-username-claim string Default: "sub" | |
The OpenID claim to use as the user name. Note that claims other than the default ('sub') are not guaranteed to be unique and immutable. This flag is experimental, please see the authentication documentation for further details. |
|
--oidc-username-prefix string | |
If provided, all usernames will be prefixed with this value. If not provided, username claims other than 'email' are prefixed by the issuer URL to avoid clashes. To skip any prefixing, provide the value '-'. |
|
--permit-address-sharing | |
If true, SO_REUSEADDR will be used when binding the port. This allows binding to wildcard IPs like 0.0.0.0 and specific IPs in parallel, and it avoids waiting for the kernel to release sockets in TIME_WAIT state. [default=false] |
|
--permit-port-sharing | |
If true, SO_REUSEPORT will be used when binding the port, which allows more than one instance to bind on the same address and port. [default=false] |
|
--profiling Default: true | |
Enable profiling via web interface host:port/debug/pprof/ |
|
--proxy-client-cert-file string | |
Client certificate used to prove the identity of the aggregator or kube-apiserver when it must call out during a request. This includes proxying requests to a user api-server and calling out to webhook admission plugins. It is expected that this cert includes a signature from the CA in the --requestheader-client-ca-file flag. That CA is published in the 'extension-apiserver-authentication' configmap in the kube-system namespace. Components receiving calls from kube-aggregator should use that CA to perform their half of the mutual TLS verification. |
|
--proxy-client-key-file string | |
Private key for the client certificate used to prove the identity of the aggregator or kube-apiserver when it must call out during a request. This includes proxying requests to a user api-server and calling out to webhook admission plugins. |
|
--request-timeout duration Default: 1m0s | |
An optional field indicating the duration a handler must keep a request open before timing it out. This is the default request timeout for requests but may be overridden by flags such as --min-request-timeout for specific types of requests. |
|
--requestheader-allowed-names strings | |
List of client certificate common names to allow to provide usernames in headers specified by --requestheader-username-headers. If empty, any client certificate validated by the authorities in --requestheader-client-ca-file is allowed. |
|
--requestheader-client-ca-file string | |
Root certificate bundle to use to verify client certificates on incoming requests before trusting usernames in headers specified by --requestheader-username-headers. WARNING: generally do not depend on authorization being already done for incoming requests. |
|
--requestheader-extra-headers-prefix strings | |
List of request header prefixes to inspect. X-Remote-Extra- is suggested. |
|
--requestheader-group-headers strings | |
List of request headers to inspect for groups. X-Remote-Group is suggested. |
|
--requestheader-username-headers strings | |
List of request headers to inspect for usernames. X-Remote-User is common. |
|
--runtime-config <comma-separated 'key=value' pairs> | |
A set of key=value pairs that enable or disable built-in APIs. Supported options are: |
|
--secure-port int Default: 6443 | |
The port on which to serve HTTPS with authentication and authorization. It cannot be switched off with 0. |
|
--service-account-extend-token-expiration Default: true | |
Turns on projected service account expiration extension during token generation, which helps safe transition from the legacy token to the bound service account token feature. If this flag is enabled, admission-injected tokens would be extended up to 1 year to prevent unexpected failure during transition, ignoring the value of service-account-max-token-expiration. |
|
--service-account-issuer strings | |
Identifier of the service account token issuer. The issuer will assert this identifier in "iss" claim of issued tokens. This value is a string or URI. If this option is not a valid URI per the OpenID Discovery 1.0 spec, the ServiceAccountIssuerDiscovery feature will remain disabled, even if the feature gate is set to true. It is highly recommended that this value comply with the OpenID spec: https://openid.net/specs/openid-connect-discovery-1_0.html. In practice, this means that service-account-issuer must be an https URL. It is also highly recommended that this URL be capable of serving OpenID discovery documents at {service-account-issuer}/.well-known/openid-configuration. When this flag is specified multiple times, the first is used to generate tokens and all are used to determine which issuers are accepted. |
|
--service-account-jwks-uri string | |
Overrides the URI for the JSON Web Key Set in the discovery doc served at /.well-known/openid-configuration. This flag is useful if the discovery doc and key set are served to relying parties from a URL other than the API server's external address (as auto-detected or overridden with --external-hostname). |
|
--service-account-key-file strings | |
File containing PEM-encoded x509 RSA or ECDSA private or public keys, used to verify ServiceAccount tokens. The specified file can contain multiple keys, and the flag can be specified multiple times with different files. If unspecified, --tls-private-key-file is used. Must be specified when --service-account-signing-key-file is provided. |
|
--service-account-lookup Default: true | |
If true, validate ServiceAccount tokens exist in etcd as part of authentication. |
|
--service-account-max-token-expiration duration | |
The maximum validity duration of a token created by the service account token issuer. If an otherwise valid TokenRequest with a validity duration larger than this value is requested, a token will be issued with a validity duration of this value. |
|
--service-account-signing-key-file string | |
Path to the file that contains the current private key of the service account token issuer. The issuer will sign issued ID tokens with this private key. |
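
Putting the service account token flags together, a hedged sketch of a typical configuration (the file paths are illustrative, loosely following common kubeadm layouts) is:

kube-apiserver \
  --service-account-issuer=https://kubernetes.default.svc \
  --service-account-signing-key-file=/etc/kubernetes/pki/sa.key \
  --service-account-key-file=/etc/kubernetes/pki/sa.pub \
  ...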
|
--service-cluster-ip-range string | |
A CIDR notation IP range from which to assign service cluster IPs. This must not overlap with any IP ranges assigned to nodes or pods. Max of two dual-stack CIDRs is allowed. |
|
--service-node-port-range <a string in the form 'N1-N2'> Default: 30000-32767 | |
A port range to reserve for services with NodePort visibility. Example: '30000-32767'. Inclusive at both ends of the range. |
|
--show-hidden-metrics-for-version string | |
The previous version for which you want to show hidden metrics. Only the previous minor version is meaningful, other values will not be allowed. The format is <major>.<minor>, e.g.: '1.16'. The purpose of this format is to make sure you have the opportunity to notice if the next release hides additional metrics, rather than being surprised when they are permanently removed in the release after that. |
|
--shutdown-delay-duration duration | |
Time to delay the termination. During that time the server keeps serving requests normally. The endpoints /healthz and /livez will return success, but /readyz immediately returns failure. Graceful termination starts after this delay has elapsed. This can be used to allow a load balancer to stop sending traffic to this server. |
|
--shutdown-send-retry-after | |
If true, the HTTP server will continue listening until all non-long-running requests in flight have been drained; during this window all incoming requests will be rejected with status code 429 and a 'Retry-After' response header. In addition, a 'Connection: close' response header is set in order to tear down the TCP connection when idle. |
|
--storage-backend string | |
The storage backend for persistence. Options: 'etcd3' (default). |
|
--storage-media-type string Default: "application/vnd.kubernetes.protobuf" | |
The media type to use to store objects in storage. Some resources or storage backends may only support a specific media type and will ignore this setting. |
|
--strict-transport-security-directives strings | |
List of directives for HSTS, comma separated. If this list is empty, then HSTS directives will not be added. Example: 'max-age=31536000,includeSubDomains,preload' |
|
--tls-cert-file string | |
File containing the default x509 Certificate for HTTPS. (CA cert, if any, concatenated after server cert). If HTTPS serving is enabled, and --tls-cert-file and --tls-private-key-file are not provided, a self-signed certificate and key are generated for the public address and saved to the directory specified by --cert-dir. |
|
--tls-cipher-suites strings | |
Comma-separated list of cipher suites for the server. If omitted, the default Go cipher suites will be used. |
|
--tls-min-version string | |
Minimum TLS version supported. Possible values: VersionTLS10, VersionTLS11, VersionTLS12, VersionTLS13 |
|
--tls-private-key-file string | |
File containing the default x509 private key matching --tls-cert-file. |
|
--tls-sni-cert-key string | |
A pair of x509 certificate and private key file paths, optionally suffixed with a list of domain patterns which are fully qualified domain names, possibly with prefixed wildcard segments. The domain patterns also allow IP addresses, but IPs should only be used if the apiserver has visibility to the IP address requested by a client. If no domain patterns are provided, the names of the certificate are extracted. Non-wildcard matches trump over wildcard matches, explicit domain patterns trump over extracted names. For multiple key/certificate pairs, use the --tls-sni-cert-key multiple times. Examples: "example.crt,example.key" or "foo.crt,foo.key:*.foo.com,foo.com". |
|
--token-auth-file string | |
If set, the file that will be used to secure the secure port of the API server via token authentication. |
|
--tracing-config-file string | |
File with apiserver tracing configuration. |
|
-v, --v int | |
number for the log level verbosity |
|
--version version[=true] | |
Print version information and quit |
|
--vmodule pattern=N,... | |
comma-separated list of pattern=N settings for file-filtered logging (only works for text log format) |
|
--watch-cache Default: true | |
Enable watch caching in the apiserver |
|
--watch-cache-sizes strings | |
Watch cache size settings for some resources (pods, nodes, etc.), comma separated. The individual setting format: resource[.group]#size, where resource is lowercase plural (no version), group is omitted for resources of apiVersion v1 (the legacy core API) and included for others, and size is a number. It takes effect when watch-cache is enabled. Some resources (replicationcontrollers, endpoints, nodes, pods, services, apiservices.apiregistration.k8s.io) have system defaults set by heuristics, others default to default-watch-cache-size |
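
Following the resource[.group]#size format described above, an illustrative setting (the sizes are arbitrary) could be:

kube-apiserver --watch-cache-sizes=pods#1000,nodes#500,apiservices.apiregistration.k8s.io#10 ...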
6.11.4 - kube-controller-manager
Synopsis
The Kubernetes controller manager is a daemon that embeds the core control loops shipped with Kubernetes. In applications of robotics and automation, a control loop is a non-terminating loop that regulates the state of the system. In Kubernetes, a controller is a control loop that watches the shared state of the cluster through the apiserver and makes changes attempting to move the current state towards the desired state. Examples of controllers that ship with Kubernetes today are the replication controller, endpoints controller, namespace controller, and serviceaccounts controller.
kube-controller-manager [flags]
Options
--allocate-node-cidrs | |
Whether CIDRs for Pods should be allocated and set on the cloud provider. |
|
--allow-metric-labels stringToString Default: [] | |
The map from metric-label to value allow-list of this label. The key's format is <MetricName>,<LabelName>. The value's format is <allowed_value>,<allowed_value>... e.g. metric1,label1='v1,v2,v3', metric1,label2='v1,v2,v3', metric2,label1='v1,v2,v3'. |
|
--attach-detach-reconcile-sync-period duration Default: 1m0s | |
The reconciler sync wait time between volume attach/detach operations. This duration must be larger than one second, and increasing this value from the default may allow volumes to be mismatched with pods. |
|
--authentication-kubeconfig string | |
kubeconfig file pointing at the 'core' kubernetes server with enough rights to create tokenreviews.authentication.k8s.io. This is optional. If empty, all token requests are considered to be anonymous and no client CA is looked up in the cluster. |
|
--authentication-skip-lookup | |
If false, the authentication-kubeconfig will be used to look up missing authentication configuration from the cluster. |
|
--authentication-token-webhook-cache-ttl duration Default: 10s | |
The duration to cache responses from the webhook token authenticator. |
|
--authentication-tolerate-lookup-failure | |
If true, failures to look up missing authentication configuration from the cluster are not considered fatal. Note that this can result in authentication that treats all requests as anonymous. |
|
--authorization-always-allow-paths strings Default: "/healthz,/readyz,/livez" | |
A list of HTTP paths to skip during authorization, i.e. these are authorized without contacting the 'core' kubernetes server. |
|
--authorization-kubeconfig string | |
kubeconfig file pointing at the 'core' kubernetes server with enough rights to create subjectaccessreviews.authorization.k8s.io. This is optional. If empty, all requests not skipped by authorization are forbidden. |
|
--authorization-webhook-cache-authorized-ttl duration Default: 10s | |
The duration to cache 'authorized' responses from the webhook authorizer. |
|
--authorization-webhook-cache-unauthorized-ttl duration Default: 10s | |
The duration to cache 'unauthorized' responses from the webhook authorizer. |
|
--azure-container-registry-config string | |
Path to the file containing Azure container registry configuration information. |
|
--bind-address string Default: 0.0.0.0 | |
The IP address on which to listen for the --secure-port port. The associated interface(s) must be reachable by the rest of the cluster, and by CLI/web clients. If blank or an unspecified address (0.0.0.0 or ::), all interfaces will be used. |
|
--cert-dir string | |
The directory where the TLS certs are located. If --tls-cert-file and --tls-private-key-file are provided, this flag will be ignored. |
|
--cidr-allocator-type string Default: "RangeAllocator" | |
Type of CIDR allocator to use |
|
--client-ca-file string | |
If set, any request presenting a client certificate signed by one of the authorities in the client-ca-file is authenticated with an identity corresponding to the CommonName of the client certificate. |
|
--cloud-config string | |
The path to the cloud provider configuration file. Empty string for no configuration file. |
|
--cloud-provider string | |
The provider for cloud services. Empty string for no provider. |
|
--cluster-cidr string | |
CIDR Range for Pods in cluster. Requires --allocate-node-cidrs to be true |
|
--cluster-name string Default: "kubernetes" | |
The instance prefix for the cluster. |
|
--cluster-signing-cert-file string | |
Filename containing a PEM-encoded X509 CA certificate used to issue cluster-scoped certificates. If specified, no more specific --cluster-signing-* flag may be specified. |
|
--cluster-signing-duration duration Default: 8760h0m0s | |
The maximum length of duration that signed certificates will be given. Individual CSRs may request shorter certs by setting spec.expirationSeconds. |
|
--cluster-signing-key-file string | |
Filename containing a PEM-encoded RSA or ECDSA private key used to sign cluster-scoped certificates. If specified, no more specific --cluster-signing-* flag may be specified. |
|
--cluster-signing-kube-apiserver-client-cert-file string | |
Filename containing a PEM-encoded X509 CA certificate used to issue certificates for the kubernetes.io/kube-apiserver-client signer. If specified, --cluster-signing-{cert,key}-file must not be set. |
|
--cluster-signing-kube-apiserver-client-key-file string | |
Filename containing a PEM-encoded RSA or ECDSA private key used to sign certificates for the kubernetes.io/kube-apiserver-client signer. If specified, --cluster-signing-{cert,key}-file must not be set. |
|
--cluster-signing-kubelet-client-cert-file string | |
Filename containing a PEM-encoded X509 CA certificate used to issue certificates for the kubernetes.io/kube-apiserver-client-kubelet signer. If specified, --cluster-signing-{cert,key}-file must not be set. |
|
--cluster-signing-kubelet-client-key-file string | |
Filename containing a PEM-encoded RSA or ECDSA private key used to sign certificates for the kubernetes.io/kube-apiserver-client-kubelet signer. If specified, --cluster-signing-{cert,key}-file must not be set. |
|
--cluster-signing-kubelet-serving-cert-file string | |
Filename containing a PEM-encoded X509 CA certificate used to issue certificates for the kubernetes.io/kubelet-serving signer. If specified, --cluster-signing-{cert,key}-file must not be set. |
|
--cluster-signing-kubelet-serving-key-file string | |
Filename containing a PEM-encoded RSA or ECDSA private key used to sign certificates for the kubernetes.io/kubelet-serving signer. If specified, --cluster-signing-{cert,key}-file must not be set. |
|
--cluster-signing-legacy-unknown-cert-file string | |
Filename containing a PEM-encoded X509 CA certificate used to issue certificates for the kubernetes.io/legacy-unknown signer. If specified, --cluster-signing-{cert,key}-file must not be set. |
|
--cluster-signing-legacy-unknown-key-file string | |
Filename containing a PEM-encoded RSA or ECDSA private key used to sign certificates for the kubernetes.io/legacy-unknown signer. If specified, --cluster-signing-{cert,key}-file must not be set. |
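
Tying the basic signing flags together, a sketch using the single-CA form (the paths are illustrative; recall that the per-signer flags above must not be combined with --cluster-signing-{cert,key}-file) would be:

kube-controller-manager \
  --cluster-signing-cert-file=/etc/kubernetes/pki/ca.crt \
  --cluster-signing-key-file=/etc/kubernetes/pki/ca.key \
  --cluster-signing-duration=8760h \
  ...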
|
--concurrent-deployment-syncs int32 Default: 5 | |
The number of deployment objects that are allowed to sync concurrently. Larger number = more responsive deployments, but more CPU (and network) load |
|
--concurrent-endpoint-syncs int32 Default: 5 | |
The number of endpoint syncing operations that will be done concurrently. Larger number = faster endpoint updating, but more CPU (and network) load |
|
--concurrent-ephemeralvolume-syncs int32 Default: 5 | |
The number of ephemeral volume syncing operations that will be done concurrently. Larger number = faster ephemeral volume updating, but more CPU (and network) load |
|
--concurrent-gc-syncs int32 Default: 20 | |
The number of garbage collector workers that are allowed to sync concurrently. |
|
--concurrent-namespace-syncs int32 Default: 10 | |
The number of namespace objects that are allowed to sync concurrently. Larger number = more responsive namespace termination, but more CPU (and network) load |
|
--concurrent-rc-syncs int32 Default: 5 | |
The number of replication controllers that are allowed to sync concurrently. Larger number = more responsive replica management, but more CPU (and network) load |
|
--concurrent-replicaset-syncs int32 Default: 5 | |
The number of replica sets that are allowed to sync concurrently. Larger number = more responsive replica management, but more CPU (and network) load |
|
--concurrent-resource-quota-syncs int32 Default: 5 | |
The number of resource quotas that are allowed to sync concurrently. Larger number = more responsive quota management, but more CPU (and network) load |
|
--concurrent-service-endpoint-syncs int32 Default: 5 | |
The number of service endpoint syncing operations that will be done concurrently. Larger number = faster endpoint slice updating, but more CPU (and network) load. Defaults to 5. |
|
--concurrent-service-syncs int32 Default: 1 | |
The number of services that are allowed to sync concurrently. Larger number = more responsive service management, but more CPU (and network) load |
|
--concurrent-serviceaccount-token-syncs int32 Default: 5 | |
The number of service account token objects that are allowed to sync concurrently. Larger number = more responsive token generation, but more CPU (and network) load |
|
--concurrent-statefulset-syncs int32 Default: 5 | |
The number of statefulset objects that are allowed to sync concurrently. Larger number = more responsive statefulsets, but more CPU (and network) load |
|
--concurrent-ttl-after-finished-syncs int32 Default: 5 | |
The number of TTL-after-finished controller workers that are allowed to sync concurrently. |
|
--configure-cloud-routes Default: true | |
Whether CIDRs allocated by --allocate-node-cidrs should be configured on the cloud provider. |
|
--contention-profiling | |
Enable lock contention profiling, if profiling is enabled |
|
--controller-start-interval duration | |
Interval between starting controller managers. |
|
--controllers strings Default: "*" | |
A list of controllers to enable. '*' enables all on-by-default controllers, 'foo' enables the controller named 'foo', '-foo' disables the controller named 'foo'. |
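
For instance, to keep all on-by-default controllers but disable a single one (tokencleaner is used here only as an example of a controller name):

kube-controller-manager --controllers='*,-tokencleaner' ...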
|
--disable-attach-detach-reconcile-sync | |
Disable volume attach detach reconciler sync. Disabling this may cause volumes to be mismatched with pods. Use wisely. |
|
--disabled-metrics strings | |
This flag provides an escape hatch for misbehaving metrics. You must provide the fully qualified metric name in order to disable it. Disclaimer: disabling metrics is higher in precedence than showing hidden metrics. |
|
--enable-dynamic-provisioning Default: true | |
Enable dynamic provisioning for environments that support it. |
|
--enable-garbage-collector Default: true | |
Enables the generic garbage collector. MUST be synced with the corresponding flag of the kube-apiserver. |
|
--enable-hostpath-provisioner | |
Enable HostPath PV provisioning when running without a cloud provider. This allows testing and development of provisioning features. HostPath provisioning is not supported in any way, won't work in a multi-node cluster, and should not be used for anything other than testing or development. |
|
--enable-leader-migration | |
Whether to enable controller leader migration. |
|
--enable-taint-manager Default: true | |
WARNING: Beta feature. If set to true, enables NoExecute Taints and will evict all non-tolerating Pods running on Nodes tainted with this kind of Taint. |
|
--endpoint-updates-batch-period duration | |
The length of the endpoint updates batching period. Processing of pod changes will be delayed by this duration to join them with potential upcoming updates and reduce the overall number of endpoints updates. Larger number = higher endpoint programming latency, but lower number of endpoint revisions generated |
|
--endpointslice-updates-batch-period duration | |
The length of the endpoint slice updates batching period. Processing of pod changes will be delayed by this duration to join them with potential upcoming updates and reduce the overall number of endpoints updates. Larger number = higher endpoint programming latency, but lower number of endpoint revisions generated |
|
--experimental-logging-sanitization | |
[Experimental] When enabled prevents logging of fields tagged as sensitive (passwords, keys, tokens). |
|
--external-cloud-volume-plugin string | |
The plugin to use when cloud provider is set to external. Can be empty, should only be set when cloud-provider is external. Currently used to allow node and volume controllers to work for in tree cloud providers. |
|
--feature-gates <comma-separated 'key=True|False' pairs> | |
A set of key=value pairs that describe feature gates for alpha/experimental features. Options are: |
|
--flex-volume-plugin-dir string Default: "/usr/libexec/kubernetes/kubelet-plugins/volume/exec/" | |
Full path of the directory in which the flex volume plugin should search for additional third party volume plugins. |
|
-h, --help | |
help for kube-controller-manager |
|
--horizontal-pod-autoscaler-cpu-initialization-period duration Default: 5m0s | |
The period after pod start when CPU samples might be skipped. |
|
--horizontal-pod-autoscaler-downscale-stabilization duration Default: 5m0s | |
The period for which autoscaler will look backwards and not scale down below any recommendation it made during that period. |
|
--horizontal-pod-autoscaler-initial-readiness-delay duration Default: 30s | |
The period after pod start during which readiness changes will be treated as initial readiness. |
|
--horizontal-pod-autoscaler-sync-period duration Default: 15s | |
The period for syncing the number of pods in horizontal pod autoscaler. |
|
--horizontal-pod-autoscaler-tolerance float Default: 0.1 | |
The minimum change (from 1.0) in the desired-to-actual metrics ratio for the horizontal pod autoscaler to consider scaling. |
|
--http2-max-streams-per-connection int | |
The limit that the server gives to clients for the maximum number of streams in an HTTP/2 connection. Zero means to use golang's default. |
|
--kube-api-burst int32 Default: 30 | |
Burst to use while talking with kubernetes apiserver. |
|
--kube-api-content-type string Default: "application/vnd.kubernetes.protobuf" | |
Content type of requests sent to apiserver. |
|
--kube-api-qps float Default: 20 | |
QPS to use while talking with kubernetes apiserver. |
|
--kubeconfig string | |
Path to kubeconfig file with authorization and master location information. |
|
--large-cluster-size-threshold int32 Default: 50 | |
Number of nodes from which NodeController treats the cluster as large for the eviction logic purposes. --secondary-node-eviction-rate is implicitly overridden to 0 for clusters this size or smaller. |
|
--leader-elect Default: true | |
Start a leader election client and gain leadership before executing the main loop. Enable this when running replicated components for high availability. |
|
--leader-elect-lease-duration duration Default: 15s | |
The duration that non-leader candidates will wait after observing a leadership renewal until attempting to acquire leadership of a led but unrenewed leader slot. This is effectively the maximum duration that a leader can be stopped before it is replaced by another candidate. This is only applicable if leader election is enabled. |
|
--leader-elect-renew-deadline duration Default: 10s | |
The interval between attempts by the acting master to renew a leadership slot before it stops leading. This must be less than or equal to the lease duration. This is only applicable if leader election is enabled. |
|
--leader-elect-resource-lock string Default: "leases" | |
The type of resource object that is used for locking during leader election. Supported options are 'endpoints', 'configmaps', 'leases', 'endpointsleases' and 'configmapsleases'. |
|
--leader-elect-resource-name string Default: "kube-controller-manager" | |
The name of resource object that is used for locking during leader election. |
|
--leader-elect-resource-namespace string Default: "kube-system" | |
The namespace of resource object that is used for locking during leader election. |
|
--leader-elect-retry-period duration Default: 2s | |
The duration the clients should wait between attempting acquisition and renewal of a leadership. This is only applicable if leader election is enabled. |
|
--leader-migration-config string | |
Path to the config file for controller leader migration, or empty to use the value that reflects default configuration of the controller manager. The config file should be of type LeaderMigrationConfiguration, group controllermanager.config.k8s.io, version v1alpha1. |
|
--log-flush-frequency duration Default: 5s | |
Maximum number of seconds between log flushes |
|
--logging-format string Default: "text" | |
Sets the log format. Permitted formats: "text". |
|
--master string | |
The address of the Kubernetes API server (overrides any value in kubeconfig). |
|
--max-endpoints-per-slice int32 Default: 100 | |
The maximum number of endpoints that will be added to an EndpointSlice. More endpoints per slice will result in fewer endpoint slices, but larger resources. |
|
--min-resync-period duration Default: 12h0m0s | |
The resync period in reflectors will be random between MinResyncPeriod and 2*MinResyncPeriod. |
|
--mirroring-concurrent-service-endpoint-syncs int32 Default: 5 | |
The number of service endpoint syncing operations that will be done concurrently by the EndpointSliceMirroring controller. Larger number = faster endpoint slice updating, but more CPU (and network) load. |
|
--mirroring-endpointslice-updates-batch-period duration | |
The length of the EndpointSlice updates batching period for the EndpointSliceMirroring controller. Processing of EndpointSlice changes will be delayed by this duration to join them with potential upcoming updates and reduce the overall number of EndpointSlice updates. Larger number = higher endpoint programming latency, but lower number of endpoint revisions generated |
|
--mirroring-max-endpoints-per-subset int32 Default: 1000 | |
The maximum number of endpoints that will be added to an EndpointSlice by the EndpointSliceMirroring controller. More endpoints per slice will result in fewer endpoint slices, but larger resources. |
|
--namespace-sync-period duration Default: 5m0s | |
The period for syncing namespace life-cycle updates |
|
--node-cidr-mask-size int32 | |
Mask size for node cidr in cluster. Default is 24 for IPv4 and 64 for IPv6. |
|
--node-cidr-mask-size-ipv4 int32 | |
Mask size for IPv4 node cidr in dual-stack cluster. Default is 24. |
|
--node-cidr-mask-size-ipv6 int32 | |
Mask size for IPv6 node cidr in dual-stack cluster. Default is 64. |
|
--node-eviction-rate float Default: 0.1 | |
Number of nodes per second on which pods are deleted in case of node failure when a zone is healthy (see --unhealthy-zone-threshold for definition of healthy/unhealthy). Zone refers to entire cluster in non-multizone clusters. |
|
--node-monitor-grace-period duration Default: 40s | |
Amount of time which we allow running Node to be unresponsive before marking it unhealthy. Must be N times more than kubelet's nodeStatusUpdateFrequency, where N means number of retries allowed for kubelet to post node status. |
|
--node-monitor-period duration Default: 5s | |
The period for syncing NodeStatus in NodeController. |
|
--node-startup-grace-period duration Default: 1m0s | |
Amount of time which we allow starting Node to be unresponsive before marking it unhealthy. |
|
--permit-address-sharing | |
If true, SO_REUSEADDR will be used when binding the port. This allows binding to wildcard IPs like 0.0.0.0 and specific IPs in parallel, and it avoids waiting for the kernel to release sockets in TIME_WAIT state. [default=false] |
|
--permit-port-sharing | |
If true, SO_REUSEPORT will be used when binding the port, which allows more than one instance to bind on the same address and port. [default=false] |
|
--pod-eviction-timeout duration Default: 5m0s | |
The grace period for deleting pods on failed nodes. |
|
--profiling Default: true | |
Enable profiling via web interface host:port/debug/pprof/ |
|
--pv-recycler-increment-timeout-nfs int32 Default: 30 | |
the increment of time added per Gi to ActiveDeadlineSeconds for an NFS scrubber pod |
|
--pv-recycler-minimum-timeout-hostpath int32 Default: 60 | |
The minimum ActiveDeadlineSeconds to use for a HostPath Recycler pod. This is for development and testing only and will not work in a multi-node cluster. |
|
--pv-recycler-minimum-timeout-nfs int32 Default: 300 | |
The minimum ActiveDeadlineSeconds to use for an NFS Recycler pod |
|
--pv-recycler-pod-template-filepath-hostpath string | |
The file path to a pod definition used as a template for HostPath persistent volume recycling. This is for development and testing only and will not work in a multi-node cluster. |
|
--pv-recycler-pod-template-filepath-nfs string | |
The file path to a pod definition used as a template for NFS persistent volume recycling |
|
--pv-recycler-timeout-increment-hostpath int32 Default: 30 | |
the increment of time added per Gi to ActiveDeadlineSeconds for a HostPath scrubber pod. This is for development and testing only and will not work in a multi-node cluster. |
|
--pvclaimbinder-sync-period duration Default: 15s | |
The period for syncing persistent volumes and persistent volume claims |
|
--requestheader-allowed-names strings | |
List of client certificate common names to allow to provide usernames in headers specified by --requestheader-username-headers. If empty, any client certificate validated by the authorities in --requestheader-client-ca-file is allowed. |
|
--requestheader-client-ca-file string | |
Root certificate bundle to use to verify client certificates on incoming requests before trusting usernames in headers specified by --requestheader-username-headers. WARNING: generally do not depend on authorization being already done for incoming requests. |
|
--requestheader-extra-headers-prefix strings Default: "x-remote-extra-" | |
List of request header prefixes to inspect. X-Remote-Extra- is suggested. |
|
--requestheader-group-headers strings Default: "x-remote-group" | |
List of request headers to inspect for groups. X-Remote-Group is suggested. |
|
--requestheader-username-headers strings Default: "x-remote-user" | |
List of request headers to inspect for usernames. X-Remote-User is common. |
|
--resource-quota-sync-period duration Default: 5m0s | |
The period for syncing quota usage status in the system |
|
--root-ca-file string | |
If set, this root certificate authority will be included in service account's token secret. This must be a valid PEM-encoded CA bundle. |
|
--route-reconciliation-period duration Default: 10s | |
The period for reconciling routes created for Nodes by cloud provider. |
|
--secondary-node-eviction-rate float Default: 0.01 | |
Number of nodes per second on which pods are deleted in case of node failure when a zone is unhealthy (see --unhealthy-zone-threshold for definition of healthy/unhealthy). Zone refers to entire cluster in non-multizone clusters. This value is implicitly overridden to 0 if the cluster size is smaller than --large-cluster-size-threshold. |
|
--secure-port int Default: 10257 | |
The port on which to serve HTTPS with authentication and authorization. If 0, don't serve HTTPS at all. |
|
--service-account-private-key-file string | |
Filename containing a PEM-encoded private RSA or ECDSA key used to sign service account tokens. |
|
--service-cluster-ip-range string | |
CIDR Range for Services in cluster. Requires --allocate-node-cidrs to be true |
|
--show-hidden-metrics-for-version string | |
The previous version for which you want to show hidden metrics. Only the previous minor version is meaningful, other values will not be allowed. The format is <major>.<minor>, e.g.: '1.16'. The purpose of this format is to make sure you have the opportunity to notice if the next release hides additional metrics, rather than being surprised when they are permanently removed in the release after that. |
|
--terminated-pod-gc-threshold int32 Default: 12500 | |
Number of terminated pods that can exist before the terminated pod garbage collector starts deleting terminated pods. If <= 0, the terminated pod garbage collector is disabled. |
|
--tls-cert-file string | |
File containing the default x509 Certificate for HTTPS. (CA cert, if any, concatenated after server cert). If HTTPS serving is enabled, and --tls-cert-file and --tls-private-key-file are not provided, a self-signed certificate and key are generated for the public address and saved to the directory specified by --cert-dir. |
|
--tls-cipher-suites strings | |
Comma-separated list of cipher suites for the server. If omitted, the default Go cipher suites will be used. |
|
--tls-min-version string | |
Minimum TLS version supported. Possible values: VersionTLS10, VersionTLS11, VersionTLS12, VersionTLS13 |
|
--tls-private-key-file string | |
File containing the default x509 private key matching --tls-cert-file. |
|
--tls-sni-cert-key string | |
A pair of x509 certificate and private key file paths, optionally suffixed with a list of domain patterns which are fully qualified domain names, possibly with prefixed wildcard segments. The domain patterns also allow IP addresses, but IPs should only be used if the apiserver has visibility to the IP address requested by a client. If no domain patterns are provided, the names of the certificate are extracted. Non-wildcard matches trump over wildcard matches, explicit domain patterns trump over extracted names. For multiple key/certificate pairs, use the --tls-sni-cert-key multiple times. Examples: "example.crt,example.key" or "foo.crt,foo.key:*.foo.com,foo.com". |
|
--unhealthy-zone-threshold float Default: 0.55 | |
Fraction of Nodes in a zone which needs to be not Ready (minimum 3) for zone to be treated as unhealthy. |
|
--use-service-account-credentials | |
If true, use individual service account credentials for each controller. |
|
-v, --v int | |
number for the log level verbosity |
|
--version version[=true] | |
Print version information and quit |
|
--vmodule pattern=N,... | |
comma-separated list of pattern=N settings for file-filtered logging (only works for text log format) |
|
--volume-host-allow-local-loopback Default: true | |
If false, deny local loopback IPs in addition to any CIDR ranges in --volume-host-cidr-denylist |
|
--volume-host-cidr-denylist strings | |
A comma-separated list of CIDR ranges to avoid from volume plugins. |
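As a hedged illustration of how these flags compose (all paths and values below are examples only, not recommendations), a kube-controller-manager invocation for a replicated control plane might look like:
kube-controller-manager \
  --kubeconfig=/etc/kubernetes/controller-manager.conf \
  --leader-elect=true \
  --use-service-account-credentials=true \
  --service-account-private-key-file=/etc/kubernetes/pki/sa.key \
  --root-ca-file=/etc/kubernetes/pki/ca.crt \
  --controllers='*' \
  --secure-port=10257
Each flag shown here appears in the table above; --leader-elect and --use-service-account-credentials are the ones most commonly set explicitly for high-availability deployments.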
6.11.5 - kube-proxy
Synopsis
The Kubernetes network proxy runs on each node. This reflects services as defined in the Kubernetes API on each node and can do simple TCP, UDP, and SCTP stream forwarding or round robin TCP, UDP, and SCTP forwarding across a set of backends. Service cluster IPs and ports are currently found through Docker-links-compatible environment variables specifying ports opened by the service proxy. There is an optional addon that provides cluster DNS for these cluster IPs. The user must create a service with the apiserver API to configure the proxy.
kube-proxy [flags]
Options
--azure-container-registry-config string | |
Path to the file containing Azure container registry configuration information. |
|
--bind-address string Default: 0.0.0.0 | |
The IP address for the proxy server to serve on (set to '0.0.0.0' for all IPv4 interfaces and '::' for all IPv6 interfaces) |
|
--bind-address-hard-fail | |
If true kube-proxy will treat failure to bind to a port as fatal and exit |
|
--boot-id-file string Default: "/proc/sys/kernel/random/boot_id" | |
Comma-separated list of files to check for boot-id. Use the first one that exists. |
|
--boot_id_file string Default: "/proc/sys/kernel/random/boot_id" | |
Comma-separated list of files to check for boot-id. Use the first one that exists. |
|
--cleanup | |
If true cleanup iptables and ipvs rules and exit. |
|
--cloud-provider-gce-l7lb-src-cidrs cidrs Default: 130.211.0.0/22,35.191.0.0/16 | |
CIDRs opened in GCE firewall for L7 LB traffic proxy & health checks |
|
--cloud-provider-gce-lb-src-cidrs cidrs Default: 130.211.0.0/22,209.85.152.0/22,209.85.204.0/22,35.191.0.0/16 | |
CIDRs opened in GCE firewall for L4 LB traffic proxy & health checks |
|
--cluster-cidr string | |
The CIDR range of pods in the cluster. When configured, traffic sent to a Service cluster IP from outside this range will be masqueraded and traffic sent from pods to an external LoadBalancer IP will be directed to the respective cluster IP instead |
|
--config string | |
The path to the configuration file. |
|
--config-sync-period duration Default: 15m0s | |
How often configuration from the apiserver is refreshed. Must be greater than 0. |
|
--conntrack-max-per-core int32 Default: 32768 | |
Maximum number of NAT connections to track per CPU core (0 to leave the limit as-is and ignore conntrack-min). |
|
--conntrack-min int32 Default: 131072 | |
Minimum number of conntrack entries to allocate, regardless of conntrack-max-per-core (set conntrack-max-per-core=0 to leave the limit as-is). |
|
--conntrack-tcp-timeout-close-wait duration Default: 1h0m0s | |
NAT timeout for TCP connections in the CLOSE_WAIT state |
|
--conntrack-tcp-timeout-established duration Default: 24h0m0s | |
Idle timeout for established TCP connections (0 to leave as-is) |
|
--default-not-ready-toleration-seconds int Default: 300 | |
Indicates the tolerationSeconds of the toleration for notReady:NoExecute that is added by default to every pod that does not already have such a toleration. |
|
--default-unreachable-toleration-seconds int Default: 300 | |
Indicates the tolerationSeconds of the toleration for unreachable:NoExecute that is added by default to every pod that does not already have such a toleration. |
|
--detect-local-mode LocalMode | |
Mode to use to detect local traffic |
|
--feature-gates <comma-separated 'key=True|False' pairs> | |
A set of key=value pairs that describe feature gates for alpha/experimental features. Options are: |
|
--healthz-bind-address ipport Default: 0.0.0.0:10256 | |
The IP address with port for the health check server to serve on (set to '0.0.0.0:10256' for all IPv4 interfaces and '[::]:10256' for all IPv6 interfaces). Set empty to disable. |
|
-h, --help | |
help for kube-proxy |
|
--hostname-override string | |
If non-empty, will use this string as identification instead of the actual hostname. |
|
--iptables-masquerade-bit int32 Default: 14 | |
If using the pure iptables proxy, the bit of the fwmark space to mark packets requiring SNAT with. Must be within the range [0, 31]. |
|
--iptables-min-sync-period duration Default: 1s | |
The minimum interval of how often the iptables rules can be refreshed as endpoints and services change (e.g. '5s', '1m', '2h22m'). |
|
--iptables-sync-period duration Default: 30s | |
The maximum interval of how often iptables rules are refreshed (e.g. '5s', '1m', '2h22m'). Must be greater than 0. |
|
--ipvs-exclude-cidrs strings | |
A comma-separated list of CIDR's which the ipvs proxier should not touch when cleaning up IPVS rules. |
|
--ipvs-min-sync-period duration | |
The minimum interval of how often the ipvs rules can be refreshed as endpoints and services change (e.g. '5s', '1m', '2h22m'). |
|
--ipvs-scheduler string | |
The ipvs scheduler type when proxy mode is ipvs |
|
--ipvs-strict-arp | |
Enable strict ARP by setting arp_ignore to 1 and arp_announce to 2 |
|
--ipvs-sync-period duration Default: 30s | |
The maximum interval of how often ipvs rules are refreshed (e.g. '5s', '1m', '2h22m'). Must be greater than 0. |
|
--ipvs-tcp-timeout duration | |
The timeout for idle IPVS TCP connections, 0 to leave as-is. (e.g. '5s', '1m', '2h22m'). |
|
--ipvs-tcpfin-timeout duration | |
The timeout for IPVS TCP connections after receiving a FIN packet, 0 to leave as-is. (e.g. '5s', '1m', '2h22m'). |
|
--ipvs-udp-timeout duration | |
The timeout for IPVS UDP packets, 0 to leave as-is. (e.g. '5s', '1m', '2h22m'). |
|
--kube-api-burst int32 Default: 10 | |
Burst to use while talking with kubernetes apiserver |
|
--kube-api-content-type string Default: "application/vnd.kubernetes.protobuf" | |
Content type of requests sent to apiserver. |
|
--kube-api-qps float Default: 5 | |
QPS to use while talking with kubernetes apiserver |
|
--kubeconfig string | |
Path to kubeconfig file with authorization information (the master location can be overridden by the master flag). |
|
--machine-id-file string Default: "/etc/machine-id,/var/lib/dbus/machine-id" | |
Comma-separated list of files to check for machine-id. Use the first one that exists. |
|
--machine_id_file string Default: "/etc/machine-id,/var/lib/dbus/machine-id" | |
Comma-separated list of files to check for machine-id. Use the first one that exists. |
|
--masquerade-all | |
If using the pure iptables proxy, SNAT all traffic sent via Service cluster IPs (this is not commonly needed) |
|
--master string | |
The address of the Kubernetes API server (overrides any value in kubeconfig) |
|
--metrics-bind-address ipport Default: 127.0.0.1:10249 | |
The IP address with port for the metrics server to serve on (set to '0.0.0.0:10249' for all IPv4 interfaces and '[::]:10249' for all IPv6 interfaces). Set empty to disable. |
|
--nodeport-addresses strings | |
A string slice of values which specify the addresses to use for NodePorts. Values may be valid IP blocks (e.g. 1.2.3.0/24, 1.2.3.4/32). The default empty string slice ([]) means to use all local addresses. |
|
--oom-score-adj int32 Default: -999 | |
The oom-score-adj value for kube-proxy process. Values must be within the range [-1000, 1000] |
|
--profiling | |
If true enables profiling via web interface on /debug/pprof handler. |
|
--proxy-mode ProxyMode | |
Which proxy mode to use: 'userspace' (older) or 'iptables' (faster) or 'ipvs' or 'kernelspace' (windows). If blank, use the best-available proxy (currently iptables). If the iptables proxy is selected (regardless of how), but the system's kernel or iptables version is insufficient, this always falls back to the userspace proxy. |
|
--proxy-port-range port-range | |
Range of host ports (beginPort-endPort, single port, or beginPort+offset, inclusive) that may be consumed in order to proxy service traffic. If unspecified (0, or 0-0), ports will be chosen randomly. |
|
--show-hidden-metrics-for-version string | |
The previous version for which you want to show hidden metrics. Only the previous minor version is meaningful, other values will not be allowed. The format is <major>.<minor>, e.g.: '1.16'. The purpose of this format is to make sure you have the opportunity to notice if the next release hides additional metrics, rather than being surprised when they are permanently removed in the release after that. |
|
--udp-timeout duration Default: 250ms | |
How long an idle UDP connection will be kept open (e.g. '250ms', '2s'). Must be greater than 0. Only applicable for proxy-mode=userspace |
|
--version version[=true] | |
Print version information and quit |
|
--write-config-to string | |
If set, write the default configuration values to this file and exit. |
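To make the interaction of these flags concrete, here is a hedged example of a kube-proxy invocation that selects the IPVS mode described above (all values are illustrative):
kube-proxy \
  --kubeconfig=/var/lib/kube-proxy/kubeconfig \
  --proxy-mode=ipvs \
  --ipvs-scheduler=rr \
  --ipvs-sync-period=30s \
  --ipvs-min-sync-period=5s \
  --cluster-cidr=10.244.0.0/16
Here rr selects round-robin scheduling, and setting --cluster-cidr lets kube-proxy distinguish pod traffic from external traffic for masquerading, as described for that flag above.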
6.11.6 - kube-scheduler
Synopsis
The Kubernetes scheduler is a control plane process which assigns Pods to Nodes. The scheduler determines which Nodes are valid placements for each Pod in the scheduling queue according to constraints and available resources. The scheduler then ranks each valid Node and binds the Pod to a suitable Node. Multiple different schedulers may be used within a cluster; kube-scheduler is the reference implementation. See scheduling for more information about scheduling and the kube-scheduler component.
kube-scheduler [flags]
Options
--allow-metric-labels stringToString Default: [] | |
The map from metric-label to value allow-list of this label. The key's format is <MetricName>,<LabelName>. The value's format is <allowed_value>,<allowed_value>... e.g. metric1,label1='v1,v2,v3', metric1,label2='v1,v2,v3', metric2,label1='v1,v2,v3'. |
|
--authentication-kubeconfig string | |
kubeconfig file pointing at the 'core' kubernetes server with enough rights to create tokenreviews.authentication.k8s.io. This is optional. If empty, all token requests are considered to be anonymous and no client CA is looked up in the cluster. |
|
--authentication-skip-lookup | |
If false, the authentication-kubeconfig will be used to look up missing authentication configuration from the cluster. |
|
--authentication-token-webhook-cache-ttl duration Default: 10s | |
The duration to cache responses from the webhook token authenticator. |
|
--authentication-tolerate-lookup-failure Default: true | |
If true, failures to look up missing authentication configuration from the cluster are not considered fatal. Note that this can result in authentication that treats all requests as anonymous. |
|
--authorization-always-allow-paths strings Default: "/healthz,/readyz,/livez" | |
A list of HTTP paths to skip during authorization, i.e. these are authorized without contacting the 'core' kubernetes server. |
|
--authorization-kubeconfig string | |
kubeconfig file pointing at the 'core' kubernetes server with enough rights to create subjectaccessreviews.authorization.k8s.io. This is optional. If empty, all requests not skipped by authorization are forbidden. |
|
--authorization-webhook-cache-authorized-ttl duration Default: 10s | |
The duration to cache 'authorized' responses from the webhook authorizer. |
|
--authorization-webhook-cache-unauthorized-ttl duration Default: 10s | |
The duration to cache 'unauthorized' responses from the webhook authorizer. |
|
--azure-container-registry-config string | |
Path to the file containing Azure container registry configuration information. |
|
--bind-address string Default: 0.0.0.0 | |
The IP address on which to listen for the --secure-port port. The associated interface(s) must be reachable by the rest of the cluster, and by CLI/web clients. If blank or an unspecified address (0.0.0.0 or ::), all interfaces will be used. |
|
--cert-dir string | |
The directory where the TLS certs are located. If --tls-cert-file and --tls-private-key-file are provided, this flag will be ignored. |
|
--client-ca-file string | |
If set, any request presenting a client certificate signed by one of the authorities in the client-ca-file is authenticated with an identity corresponding to the CommonName of the client certificate. |
|
--config string | |
The path to the configuration file. |
|
--contention-profiling Default: true | |
DEPRECATED: enable lock contention profiling, if profiling is enabled. This parameter is ignored if a config file is specified in --config. |
|
--disabled-metrics strings | |
This flag provides an escape hatch for misbehaving metrics. You must provide the fully qualified metric name in order to disable it. Disclaimer: disabling metrics is higher in precedence than showing hidden metrics. |
|
--experimental-logging-sanitization | |
[Experimental] When enabled prevents logging of fields tagged as sensitive (passwords, keys, tokens). |
|
--feature-gates <comma-separated 'key=True|False' pairs> | |
A set of key=value pairs that describe feature gates for alpha/experimental features. Options are: |
|
-h, --help | |
help for kube-scheduler |
|
--http2-max-streams-per-connection int | |
The limit that the server gives to clients for the maximum number of streams in an HTTP/2 connection. Zero means to use golang's default. |
|
--kube-api-burst int32 Default: 100 | |
DEPRECATED: burst to use while talking with kubernetes apiserver. This parameter is ignored if a config file is specified in --config. |
|
--kube-api-content-type string Default: "application/vnd.kubernetes.protobuf" | |
DEPRECATED: content type of requests sent to apiserver. This parameter is ignored if a config file is specified in --config. |
|
--kube-api-qps float Default: 50 | |
DEPRECATED: QPS to use while talking with kubernetes apiserver. This parameter is ignored if a config file is specified in --config. |
|
--kubeconfig string | |
DEPRECATED: path to kubeconfig file with authorization and master location information. This parameter is ignored if a config file is specified in --config. |
|
--leader-elect Default: true | |
Start a leader election client and gain leadership before executing the main loop. Enable this when running replicated components for high availability. |
|
--leader-elect-lease-duration duration Default: 15s | |
The duration that non-leader candidates will wait after observing a leadership renewal until attempting to acquire leadership of a led but unrenewed leader slot. This is effectively the maximum duration that a leader can be stopped before it is replaced by another candidate. This is only applicable if leader election is enabled. |
|
--leader-elect-renew-deadline duration Default: 10s | |
The interval between attempts by the acting master to renew a leadership slot before it stops leading. This must be less than or equal to the lease duration. This is only applicable if leader election is enabled. |
|
--leader-elect-resource-lock string Default: "leases" | |
The type of resource object that is used for locking during leader election. Supported options are 'endpoints', 'configmaps', 'leases', 'endpointsleases' and 'configmapsleases'. |
|
--leader-elect-resource-name string Default: "kube-scheduler" | |
The name of resource object that is used for locking during leader election. |
|
--leader-elect-resource-namespace string Default: "kube-system" | |
The namespace of resource object that is used for locking during leader election. |
|
--leader-elect-retry-period duration Default: 2s | |
The duration the clients should wait between attempting acquisition and renewal of a leadership. This is only applicable if leader election is enabled. |
|
--lock-object-name string Default: "kube-scheduler" | |
DEPRECATED: define the name of the lock object. Will be removed in favor of leader-elect-resource-name. This parameter is ignored if a config file is specified in --config. |
|
--lock-object-namespace string Default: "kube-system" | |
DEPRECATED: define the namespace of the lock object. Will be removed in favor of leader-elect-resource-namespace. This parameter is ignored if a config file is specified in --config. |
|
--log-flush-frequency duration Default: 5s | |
Maximum number of seconds between log flushes |
|
--logging-format string Default: "text" | |
Sets the log format. Permitted formats: "text". |
|
--master string | |
The address of the Kubernetes API server (overrides any value in kubeconfig) |
|
--permit-address-sharing | |
If true, SO_REUSEADDR will be used when binding the port. This allows binding to wildcard IPs like 0.0.0.0 and specific IPs in parallel, and it avoids waiting for the kernel to release sockets in TIME_WAIT state. [default=false] |
|
--permit-port-sharing | |
If true, SO_REUSEPORT will be used when binding the port, which allows more than one instance to bind on the same address and port. [default=false] |
|
--profiling Default: true | |
DEPRECATED: enable profiling via web interface host:port/debug/pprof/. This parameter is ignored if a config file is specified in --config. |
|
--requestheader-allowed-names strings | |
List of client certificate common names to allow to provide usernames in headers specified by --requestheader-username-headers. If empty, any client certificate validated by the authorities in --requestheader-client-ca-file is allowed. |
|
--requestheader-client-ca-file string | |
Root certificate bundle to use to verify client certificates on incoming requests before trusting usernames in headers specified by --requestheader-username-headers. WARNING: generally do not depend on authorization being already done for incoming requests. |
|
--requestheader-extra-headers-prefix strings Default: "x-remote-extra-" | |
List of request header prefixes to inspect. X-Remote-Extra- is suggested. |
|
--requestheader-group-headers strings Default: "x-remote-group" | |
List of request headers to inspect for groups. X-Remote-Group is suggested. |
|
--requestheader-username-headers strings Default: "x-remote-user" | |
List of request headers to inspect for usernames. X-Remote-User is common. |
|
--secure-port int Default: 10259 | |
The port on which to serve HTTPS with authentication and authorization. If 0, don't serve HTTPS at all. |
|
--show-hidden-metrics-for-version string | |
The previous version for which you want to show hidden metrics. Only the previous minor version is meaningful, other values will not be allowed. The format is <major>.<minor>, e.g.: '1.16'. The purpose of this format is to make sure you have the opportunity to notice if the next release hides additional metrics, rather than being surprised when they are permanently removed in the release after that. |
|
--tls-cert-file string | |
File containing the default x509 Certificate for HTTPS. (CA cert, if any, concatenated after server cert). If HTTPS serving is enabled, and --tls-cert-file and --tls-private-key-file are not provided, a self-signed certificate and key are generated for the public address and saved to the directory specified by --cert-dir. |
|
--tls-cipher-suites strings | |
Comma-separated list of cipher suites for the server. If omitted, the default Go cipher suites will be used. |
|
--tls-min-version string | |
Minimum TLS version supported. Possible values: VersionTLS10, VersionTLS11, VersionTLS12, VersionTLS13 |
|
--tls-private-key-file string | |
File containing the default x509 private key matching --tls-cert-file. |
|
--tls-sni-cert-key string | |
A pair of x509 certificate and private key file paths, optionally suffixed with a list of domain patterns which are fully qualified domain names, possibly with prefixed wildcard segments. The domain patterns also allow IP addresses, but IPs should only be used if the apiserver has visibility to the IP address requested by a client. If no domain patterns are provided, the names of the certificate are extracted. Non-wildcard matches trump over wildcard matches, explicit domain patterns trump over extracted names. For multiple key/certificate pairs, use the --tls-sni-cert-key multiple times. Examples: "example.crt,example.key" or "foo.crt,foo.key:*.foo.com,foo.com". |
|
-v, --v int | |
number for the log level verbosity |
|
--version version[=true] | |
Print version information and quit |
|
--vmodule pattern=N,... | |
comma-separated list of pattern=N settings for file-filtered logging (only works for text log format) |
|
--write-config-to string | |
If set, write the configuration values to this file and exit. |
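Because most tuning flags above are deprecated in favor of --config, a hedged minimal kube-scheduler invocation (paths are illustrative) typically passes a configuration file plus the serving and delegation flags:
kube-scheduler \
  --config=/etc/kubernetes/scheduler-config.yaml \
  --authentication-kubeconfig=/etc/kubernetes/scheduler.conf \
  --authorization-kubeconfig=/etc/kubernetes/scheduler.conf \
  --bind-address=127.0.0.1
The referenced config file would contain a KubeSchedulerConfiguration object; its exact contents are out of scope here.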
6.11.7 - Kubelet authentication/authorization
Overview
A kubelet's HTTPS endpoint exposes APIs which give access to data of varying sensitivity, and allow you to perform operations with varying levels of power on the node and within containers.
This document describes how to authenticate and authorize access to the kubelet's HTTPS endpoint.
Kubelet authentication
By default, requests to the kubelet's HTTPS endpoint that are not rejected by other configured authentication methods are treated as anonymous requests, and given a username of system:anonymous and a group of system:unauthenticated.
To disable anonymous access and send 401 Unauthorized responses to unauthenticated requests:
- start the kubelet with the --anonymous-auth=false flag
To enable X509 client certificate authentication to the kubelet's HTTPS endpoint:
- start the kubelet with the --client-ca-file flag, providing a CA bundle to verify client certificates with
- start the apiserver with the --kubelet-client-certificate and --kubelet-client-key flags
- see the apiserver authentication documentation for more details
To enable API bearer tokens (including service account tokens) to be used to authenticate to the kubelet's HTTPS endpoint:
- ensure the authentication.k8s.io/v1beta1 API group is enabled in the API server
- start the kubelet with the --authentication-token-webhook and --kubeconfig flags
- the kubelet calls the TokenReview API on the configured API server to determine user information from bearer tokens
Kubelet authorization
Any request that is successfully authenticated (including an anonymous request) is then authorized. The default authorization mode is AlwaysAllow, which allows all requests.
There are many possible reasons to subdivide access to the kubelet API:
- anonymous auth is enabled, but anonymous users' ability to call the kubelet API should be limited
- bearer token auth is enabled, but arbitrary API users' (like service accounts) ability to call the kubelet API should be limited
- client certificate auth is enabled, but only some of the client certificates signed by the configured CA should be allowed to use the kubelet API
To subdivide access to the kubelet API, delegate authorization to the API server:
- ensure the authorization.k8s.io/v1beta1 API group is enabled in the API server
- start the kubelet with the --authorization-mode=Webhook and the --kubeconfig flags
- the kubelet calls the SubjectAccessReview API on the configured API server to determine whether each request is authorized
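Taken together with the authentication settings above, a kubelet invocation that disables anonymous access and delegates both authentication and authorization to the API server might look like this sketch (file paths are illustrative):
kubelet \
  --anonymous-auth=false \
  --client-ca-file=/var/lib/kubernetes/ca.pem \
  --authentication-token-webhook \
  --authorization-mode=Webhook \
  --kubeconfig=/var/lib/kubelet/kubeconfig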
The kubelet authorizes API requests using the same request attributes approach as the apiserver.
The verb is determined from the incoming request's HTTP verb:
HTTP verb | request verb |
---|---|
POST | create |
GET, HEAD | get |
PUT | update |
PATCH | patch |
DELETE | delete |
The resource and subresource are determined from the incoming request's path:
Kubelet API | resource | subresource |
---|---|---|
/stats/* | nodes | stats |
/metrics/* | nodes | metrics |
/logs/* | nodes | log |
/spec/* | nodes | spec |
all others | nodes | proxy |
The namespace and API group attributes are always an empty string, and the resource name is always the name of the kubelet's Node API object.
When running in this mode, ensure the user identified by the --kubelet-client-certificate and --kubelet-client-key flags passed to the apiserver is authorized for the following attributes:
- verb=*, resource=nodes, subresource=proxy
- verb=*, resource=nodes, subresource=stats
- verb=*, resource=nodes, subresource=log
- verb=*, resource=nodes, subresource=spec
- verb=*, resource=nodes, subresource=metrics
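One way to grant these attributes (a sketch; the ClusterRole name below is illustrative, and recent Kubernetes releases also ship a built-in system:kubelet-api-admin ClusterRole covering the same attributes) is an RBAC rule on the nodes subresources:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kubelet-api-access
rules:
- apiGroups: [""]
  resources:
  - nodes/proxy
  - nodes/stats
  - nodes/log
  - nodes/spec
  - nodes/metrics
  verbs: ["*"]
This ClusterRole would then be bound to the apiserver's kubelet-client identity, for example with kubectl create clusterrolebinding kubelet-api-access --clusterrole=kubelet-api-access --user=<apiserver-kubelet-client-user>.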
6.11.8 - TLS bootstrapping
In a Kubernetes cluster, the components on the worker nodes - kubelet and kube-proxy - need to communicate with Kubernetes control plane components, specifically kube-apiserver. In order to ensure that communication is kept private, not interfered with, and ensure that each component of the cluster is talking to another trusted component, we strongly recommend using client TLS certificates on nodes.
The normal process of bootstrapping these components, especially worker nodes that need certificates so they can communicate safely with kube-apiserver, can be challenging, as it is often outside the scope of Kubernetes and requires significant additional work. This, in turn, can make it challenging to initialize or scale a cluster.
In order to simplify the process, beginning in version 1.4, Kubernetes introduced a certificate request and signing API. The proposal can be found here.
This document describes the process of node initialization, how to set up TLS client certificate bootstrapping for kubelets, and how it works.
Initialization Process
When a worker node starts up, the kubelet does the following:
- Look for its kubeconfig file
- Retrieve the URL of the API server and credentials, normally a TLS key and signed certificate, from the kubeconfig file
- Attempt to communicate with the API server using the credentials.
Assuming that the kube-apiserver successfully validates the kubelet's credentials, it will treat the kubelet as a valid node, and begin to assign pods to it.
Note that the above process depends upon:
- Existence of a key and certificate on the local host in the kubeconfig
- The certificate having been signed by a Certificate Authority (CA) trusted by the kube-apiserver
All of the following are responsibilities of whoever sets up and manages the cluster:
1. Creating the CA key and certificate
2. Distributing the CA certificate to the control plane nodes, where kube-apiserver is running
3. Creating a key and certificate for each kubelet; strongly recommended to have a unique one, with a unique CN, for each kubelet
4. Signing the kubelet certificate using the CA key
5. Distributing the kubelet key and signed certificate to the specific node on which the kubelet is running
The TLS Bootstrapping described in this document is intended to simplify, and partially or even completely automate, steps 3 onwards, as these are the most common when initializing or scaling a cluster.
Bootstrap Initialization
In the bootstrap initialization process, the following occurs:
- kubelet begins
- kubelet sees that it does not have a kubeconfig file
- kubelet searches for and finds a bootstrap-kubeconfig file
- kubelet reads its bootstrap file, retrieving the URL of the API server and a limited usage "token"
- kubelet connects to the API server, authenticates using the token
- kubelet now has limited credentials to create and retrieve a certificate signing request (CSR)
- kubelet creates a CSR for itself with the signerName set to kubernetes.io/kube-apiserver-client-kubelet
- CSR is approved in one of two ways:
- If configured, kube-controller-manager automatically approves the CSR
- If configured, an outside process, possibly a person, approves the CSR using the Kubernetes API or via kubectl
- Certificate is created for the kubelet
- Certificate is issued to the kubelet
- kubelet retrieves the certificate
- kubelet creates a proper kubeconfig with the key and signed certificate
- kubelet begins normal operation
- Optional: if configured, kubelet automatically requests renewal of the certificate when it is close to expiry
- The renewed certificate is approved and issued, either automatically or manually, depending on configuration.
The rest of this document describes the necessary steps to configure TLS Bootstrapping, and its limitations.
Configuration
To configure for TLS bootstrapping and optional automatic approval, you must configure options on the following components:
- kube-apiserver
- kube-controller-manager
- kubelet
- in-cluster resources: ClusterRoleBinding and potentially ClusterRole
In addition, you need your Kubernetes Certificate Authority (CA).
Certificate Authority
As when running without bootstrapping, you will need a Certificate Authority (CA) key and certificate, which will be used to sign the kubelet certificate. As before, it is your responsibility to distribute them to control plane nodes.
For the purposes of this document, we will assume these have been distributed to control plane nodes at /var/lib/kubernetes/ca.pem (certificate) and /var/lib/kubernetes/ca-key.pem (key). We will refer to these as the "Kubernetes CA certificate and key".
All Kubernetes components that use these certificates - kubelet, kube-apiserver, kube-controller-manager - assume the key and certificate to be PEM-encoded.
kube-apiserver configuration
The kube-apiserver has several requirements to enable TLS bootstrapping:
- Recognizing a CA that signs the client certificates
- Authenticating the bootstrapping kubelet to the system:bootstrappers group
- Authorizing the bootstrapping kubelet to create a certificate signing request (CSR)
Recognizing client certificates
This is normal for all client certificate authentication.
If not already set, add the --client-ca-file=FILENAME flag to the kube-apiserver command to enable client certificate authentication, referencing a certificate authority bundle containing the signing certificate, for example --client-ca-file=/var/lib/kubernetes/ca.pem.
Initial bootstrap authentication
In order for the bootstrapping kubelet to connect to kube-apiserver and request a certificate, it must first authenticate to the server. You can use any authenticator that can authenticate the kubelet.
While any authentication strategy can be used for the kubelet's initial bootstrap credentials, the following two authenticators are recommended for ease of provisioning.
Bootstrap tokens are a simpler and more easily managed method to authenticate kubelets, and do not require any additional flags when starting kube-apiserver.
Whichever method you choose, the requirement is that the kubelet be able to authenticate as a user with the rights to:
- create and retrieve CSRs
- be automatically approved to request node client certificates, if automatic approval is enabled.
A kubelet authenticating using bootstrap tokens is authenticated as a user in the group system:bootstrappers, which is the standard method to use.
As this feature matures, you should ensure tokens are bound to a Role Based Access Control (RBAC) policy which limits requests (using the bootstrap token) strictly to client requests related to certificate provisioning. With RBAC in place, scoping the tokens to a group allows for great flexibility. For example, you could disable a particular bootstrap group's access when you are done provisioning the nodes.
Bootstrap tokens
Bootstrap tokens are described in detail here. These are tokens that are stored as secrets in the Kubernetes cluster, and then issued to the individual kubelet. You can use a single token for an entire cluster, or issue one per worker node.
The process is two-fold:
- Create a Kubernetes secret with the token ID, secret and scope(s).
- Issue the token to the kubelet
From the kubelet's perspective, one token is like another and has no special meaning.
From the kube-apiserver's perspective, however, the bootstrap token is special. Due to its type, namespace and name, kube-apiserver recognizes it as a special token, and grants anyone authenticating with that token special bootstrap rights, notably treating them as a member of the system:bootstrappers group. This fulfills a basic requirement for TLS bootstrapping.
The details for creating the secret are available here.
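As a sketch of what such a secret looks like (the token ID and secret below reuse the sample token 07401b.f395accd246ae52d shown later in this document, and are illustrative), a bootstrap token is stored as a Secret named bootstrap-token-<token-id> in the kube-system namespace, with type bootstrap.kubernetes.io/token:
apiVersion: v1
kind: Secret
metadata:
  name: bootstrap-token-07401b
  namespace: kube-system
type: bootstrap.kubernetes.io/token
stringData:
  token-id: "07401b"
  token-secret: "f395accd246ae52d"
  usage-bootstrap-authentication: "true"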
If you want to use bootstrap tokens, you must enable it on kube-apiserver with the flag:
--enable-bootstrap-token-auth=true
Token authentication file
kube-apiserver has the ability to accept tokens as authentication.
These tokens are arbitrary but should represent at least 128 bits of entropy derived from a secure random number generator (such as /dev/urandom on most modern Linux systems). There are multiple ways you can generate a token. For example:
head -c 16 /dev/urandom | od -An -t x | tr -d ' '
will generate tokens that look like 02b50b05283e98dd0fd71db496ef01e8.
The token file should look like the following example, where the first three values can be anything and the quoted group name should be as depicted:
02b50b05283e98dd0fd71db496ef01e8,kubelet-bootstrap,10001,"system:bootstrappers"
Add the --token-auth-file=FILENAME flag to the kube-apiserver command (in your systemd unit file perhaps) to enable the token file. See docs here for further details.
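Putting those steps together, a token file entry matching the example row above can be generated like this (the output path and the kubelet-bootstrap user/UID values are illustrative):
TOKEN=$(head -c 16 /dev/urandom | od -An -t x | tr -d ' ')
echo "${TOKEN},kubelet-bootstrap,10001,\"system:bootstrappers\"" >> /var/lib/kubernetes/token-auth.csv
The resulting file would then be referenced with --token-auth-file=/var/lib/kubernetes/token-auth.csv.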
Authorize kubelet to create CSR
Now that the bootstrapping node is authenticated as part of the system:bootstrappers group, it needs to be authorized to create a certificate signing request (CSR) as well as retrieve it when done. Fortunately, Kubernetes ships with a ClusterRole with precisely these (and only these) permissions, system:node-bootstrapper.
To do this, you only need to create a ClusterRoleBinding that binds the system:bootstrappers group to the cluster role system:node-bootstrapper.
# enable bootstrapping nodes to create CSR
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: create-csrs-for-bootstrapping
subjects:
- kind: Group
  name: system:bootstrappers
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: system:node-bootstrapper
  apiGroup: rbac.authorization.k8s.io
kube-controller-manager configuration
While the apiserver receives the requests for certificates from the kubelet and authenticates those requests, the controller-manager is responsible for issuing actual signed certificates.
The controller-manager performs this function via a certificate-issuing control loop. This takes the form of a cfssl local signer using assets on disk. Currently, all certificates issued have one year validity and a default set of key usages.
In order for the controller-manager to sign certificates, it needs the following:
- access to the "Kubernetes CA key and certificate" that you created and distributed
- enabling CSR signing
Access to key and certificate
As described earlier, you need to create a Kubernetes CA key and certificate, and distribute it to the control plane nodes. These will be used by the controller-manager to sign the kubelet certificates.
Since these signed certificates will, in turn, be used by the kubelet to authenticate as a regular kubelet to kube-apiserver, it is important that the CA
provided to the controller-manager at this stage also be trusted by kube-apiserver for authentication. This is provided to kube-apiserver
with the flag --client-ca-file=FILENAME
(for example, --client-ca-file=/var/lib/kubernetes/ca.pem
), as described in the kube-apiserver configuration section.
To provide the Kubernetes CA key and certificate to kube-controller-manager, use the following flags:
--cluster-signing-cert-file="/etc/path/to/kubernetes/ca/ca.crt" --cluster-signing-key-file="/etc/path/to/kubernetes/ca/ca.key"
for example:
--cluster-signing-cert-file="/var/lib/kubernetes/ca.pem" --cluster-signing-key-file="/var/lib/kubernetes/ca-key.pem"
The validity duration of signed certificates can be configured with the flag:
--cluster-signing-duration
Approval
In order to approve CSRs, you need to tell the controller-manager that it is acceptable to approve them. This is done by granting RBAC permissions to the correct group.
There are two distinct sets of permissions:
- nodeclient: If a node is creating a new certificate for a node, then it does not have a certificate yet. It is authenticating using one of the tokens listed above, and thus is part of the group system:bootstrappers.
- selfnodeclient: If a node is renewing its certificate, then it already has a certificate (by definition), which it uses continuously to authenticate as part of the group system:nodes.
To enable the kubelet to request and receive a new certificate, create a ClusterRoleBinding that binds the group in which the bootstrapping node is a member, system:bootstrappers, to the ClusterRole that grants it permission, system:certificates.k8s.io:certificatesigningrequests:nodeclient:
# Approve all CSRs for the group "system:bootstrappers"
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: auto-approve-csrs-for-group
subjects:
- kind: Group
  name: system:bootstrappers
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: system:certificates.k8s.io:certificatesigningrequests:nodeclient
  apiGroup: rbac.authorization.k8s.io
To enable the kubelet to renew its own client certificate, create a ClusterRoleBinding that binds the group in which the fully functioning node is a member, system:nodes, to the ClusterRole that grants it permission, system:certificates.k8s.io:certificatesigningrequests:selfnodeclient:
# Approve renewal CSRs for the group "system:nodes"
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: auto-approve-renewals-for-nodes
subjects:
- kind: Group
  name: system:nodes
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: system:certificates.k8s.io:certificatesigningrequests:selfnodeclient
  apiGroup: rbac.authorization.k8s.io
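Assuming the two bindings above are saved to files, they can be created with kubectl (file names are illustrative):
kubectl apply -f auto-approve-csrs-for-group.yaml
kubectl apply -f auto-approve-renewals-for-nodes.yaml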
The csrapproving controller ships as part of kube-controller-manager and is enabled by default. The controller uses the SubjectAccessReview API to determine if a given user is authorized to request a CSR, then approves based on the authorization outcome. To prevent conflicts with other approvers, the built-in approver doesn't explicitly deny CSRs. It only ignores unauthorized requests. The controller also prunes expired certificates as part of garbage collection.
kubelet configuration
Finally, with the control plane nodes properly set up and all of the necessary authentication and authorization in place, we can configure the kubelet.
The kubelet requires the following configuration to bootstrap:
- A path to store the key and certificate it generates (optional, can use default)
- A path to a kubeconfig file that does not yet exist; it will place the bootstrapped config file here
- A path to a bootstrap kubeconfig file to provide the URL for the server and bootstrap credentials, e.g. a bootstrap token
- Optional: instructions to rotate certificates
The bootstrap kubeconfig should be in a path available to the kubelet, for example /var/lib/kubelet/bootstrap-kubeconfig. Its format is identical to a normal kubeconfig file. A sample file might look as follows:
apiVersion: v1
kind: Config
clusters:
- cluster:
    certificate-authority: /var/lib/kubernetes/ca.pem
    server: https://my.server.example.com:6443
  name: bootstrap
contexts:
- context:
    cluster: bootstrap
    user: kubelet-bootstrap
  name: bootstrap
current-context: bootstrap
preferences: {}
users:
- name: kubelet-bootstrap
  user:
    token: 07401b.f395accd246ae52d
The important elements to note are:
- certificate-authority: path to a CA file, used to validate the server certificate presented by kube-apiserver
- server: URL to kube-apiserver
- token: the token to use
The format of the token does not matter, as long as it matches what kube-apiserver expects. In the above example, we used a bootstrap token. As stated earlier, any valid authentication method can be used, not only tokens.
Because the bootstrap kubeconfig is a standard kubeconfig, you can use kubectl to generate it. To create the above example file:
kubectl config --kubeconfig=/var/lib/kubelet/bootstrap-kubeconfig set-cluster bootstrap --server='https://my.server.example.com:6443' --certificate-authority=/var/lib/kubernetes/ca.pem
kubectl config --kubeconfig=/var/lib/kubelet/bootstrap-kubeconfig set-credentials kubelet-bootstrap --token=07401b.f395accd246ae52d
kubectl config --kubeconfig=/var/lib/kubelet/bootstrap-kubeconfig set-context bootstrap --user=kubelet-bootstrap --cluster=bootstrap
kubectl config --kubeconfig=/var/lib/kubelet/bootstrap-kubeconfig use-context bootstrap
To indicate to the kubelet to use the bootstrap kubeconfig, use the following kubelet flags:
--bootstrap-kubeconfig="/var/lib/kubelet/bootstrap-kubeconfig" --kubeconfig="/var/lib/kubelet/kubeconfig"
When starting the kubelet, if the file specified via --kubeconfig does not exist, the bootstrap kubeconfig specified via --bootstrap-kubeconfig is used to request a client certificate from the API server. On approval of the certificate request and receipt back by the kubelet, a kubeconfig file referencing the generated key and obtained certificate is written to the path specified by --kubeconfig. The certificate and key file will be placed in the directory specified by --cert-dir.
Client and Serving Certificates
All of the above relate to kubelet client certificates, specifically, the certificates a kubelet uses to authenticate to kube-apiserver.
A kubelet can also use serving certificates. The kubelet itself exposes an HTTPS endpoint for certain features. To secure these, the kubelet can do one of:
- use the provided key and certificate, via the --tls-private-key-file and --tls-cert-file flags
- create self-signed key and certificate, if a key and certificate are not provided
- request serving certificates from the cluster server, via the CSR API
The client certificate provided by TLS bootstrapping is signed, by default, for client auth only, and thus cannot be used as a serving certificate, or for server auth.
However, you can enable its server certificate, at least partially, via certificate rotation.
Certificate Rotation
In Kubernetes v1.8 and later, the kubelet implements beta features for rotating its client and/or serving certificates. These can be enabled through the respective RotateKubeletClientCertificate and RotateKubeletServerCertificate feature flags on the kubelet and are enabled by default.
RotateKubeletClientCertificate causes the kubelet to rotate its client certificates by creating new CSRs as its existing credentials expire. To enable this feature, pass the following flag to the kubelet:
--rotate-certificates
RotateKubeletServerCertificate causes the kubelet both to request a serving certificate after bootstrapping its client credentials and to rotate that certificate. To enable this feature, pass the following flag to the kubelet:
--rotate-server-certificates
The CSR approving controllers implemented in core Kubernetes do not approve node serving certificates, for security reasons. To use RotateKubeletServerCertificate, operators need to run a custom approving controller, or manually approve the serving certificate requests.
A deployment-specific approval process for kubelet serving certificates should typically only approve CSRs which:
- are requested by nodes (ensure the spec.username field is of the form system:node:<nodeName> and spec.groups contains system:nodes)
- request usages for a serving certificate (ensure spec.usages contains server auth, optionally contains digital signature and key encipherment, and contains no other usages)
- only have IP and DNS subjectAltNames that belong to the requesting node, and have no URI and Email subjectAltNames (parse the x509 Certificate Signing Request in spec.request to verify subjectAltNames)
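For illustration, a kubelet serving CSR that passes these checks might look roughly like the following sketch (the node name and object name are hypothetical, and the base64 request body is elided):

apiVersion: certificates.k8s.io/v1
kind: CertificateSigningRequest
metadata:
  name: csr-node01-serving                 # hypothetical name
spec:
  username: system:node:node01             # set by the API server from the requester; must be system:node:<nodeName>
  groups:                                  # set by the API server from the requester
  - system:nodes                           # must contain system:nodes
  signerName: kubernetes.io/kubelet-serving
  usages:
  - server auth                            # required for a serving certificate
  - digital signature                      # optional
  - key encipherment                       # optional
  request: <base64-encoded PKCS#10 CSR>    # subjectAltNames inside must belong to node01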
Other authenticating components
All of the TLS bootstrapping described in this document relates to the kubelet. However, other components may need to communicate directly with kube-apiserver. Notable is kube-proxy, which is one of the Kubernetes node components and runs on every node, but this may also include other components such as monitoring or networking agents.
Like the kubelet, these other components also require a method of authenticating to kube-apiserver. You have several options for generating these credentials:
- The old way: create and distribute certificates the same way you did for the kubelet before TLS bootstrapping
- DaemonSet: since the kubelet itself is loaded on each node, and is sufficient to start base services, you can run kube-proxy and other node-specific services not as a standalone process, but rather as a DaemonSet in the kube-system namespace. Since it will be in-cluster, you can give it a proper service account with appropriate permissions to perform its activities. This may be the simplest way to configure such services.
kubectl approval
CSRs can be approved outside of the approval flows built in to the controller manager.
The signing controller does not immediately sign all certificate requests. Instead, it waits until they have been flagged with an "Approved" status by an appropriately-privileged user. This flow is intended to allow for automated approval handled by an external approval controller or the approval controller implemented in the core controller-manager. However, cluster administrators can also manually approve certificate requests using kubectl. An administrator can list CSRs with kubectl get csr and describe one in detail with kubectl describe csr <name>. An administrator can approve or deny a CSR with kubectl certificate approve <name> and kubectl certificate deny <name>.
6.12 - Configuration APIs
6.12.1 - Client Authentication (v1)
Resource Types
ExecCredential
ExecCredential is used by exec-based plugins to communicate credentials to HTTP transports.
Field | Description |
---|---|
apiVersion string | client.authentication.k8s.io/v1 |
kind string | ExecCredential |
spec [Required] ExecCredentialSpec | Spec holds information passed to the plugin by the transport. |
status ExecCredentialStatus | Status is filled in by the plugin and holds the credentials that the transport should use to contact the API. |
Cluster
Cluster contains information to allow an exec plugin to communicate with the kubernetes cluster being authenticated to.
To ensure that this struct contains everything someone would need to communicate with a kubernetes cluster (just like they would via a kubeconfig), the fields should shadow "k8s.io/client-go/tools/clientcmd/api/v1".Cluster, with the exception of CertificateAuthority, since CA data will always be passed to the plugin as bytes.
Field | Description |
---|---|
server [Required] string | Server is the address of the kubernetes cluster (https://hostname:port). |
tls-server-name string | TLSServerName is passed to the server for SNI and is used in the client to check server certificates against. If ServerName is empty, the hostname used to contact the server is used. |
insecure-skip-tls-verify bool | InsecureSkipTLSVerify skips the validity check for the server's certificate. This will make your HTTPS connections insecure. |
certificate-authority-data []byte | CAData contains PEM-encoded certificate authority certificates. If empty, system roots should be used. |
proxy-url string | ProxyURL is the URL to the proxy to be used for all requests to this cluster. |
config k8s.io/apimachinery/pkg/runtime.RawExtension | Config holds additional config data that is specific to the exec plugin with regards to the cluster being authenticated to. This data is sourced from the clientcmd Cluster object's extensions[client.authentication.k8s.io/exec] field. In some environments, the user config may be exactly the same across many clusters (i.e. call this exec plugin) minus some details that are specific to each cluster such as the audience. This field allows the per cluster config to be directly specified with the cluster info. Using this field to store secret data is not recommended as one of the prime benefits of exec plugins is that no secrets need to be stored directly in the kubeconfig. |
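For illustration only, the extension described in the config field above is carried in the kubeconfig like this sketch (the cluster name and the audience value are hypothetical, plugin-specific data):

apiVersion: v1
kind: Config
clusters:
- name: my-cluster                                # hypothetical cluster name
  cluster:
    server: https://my.server.example.com:6443
    extensions:
    - name: client.authentication.k8s.io/exec     # reserved extension name
      extension:
        audience: my-cluster-audience             # hypothetical plugin-specific data

When provideClusterInfo is enabled, the plugin receives this data back in the Cluster's config field described above.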
ExecCredentialSpec
ExecCredentialSpec holds request and runtime specific information provided by the transport.
Field | Description |
---|---|
cluster Cluster | Cluster contains information to allow an exec plugin to communicate with the kubernetes cluster being authenticated to. Note that Cluster is non-nil only when provideClusterInfo is set to true in the exec provider config (i.e., ExecConfig.ProvideClusterInfo). |
interactive [Required] bool | Interactive declares whether stdin has been passed to this exec plugin. |
ExecCredentialStatus
ExecCredentialStatus holds credentials for the transport to use.
Token and ClientKeyData are sensitive fields. This data should only be transmitted in-memory between client and exec plugin process. Exec plugin itself should at least be protected via file permissions.
Field | Description |
---|---|
expirationTimestamp meta/v1.Time | ExpirationTimestamp indicates a time when the provided credentials expire. |
token [Required] string | Token is a bearer token used by the client for request authentication. |
clientCertificateData [Required] string | PEM-encoded client TLS certificates (including intermediates, if any). |
clientKeyData [Required] string | PEM-encoded private key for the above certificate. |
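As a concrete illustration, an exec plugin returns these credentials by writing an ExecCredential to its stdout (serialized as JSON); shown in YAML form, a minimal sketch with placeholder values might be:

apiVersion: client.authentication.k8s.io/v1
kind: ExecCredential
status:
  token: my-bearer-token                        # placeholder, not a real credential
  expirationTimestamp: "2030-10-01T15:00:00Z"   # placeholder expiry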
6.12.2 - Client Authentication (v1beta1)
Resource Types
ExecCredential
ExecCredential is used by exec-based plugins to communicate credentials to HTTP transports.
Field | Description |
---|---|
apiVersion string | client.authentication.k8s.io/v1beta1 |
kind string | ExecCredential |
spec [Required] ExecCredentialSpec | Spec holds information passed to the plugin by the transport. |
status ExecCredentialStatus | Status is filled in by the plugin and holds the credentials that the transport should use to contact the API. |
Cluster
Cluster contains information to allow an exec plugin to communicate with the kubernetes cluster being authenticated to.
To ensure that this struct contains everything someone would need to communicate with a kubernetes cluster (just like they would via a kubeconfig), the fields should shadow "k8s.io/client-go/tools/clientcmd/api/v1".Cluster, with the exception of CertificateAuthority, since CA data will always be passed to the plugin as bytes.
Field | Description |
---|---|
server [Required] string | Server is the address of the kubernetes cluster (https://hostname:port). |
tls-server-name string | TLSServerName is passed to the server for SNI and is used in the client to check server certificates against. If ServerName is empty, the hostname used to contact the server is used. |
insecure-skip-tls-verify bool | InsecureSkipTLSVerify skips the validity check for the server's certificate. This will make your HTTPS connections insecure. |
certificate-authority-data []byte | CAData contains PEM-encoded certificate authority certificates. If empty, system roots should be used. |
proxy-url string | ProxyURL is the URL to the proxy to be used for all requests to this cluster. |
config k8s.io/apimachinery/pkg/runtime.RawExtension | Config holds additional config data that is specific to the exec plugin with regards to the cluster being authenticated to. This data is sourced from the clientcmd Cluster object's extensions[client.authentication.k8s.io/exec] field. In some environments, the user config may be exactly the same across many clusters (i.e. call this exec plugin) minus some details that are specific to each cluster such as the audience. This field allows the per cluster config to be directly specified with the cluster info. Using this field to store secret data is not recommended as one of the prime benefits of exec plugins is that no secrets need to be stored directly in the kubeconfig. |
ExecCredentialSpec
ExecCredentialSpec holds request and runtime specific information provided by the transport.
Field | Description |
---|---|
cluster Cluster | Cluster contains information to allow an exec plugin to communicate with the kubernetes cluster being authenticated to. Note that Cluster is non-nil only when provideClusterInfo is set to true in the exec provider config (i.e., ExecConfig.ProvideClusterInfo). |
interactive [Required] bool | Interactive declares whether stdin has been passed to this exec plugin. |
ExecCredentialStatus
ExecCredentialStatus holds credentials for the transport to use.
Token and ClientKeyData are sensitive fields. This data should only be transmitted in-memory between client and exec plugin process. Exec plugin itself should at least be protected via file permissions.
Field | Description |
---|---|
expirationTimestamp meta/v1.Time | ExpirationTimestamp indicates a time when the provided credentials expire. |
token [Required] string | Token is a bearer token used by the client for request authentication. |
clientCertificateData [Required] string | PEM-encoded client TLS certificates (including intermediates, if any). |
clientKeyData [Required] string | PEM-encoded private key for the above certificate. |
6.12.3 - kube-apiserver Audit Configuration (v1)
Resource Types
Event
Event captures all the information that can be included in an API audit log.
Field | Description |
---|---|
apiVersion string | audit.k8s.io/v1 |
kind string | Event |
level [Required] Level | AuditLevel at which the event was generated. |
auditID [Required] k8s.io/apimachinery/pkg/types.UID | Unique audit ID, generated for each request. |
stage [Required] Stage | Stage of the request handling when this event instance was generated. |
requestURI [Required] string | RequestURI is the request URI as sent by the client to a server. |
verb [Required] string | Verb is the kubernetes verb associated with the request. For non-resource requests, this is the lower-cased HTTP method. |
user [Required] authentication/v1.UserInfo | Authenticated user information. |
impersonatedUser authentication/v1.UserInfo | Impersonated user information. |
sourceIPs []string | Source IPs, from where the request originated and intermediate proxies. |
userAgent string | UserAgent records the user agent string reported by the client. Note that the UserAgent is provided by the client, and must not be trusted. |
objectRef ObjectReference | Object reference this request is targeted at. Does not apply for List-type requests, or non-resource requests. |
responseStatus meta/v1.Status | The response status, populated even when the ResponseObject is not a Status type. For successful responses, this will only include the Code and StatusSuccess. For non-status type error responses, this will be auto-populated with the error Message. |
requestObject k8s.io/apimachinery/pkg/runtime.Unknown | API object from the request, in JSON format. The RequestObject is recorded as-is in the request (possibly re-encoded as JSON), prior to version conversion, defaulting, admission or merging. It is an external versioned object type, and may not be a valid object on its own. Omitted for non-resource requests. Only logged at Request Level and higher. |
responseObject k8s.io/apimachinery/pkg/runtime.Unknown | API object returned in the response, in JSON. The ResponseObject is recorded after conversion to the external type, and serialized as JSON. Omitted for non-resource requests. Only logged at Response Level. |
requestReceivedTimestamp meta/v1.MicroTime | Time the request reached the apiserver. |
stageTimestamp meta/v1.MicroTime | Time the request reached the current audit stage. |
annotations map[string]string | Annotations is an unstructured key value map stored with an audit event that may be set by plugins invoked in the request serving chain, including authentication, authorization and admission plugins. Note that these annotations are for the audit event, and do not correspond to the metadata.annotations of the submitted object. Keys should uniquely identify the informing component to avoid name collisions (e.g. podsecuritypolicy.admission.k8s.io/policy). Values should be short. Annotations are included in the Metadata level. |
EventList
EventList is a list of audit Events.
Field | Description |
---|---|
apiVersion string | audit.k8s.io/v1 |
kind string | EventList |
metadata meta/v1.ListMeta | No description provided. |
items [Required] []Event | No description provided. |
Policy
Policy defines the configuration of audit logging, and the rules for how different request categories are logged.
Field | Description |
---|---|
apiVersion string | audit.k8s.io/v1 |
kind string | Policy |
metadata meta/v1.ObjectMeta | ObjectMeta is included for interoperability with API infrastructure. Refer to the Kubernetes API documentation for the fields of the metadata field. |
rules [Required] []PolicyRule | Rules specify the audit Level a request should be recorded at. A request may match multiple rules, in which case the FIRST matching rule is used. The default audit level is None, but can be overridden by a catch-all rule at the end of the list. PolicyRules are strictly ordered. |
omitStages []Stage | OmitStages is a list of stages for which no events are created. Note that this can also be specified per rule, in which case the union of both are omitted. |
omitManagedFields bool | OmitManagedFields indicates whether to omit the managed fields of the request and response bodies from being written to the API audit log. This is used as a global default - a value of 'true' will omit the managed fields, otherwise the managed fields will be included in the API audit log. Note that this can also be specified per rule, in which case the value specified in a rule will override the global default. |
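To make the Policy and PolicyRule fields concrete, a minimal audit policy file might look like the following sketch (the rule choices are illustrative):

apiVersion: audit.k8s.io/v1
kind: Policy
omitStages:
- "RequestReceived"        # generate no events for this stage
rules:
- level: RequestResponse   # log pod requests with request/response bodies
  resources:
  - group: ""              # core API group
    resources: ["pods"]
- level: Metadata          # catch-all: log everything else at Metadata level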
PolicyList
PolicyList is a list of audit Policies.
Field | Description |
---|---|
apiVersion string | audit.k8s.io/v1 |
kind string | PolicyList |
metadata meta/v1.ListMeta | No description provided. |
items [Required] []Policy | No description provided. |
GroupResources
GroupResources represents resource kinds in an API group.
Field | Description |
---|---|
group string | Group is the name of the API group that contains the resources. The empty string represents the core API group. |
resources []string | Resources is a list of resources this rule applies to. For example: 'pods' matches pods. 'pods/log' matches the log subresource of pods. '*' matches all resources and their subresources. 'pods/*' matches all subresources of pods. '*/scale' matches all scale subresources. If a wildcard is present, the validation rule will ensure resources do not overlap with each other. An empty list implies all resources and subresources in this API group apply. |
resourceNames []string | ResourceNames is a list of resource instance names that the policy matches. Using this field requires Resources to be specified. An empty list implies that every instance of the resource is matched. |
Level
(Alias of string)
Level defines the amount of information logged during auditing.
ObjectReference
ObjectReference contains enough information to let you inspect or modify the referred object.
Field | Description |
---|---|
resource string | No description provided. |
namespace string | No description provided. |
name string | No description provided. |
uid k8s.io/apimachinery/pkg/types.UID | No description provided. |
apiGroup string | APIGroup is the name of the API group that contains the referred object. The empty string represents the core API group. |
apiVersion string | APIVersion is the version of the API group that contains the referred object. |
resourceVersion string | No description provided. |
subresource string | No description provided. |
PolicyRule
PolicyRule maps requests based off metadata to an audit Level. Requests must match the rules of every field (an intersection of rules).
Field | Description |
---|---|
level [Required] Level | The Level that requests matching this rule are recorded at. |
users []string | The users (by authenticated user name) this rule applies to. An empty list implies every user. |
userGroups []string | The user groups this rule applies to. A user is considered matching if it is a member of any of the UserGroups. An empty list implies every user group. |
verbs []string | The verbs that match this rule. An empty list implies every verb. |
resources []GroupResources | Resources that this rule matches. An empty list implies all kinds in all API groups. |
namespaces []string | Namespaces that this rule matches. The empty string "" matches non-namespaced resources. An empty list implies every namespace. |
nonResourceURLs []string | NonResourceURLs is a set of URL paths that should be audited. '*'s are allowed, but only as the full, final step in the path. Examples: "/metrics" - Log requests for apiserver metrics; "/healthz*" - Log all health checks. |
omitStages []Stage | OmitStages is a list of stages for which no events are created. Note that this can also be specified policy wide, in which case the union of both are omitted. An empty list means no restrictions will apply. |
omitManagedFields bool | OmitManagedFields indicates whether to omit the managed fields of the request and response bodies from being written to the API audit log. |
Stage
(Alias of string)
Stage defines the stages in request handling during which audit events may be generated.
6.12.4 - kube-apiserver Configuration (v1)
Package v1 is the v1 version of the API.
Resource Types
AdmissionConfiguration
AdmissionConfiguration provides versioned configuration for admission controllers.
Field | Description |
---|---|
apiVersion string | apiserver.config.k8s.io/v1 |
kind string | AdmissionConfiguration |
plugins []AdmissionPluginConfiguration | Plugins allows specifying a configuration per admission control plugin. |
AdmissionPluginConfiguration
AdmissionPluginConfiguration provides the configuration for a single plug-in.
Field | Description |
---|---|
name [Required] string | Name is the name of the admission controller. It must match the registered admission plugin name. |
path string | Path is the path to a configuration file that contains the plugin's configuration. |
configuration k8s.io/apimachinery/pkg/runtime.Unknown | Configuration is an embedded configuration object to be used as the plugin's configuration. If present, it will be used instead of the path to the configuration file. |
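For illustration, an AdmissionConfiguration file that configures one plugin via a file path and another via an embedded object might look like this sketch (the plugin names, path, and limits are illustrative):

apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
- name: PodSecurity                        # must match the registered plugin name
  path: /etc/kubernetes/podsecurity.yaml   # illustrative path to the plugin's config file
- name: EventRateLimit
  configuration:                           # embedded configuration, used instead of a path
    apiVersion: eventratelimit.admission.k8s.io/v1alpha1
    kind: Configuration
    limits:
    - type: Server
      qps: 50
      burst: 100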
6.12.5 - kube-apiserver Configuration (v1alpha1)
Package v1alpha1 is the v1alpha1 version of the API.
Resource Types
AdmissionConfiguration
AdmissionConfiguration provides versioned configuration for admission controllers.
Field | Description |
---|---|
apiVersion string | apiserver.k8s.io/v1alpha1 |
kind string | AdmissionConfiguration |
plugins []AdmissionPluginConfiguration | Plugins allows specifying a configuration per admission control plugin. |
EgressSelectorConfiguration
EgressSelectorConfiguration provides versioned configuration for egress selector clients.
Field | Description |
---|---|
apiVersion string | apiserver.k8s.io/v1alpha1 |
kind string | EgressSelectorConfiguration |
egressSelections [Required] []EgressSelection | connectionServices contains a list of egress selection client configurations. |
TracingConfiguration
TracingConfiguration provides versioned configuration for tracing clients.
Field | Description |
---|---|
apiVersion string | apiserver.k8s.io/v1alpha1 |
kind string | TracingConfiguration |
endpoint string | Endpoint of the collector that's running on the control-plane node. The APIServer uses the egressType ControlPlane when sending data to the collector. The syntax is defined in https://github.com/grpc/grpc/blob/master/doc/naming.md. Defaults to the otlpgrpc default, localhost:4317. The connection is insecure, and does not support TLS. |
samplingRatePerMillion int32 | SamplingRatePerMillion is the number of samples to collect per million spans. Defaults to 0. |
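A minimal TracingConfiguration file following the defaults described above might look like this sketch (the sampling rate is illustrative):

apiVersion: apiserver.k8s.io/v1alpha1
kind: TracingConfiguration
endpoint: localhost:4317       # the otlpgrpc default collector address
samplingRatePerMillion: 100    # sample 100 of every million spans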
AdmissionPluginConfiguration
AdmissionPluginConfiguration provides the configuration for a single plug-in.
Field | Description |
---|---|
name [Required] string | Name is the name of the admission controller. It must match the registered admission plugin name. |
path string | Path is the path to a configuration file that contains the plugin's configuration. |
configuration k8s.io/apimachinery/pkg/runtime.Unknown | Configuration is an embedded configuration object to be used as the plugin's configuration. If present, it will be used instead of the path to the configuration file. |
Connection
Connection provides the configuration for a single egress selection client.
Field | Description |
---|---|
proxyProtocol [Required] ProtocolType | Protocol is the protocol used to connect from client to the konnectivity server. |
transport Transport | Transport defines the transport configurations we use to dial to the konnectivity server. This is required if ProxyProtocol is HTTPConnect or GRPC. |
EgressSelection
EgressSelection provides the configuration for a single egress selection client.
Field | Description |
---|---|
name [Required] string | name is the name of the egress selection. Currently supported values are "controlplane", "master", "etcd" and "cluster". The "master" egress selector is deprecated in favor of "controlplane". |
connection [Required] Connection | connection is the exact information used to configure the egress selection. |
ProtocolType
(Alias of string)
ProtocolType is a set of valid values for Connection.ProtocolType.
TCPTransport
TCPTransport provides the information to connect to konnectivity server via TCP
Field | Description |
---|---|
url [Required] string | URL is the location of the konnectivity server to connect to. As an example it might be "https://127.0.0.1:8131". |
tlsConfig TLSConfig | TLSConfig is the config needed to use TLS when connecting to the konnectivity server. |
TLSConfig
TLSConfig provides the authentication information to connect to konnectivity server Only used with TCPTransport
Field | Description |
---|---|
caBundle string | caBundle is the file location of the CA to be used to determine trust with the konnectivity server. Must be absent/empty if TCPTransport.URL is prefixed with http://. If absent while TCPTransport.URL is prefixed with https://, default to system trust roots. |
clientKey string | clientKey is the file location of the client key to be used in mtls handshakes with the konnectivity server. Must be absent/empty if TCPTransport.URL is prefixed with http://. Must be configured if TCPTransport.URL is prefixed with https://. |
clientCert string | clientCert is the file location of the client certificate to be used in mtls handshakes with the konnectivity server. Must be absent/empty if TCPTransport.URL is prefixed with http://. Must be configured if TCPTransport.URL is prefixed with https://. |
Transport
Transport defines the transport configurations we use to dial to the konnectivity server
Field | Description |
---|---|
tcp TCPTransport | TCP is the TCP configuration for communicating with the konnectivity server via TCP. ProxyProtocol of GRPC is not supported with TCP transport at the moment. Requires at least one of TCP or UDS to be set. |
uds UDSTransport | UDS is the UDS configuration for communicating with the konnectivity server via UDS. Requires at least one of TCP or UDS to be set. |
UDSTransport
UDSTransport provides the information to connect to konnectivity server via UDS
Field | Description |
---|---|
udsName [Required] string | UDSName is the name of the unix domain socket to connect to the konnectivity server. This does not use a unix:// prefix. (e.g. /etc/srv/kubernetes/konnectivity-server/konnectivity-server.socket) |
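Putting the types above together, an egress selector configuration that routes 'cluster' traffic to a konnectivity server over a unix domain socket might look like this sketch (the socket path is illustrative):

apiVersion: apiserver.k8s.io/v1alpha1
kind: EgressSelectorConfiguration
egressSelections:
- name: cluster
  connection:
    proxyProtocol: GRPC
    transport:
      uds:
        udsName: /etc/kubernetes/konnectivity-server/konnectivity-server.socket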
6.12.6 - kube-apiserver Encryption Configuration (v1)
Package v1 is the v1 version of the API.
Resource Types
EncryptionConfiguration
EncryptionConfiguration stores the complete configuration for encryption providers.
Field | Description |
---|---|
apiVersion string | apiserver.config.k8s.io/v1 |
kind string | EncryptionConfiguration |
resources [Required] []ResourceConfiguration | resources is a list containing resources, and their corresponding encryption providers. |
AESConfiguration
AESConfiguration contains the API configuration for an AES transformer.
Field | Description |
---|---|
keys [Required] []Key | keys is a list of keys to be used for creating the AES transformer. Each key has to be 32 bytes long for AES-CBC and 16, 24 or 32 bytes for AES-GCM. |
IdentityConfiguration
IdentityConfiguration is an empty struct to allow identity transformer in provider configuration.
KMSConfiguration
KMSConfiguration contains the name, cache size and path to configuration file for a KMS based envelope transformer.
Field | Description |
---|---|
name [Required] string | name is the name of the KMS plugin to be used. |
cachesize int32 | cachesize is the maximum number of secrets which are cached in memory. The default value is 1000. Set to a negative value to disable caching. |
endpoint [Required] string | endpoint is the gRPC server listening address, for example "unix:///var/run/kms-provider.sock". |
timeout meta/v1.Duration | timeout for gRPC calls to kms-plugin (e.g. 5s). The default is 3 seconds. |
Key
Key contains name and secret of the provided key for a transformer.
Field | Description |
---|---|
name [Required] string | name is the name of the key to be used while storing data to disk. |
secret [Required] string | secret is the actual key, encoded in base64. |
ProviderConfiguration
ProviderConfiguration stores the provided configuration for an encryption provider.
Field | Description |
---|---|
aesgcm [Required] AESConfiguration | aesgcm is the configuration for the AES-GCM transformer. |
aescbc [Required] AESConfiguration | aescbc is the configuration for the AES-CBC transformer. |
secretbox [Required] SecretboxConfiguration | secretbox is the configuration for the Secretbox based transformer. |
identity [Required] IdentityConfiguration | identity is the (empty) configuration for the identity transformer. |
kms [Required] KMSConfiguration | kms contains the name, cache size and path to configuration file for a KMS based envelope transformer. |
ResourceConfiguration
ResourceConfiguration stores per resource configuration.
Field | Description |
---|---|
resources [Required] []string | resources is a list of kubernetes resources which have to be encrypted. |
providers [Required] []ProviderConfiguration | providers is a list of transformers to be used for reading and writing the resources to disk, e.g. aesgcm, aescbc, secretbox, identity. |
SecretboxConfiguration
SecretboxConfiguration contains the API configuration for a Secretbox transformer.
Field | Description |
---|---|
keys [Required] []Key | keys is a list of keys to be used for creating the Secretbox transformer. Each key has to be 32 bytes long. |
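Combining the types above, an encryption configuration that encrypts Secrets with AES-CBC, and keeps the identity transformer as a fallback for reading existing unencrypted data, might look like this sketch (the key name is illustrative and the base64 secret is elided):

apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
- resources:
  - secrets
  providers:
  - aescbc:
      keys:
      - name: key1                           # illustrative key name
        secret: <base64-encoded 32-byte key>
  - identity: {}                             # fallback for reading unencrypted data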
6.12.7 - kube-proxy Configuration (v1alpha1)
Resource Types
KubeProxyConfiguration
KubeProxyConfiguration contains everything necessary to configure the Kubernetes proxy server.
Field | Description |
---|---|
apiVersion string | kubeproxy.config.k8s.io/v1alpha1 |
kind string | KubeProxyConfiguration |
featureGates [Required] map[string]bool | featureGates is a map of feature names to bools that enable or disable alpha/experimental features. |
bindAddress [Required] string | bindAddress is the IP address for the proxy server to serve on (set to 0.0.0.0 for all interfaces). |
healthzBindAddress [Required] string | healthzBindAddress is the IP address and port for the health check server to serve on, defaulting to 0.0.0.0:10256. |
metricsBindAddress [Required] string | metricsBindAddress is the IP address and port for the metrics server to serve on, defaulting to 127.0.0.1:10249 (set to 0.0.0.0 for all interfaces). |
bindAddressHardFail [Required] bool | bindAddressHardFail, if true, tells kube-proxy to treat failure to bind to a port as fatal and exit. |
enableProfiling [Required] bool | enableProfiling enables profiling via a web interface on the /debug/pprof handler. Profiling handlers will be handled by the metrics server. |
clusterCIDR [Required] string | clusterCIDR is the CIDR range of the pods in the cluster. It is used to bridge traffic coming from outside of the cluster. If not provided, no off-cluster bridging will be performed. |
hostnameOverride [Required] string | hostnameOverride, if non-empty, will be used as the identity instead of the actual hostname. |
clientConnection [Required] ClientConnectionConfiguration | clientConnection specifies the kubeconfig file and client connection settings for the proxy server to use when communicating with the apiserver. |
iptables [Required] KubeProxyIPTablesConfiguration | iptables contains iptables-related configuration options. |
ipvs [Required] KubeProxyIPVSConfiguration | ipvs contains ipvs-related configuration options. |
oomScoreAdj [Required] int32 | oomScoreAdj is the oom-score-adj value for the kube-proxy process. Values must be within the range [-1000, 1000]. |
mode [Required] ProxyMode | mode specifies which proxy mode to use. |
portRange [Required] string | portRange is the range of host ports (beginPort-endPort, inclusive) that may be consumed in order to proxy service traffic. If unspecified (0-0) then ports will be randomly chosen. |
udpIdleTimeout [Required] meta/v1.Duration | udpIdleTimeout is how long an idle UDP connection will be kept open (e.g. '250ms', '2s'). Must be greater than 0. Only applicable for proxyMode=userspace. |
conntrack [Required] KubeProxyConntrackConfiguration | conntrack contains conntrack-related configuration options. |
configSyncPeriod [Required] meta/v1.Duration | configSyncPeriod is how often configuration from the apiserver is refreshed. Must be greater than 0. |
nodePortAddresses [Required] []string | nodePortAddresses is the --nodeport-addresses value for the kube-proxy process. Values must be valid IP blocks, and are used to select the interfaces on which NodePort works. For example, to expose a service on localhost for local visits and on some other interfaces for a particular purpose, a list of IP blocks would do that. If set to "127.0.0.0/8", kube-proxy will only select the loopback interface for NodePort. If set to a non-zero IP block, kube-proxy will filter that down to just the IPs that apply to the node. An empty string slice is meant to select all network interfaces. |
winkernel [Required] KubeProxyWinkernelConfiguration | winkernel contains winkernel-related configuration options. |
showHiddenMetricsForVersion [Required] string | ShowHiddenMetricsForVersion is the version for which you want to show hidden metrics. |
detectLocalMode [Required] LocalMode | DetectLocalMode determines the mode to use for detecting local traffic; defaults to LocalModeClusterCIDR. |
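As an illustration, a kube-proxy configuration file that selects the ipvs mode might look like this sketch (the CIDR and sync period are illustrative):

apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
bindAddress: 0.0.0.0
clusterCIDR: 10.244.0.0/16    # illustrative pod CIDR
mode: "ipvs"
ipvs:
  scheduler: rr               # round-robin ipvs scheduler
  syncPeriod: 30s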
KubeProxyConntrackConfiguration
KubeProxyConntrackConfiguration contains conntrack settings for the Kubernetes proxy server.
Field | Description |
---|---|
maxPerCore [Required] int32 | maxPerCore is the maximum number of NAT connections to track per CPU core (0 to leave the limit as-is and ignore min). |
min [Required] int32 | min is the minimum value of connect-tracking records to allocate, regardless of conntrackMaxPerCore (set maxPerCore=0 to leave the limit as-is). |
tcpEstablishedTimeout [Required] meta/v1.Duration | tcpEstablishedTimeout is how long an idle TCP connection will be kept open (e.g. '2s'). Must be greater than 0 to set. |
tcpCloseWaitTimeout [Required] meta/v1.Duration | tcpCloseWaitTimeout is how long an idle conntrack entry in CLOSE_WAIT state will remain in the conntrack table (e.g. '60s'). Must be greater than 0 to set. |
KubeProxyIPTablesConfiguration
KubeProxyIPTablesConfiguration contains iptables-related configuration details for the Kubernetes proxy server.
Field | Description |
---|---|
masqueradeBit [Required] int32 | masqueradeBit is the bit of the iptables fwmark space to use for SNAT if using the pure iptables proxy mode. Values must be within the range [0, 31]. |
masqueradeAll [Required] bool | masqueradeAll tells kube-proxy to SNAT everything if using the pure iptables proxy mode. |
syncPeriod [Required] meta/v1.Duration | syncPeriod is the period that iptables rules are refreshed (e.g. '5s', '1m', '2h22m'). Must be greater than 0. |
minSyncPeriod [Required] meta/v1.Duration | minSyncPeriod is the minimum period that iptables rules are refreshed (e.g. '5s', '1m', '2h22m'). |
KubeProxyIPVSConfiguration
KubeProxyIPVSConfiguration contains ipvs-related configuration details for the Kubernetes proxy server.
Field | Description |
---|---|
syncPeriod [Required] meta/v1.Duration | syncPeriod is the period that ipvs rules are refreshed (e.g. '5s', '1m', '2h22m'). Must be greater than 0. |
minSyncPeriod [Required] meta/v1.Duration | minSyncPeriod is the minimum period that ipvs rules are refreshed (e.g. '5s', '1m', '2h22m'). |
scheduler [Required] string | scheduler is the ipvs scheduler to use. |
excludeCIDRs [Required] []string | excludeCIDRs is a list of CIDRs which the ipvs proxier should not touch when cleaning up ipvs services. |
strictARP [Required] bool | strictARP configures arp_ignore and arp_announce to avoid answering ARP queries from the kube-ipvs0 interface. |
tcpTimeout [Required] meta/v1.Duration | tcpTimeout is the timeout value used for idle IPVS TCP sessions. The default value is 0, which preserves the current timeout value on the system. |
tcpFinTimeout [Required] meta/v1.Duration | tcpFinTimeout is the timeout value used for IPVS TCP sessions after receiving a FIN. The default value is 0, which preserves the current timeout value on the system. |
udpTimeout [Required] meta/v1.Duration | udpTimeout is the timeout value used for IPVS UDP packets. The default value is 0, which preserves the current timeout value on the system. |
KubeProxyWinkernelConfiguration
KubeProxyWinkernelConfiguration contains Windows/HNS settings for the Kubernetes proxy server.
Field | Description |
---|---|
networkName [Required] string | networkName is the name of the network kube-proxy will use to create endpoints and policies. |
sourceVip [Required] string | sourceVip is the IP address of the source VIP endpoint used for NAT when load balancing. |
enableDSR [Required] bool | enableDSR tells kube-proxy whether HNS policies should be created with DSR. |
LocalMode
(Alias of string)
LocalMode represents modes to detect local traffic from the node.
ProxyMode
(Alias of string)
ProxyMode represents modes used by the Kubernetes proxy server.
Currently, three proxy modes are available on Linux: 'userspace' (older, going to be EOL), 'iptables' (newer, faster), and 'ipvs' (newest, better in performance and scalability).
Two proxy modes are available on Windows: 'userspace' (older, stable) and 'kernelspace' (newer, faster).
On Linux, if the proxy mode is blank, the best-available proxy is used (currently iptables, but this may change in the future). If the iptables proxy is selected, regardless of how, but the system's kernel or iptables version is insufficient, this always falls back to the userspace proxy. IPVS mode will be enabled when the proxy mode is set to 'ipvs', and the fallback path is first iptables and then userspace.
On Windows, if the proxy mode is blank, the best-available proxy is used (currently userspace, but this may change in the future). If the winkernel proxy is selected, regardless of how, but the Windows kernel can't support this mode of proxy, this always falls back to the userspace proxy.
ClientConnectionConfiguration
ClientConnectionConfiguration contains details for constructing a client.
Field | Description |
---|---|
kubeconfig [Required] string | kubeconfig is the path to a KubeConfig file. |
acceptContentTypes [Required] string | acceptContentTypes defines the Accept header sent by clients when connecting to a server, overriding the default value of 'application/json'. This field will control all connections to the server used by a particular client. |
contentType [Required] string | contentType is the content type used when sending data to the server from this client. |
qps [Required] float32 | qps controls the number of queries per second allowed for this connection. |
burst [Required] int32 | burst allows extra queries to accumulate when a client is exceeding its rate. |
DebuggingConfiguration
DebuggingConfiguration holds configuration for Debugging related features.
Field | Description |
---|---|
enableProfiling [Required] bool | enableProfiling enables profiling via a web interface at host:port/debug/pprof/. |
enableContentionProfiling [Required] bool | enableContentionProfiling enables lock contention profiling, if enableProfiling is true. |
FormatOptions
FormatOptions contains options for the different logging formats.
Field | Description |
---|---|
json [Required] JSONOptions | [Experimental] JSON contains options for logging format "json". |
JSONOptions
JSONOptions contains options for logging format "json".
Field | Description |
---|---|
splitStream [Required] bool | [Experimental] SplitStream redirects error messages to stderr while info messages go to stdout, with buffering. The default is to write both to stdout, without buffering. |
infoBufferSize [Required] k8s.io/apimachinery/pkg/api/resource.QuantityValue | [Experimental] InfoBufferSize sets the size of the info stream when using split streams. The default is zero, which disables buffering. |
LeaderElectionConfiguration
LeaderElectionConfiguration defines the configuration of leader election clients for components that can run with leader election enabled.
Field | Description |
---|---|
leaderElect [Required] bool | leaderElect enables a leader election client to gain leadership before executing the main loop. Enable this when running replicated components for high availability. |
leaseDuration [Required] meta/v1.Duration | leaseDuration is the duration that non-leader candidates will wait after observing a leadership renewal until attempting to acquire leadership of a led but unrenewed leader slot. This is effectively the maximum duration that a leader can be stopped before it is replaced by another candidate. This is only applicable if leader election is enabled. |
renewDeadline [Required] meta/v1.Duration | renewDeadline is the interval between attempts by the acting master to renew a leadership slot before it stops leading. This must be less than or equal to the lease duration. This is only applicable if leader election is enabled. |
retryPeriod [Required] meta/v1.Duration | retryPeriod is the duration the clients should wait between attempting acquisition and renewal of a leadership. This is only applicable if leader election is enabled. |
resourceLock [Required] string | resourceLock indicates the resource object type that will be used to lock during leader election cycles. |
resourceName [Required] string | resourceName indicates the name of the resource object that will be used to lock during leader election cycles. |
resourceNamespace [Required] string | resourceNamespace indicates the namespace of the resource object that will be used to lock during leader election cycles. |
LoggingConfiguration
LoggingConfiguration contains logging options. Refer to Logs Options for more information.
Field | Description |
---|---|
format [Required] string | Format Flag specifies the structure of log messages. The default value of format is 'text'. |
flushFrequency [Required] time.Duration | Maximum number of seconds between log flushes. Ignored if the selected logging backend writes log messages without buffering. |
verbosity [Required] uint32 | Verbosity is the threshold that determines which log messages are logged. Default is zero which logs only the most important messages. Higher values enable additional messages. Error messages are always logged. |
vmodule [Required] VModuleConfiguration | VModule overrides the verbosity threshold for individual files. Only supported for "text" log format. |
sanitization [Required] bool | [Experimental] When enabled prevents logging of fields tagged as sensitive (passwords, keys, tokens). Runtime log sanitization may introduce significant computation overhead and therefore should not be enabled in production. |
options [Required] FormatOptions | [Experimental] Options holds additional parameters that are specific to the different logging formats. Only the options for the selected format get used, but all of them get validated. |
VModuleConfiguration
(Alias of []k8s.io/component-base/config/v1alpha1.VModuleItem)
VModuleConfiguration is a collection of individual file names or patterns and the corresponding verbosity threshold.
6.12.8 - kube-scheduler Configuration (v1beta2)
Resource Types
- DefaultPreemptionArgs
- InterPodAffinityArgs
- KubeSchedulerConfiguration
- NodeAffinityArgs
- NodeResourcesBalancedAllocationArgs
- NodeResourcesFitArgs
- PodTopologySpreadArgs
- VolumeBindingArgs
ClientConnectionConfiguration
ClientConnectionConfiguration contains details for constructing a client.
Field | Description |
---|---|
kubeconfig [Required] string | kubeconfig is the path to a KubeConfig file. |
acceptContentTypes [Required] string | acceptContentTypes defines the Accept header sent by clients when connecting to a server, overriding the default value of 'application/json'. This field will control all connections to the server used by a particular client. |
contentType [Required] string | contentType is the content type used when sending data to the server from this client. |
qps [Required] float32 | qps controls the number of queries per second allowed for this connection. |
burst [Required] int32 | burst allows extra queries to accumulate when a client is exceeding its rate. |
DebuggingConfiguration
DebuggingConfiguration holds configuration for Debugging related features.
Field | Description |
---|---|
enableProfiling [Required] bool | enableProfiling enables profiling via a web interface at host:port/debug/pprof/. |
enableContentionProfiling [Required] bool | enableContentionProfiling enables lock contention profiling, if enableProfiling is true. |
FormatOptions
FormatOptions contains options for the different logging formats.
Field | Description |
---|---|
json [Required] JSONOptions | [Experimental] JSON contains options for logging format "json". |
JSONOptions
JSONOptions contains options for logging format "json".
Field | Description |
---|---|
splitStream [Required] bool | [Experimental] SplitStream redirects error messages to stderr while info messages go to stdout, with buffering. The default is to write both to stdout, without buffering. |
infoBufferSize [Required] k8s.io/apimachinery/pkg/api/resource.QuantityValue | [Experimental] InfoBufferSize sets the size of the info stream when using split streams. The default is zero, which disables buffering. |
LeaderElectionConfiguration
LeaderElectionConfiguration defines the configuration of leader election clients for components that can run with leader election enabled.
Field | Description |
---|---|
leaderElect [Required] bool | leaderElect enables a leader election client to gain leadership before executing the main loop. Enable this when running replicated components for high availability. |
leaseDuration [Required] meta/v1.Duration | leaseDuration is the duration that non-leader candidates will wait after observing a leadership renewal until attempting to acquire leadership of a led but unrenewed leader slot. This is effectively the maximum duration that a leader can be stopped before it is replaced by another candidate. This is only applicable if leader election is enabled. |
renewDeadline [Required] meta/v1.Duration | renewDeadline is the interval between attempts by the acting master to renew a leadership slot before it stops leading. This must be less than or equal to the lease duration. This is only applicable if leader election is enabled. |
retryPeriod [Required] meta/v1.Duration | retryPeriod is the duration the clients should wait between attempting acquisition and renewal of a leadership. This is only applicable if leader election is enabled. |
resourceLock [Required] string | resourceLock indicates the resource object type that will be used to lock during leader election cycles. |
resourceName [Required] string | resourceName indicates the name of the resource object that will be used to lock during leader election cycles. |
resourceNamespace [Required] string | resourceNamespace indicates the namespace of the resource object that will be used to lock during leader election cycles. |
LoggingConfiguration
LoggingConfiguration contains logging options. Refer to Logs Options for more information.
Field | Description |
---|---|
format [Required] string | Format Flag specifies the structure of log messages. The default value of format is 'text'. |
flushFrequency [Required] time.Duration | Maximum number of seconds between log flushes. Ignored if the selected logging backend writes log messages without buffering. |
verbosity [Required] uint32 | Verbosity is the threshold that determines which log messages are logged. Default is zero which logs only the most important messages. Higher values enable additional messages. Error messages are always logged. |
vmodule [Required] VModuleConfiguration | VModule overrides the verbosity threshold for individual files. Only supported for "text" log format. |
sanitization [Required] bool | [Experimental] When enabled prevents logging of fields tagged as sensitive (passwords, keys, tokens). Runtime log sanitization may introduce significant computation overhead and therefore should not be enabled in production. |
options [Required] FormatOptions | [Experimental] Options holds additional parameters that are specific to the different logging formats. Only the options for the selected format get used, but all of them get validated. |
VModuleConfiguration
(Alias of []k8s.io/component-base/config/v1alpha1.VModuleItem)
VModuleConfiguration is a collection of individual file names or patterns and the corresponding verbosity threshold.
DefaultPreemptionArgs
DefaultPreemptionArgs holds arguments used to configure the DefaultPreemption plugin.
Field | Description |
---|---|
apiVersion string | kubescheduler.config.k8s.io/v1beta2 |
kind string | DefaultPreemptionArgs |
minCandidateNodesPercentage [Required] int32 | MinCandidateNodesPercentage is the minimum number of candidates to shortlist when dry running preemption as a percentage of the number of nodes. Must be in the range [0, 100]. Defaults to 10% of the cluster size if unspecified. |
minCandidateNodesAbsolute [Required] int32 | MinCandidateNodesAbsolute is the absolute minimum number of candidates to shortlist. The likely number of candidates enumerated for dry running preemption is given by the formula: numCandidates = max(numNodes * minCandidateNodesPercentage, minCandidateNodesAbsolute). We say "likely" because there are other factors such as PDB violations that play a role in the number of candidates shortlisted. Must be at least 0 nodes. Defaults to 100 nodes if unspecified. |
InterPodAffinityArgs
InterPodAffinityArgs holds arguments used to configure the InterPodAffinity plugin.
Field | Description |
---|---|
apiVersion string | kubescheduler.config.k8s.io/v1beta2 |
kind string | InterPodAffinityArgs |
hardPodAffinityWeight [Required] int32 | HardPodAffinityWeight is the scoring weight for existing pods with a matching hard affinity to the incoming pod. |
KubeSchedulerConfiguration
KubeSchedulerConfiguration configures a scheduler
Field | Description |
---|---|
apiVersion string | kubescheduler.config.k8s.io/v1beta2 |
kind string | KubeSchedulerConfiguration |
parallelism [Required] int32 | Parallelism defines the amount of parallelism in algorithms for scheduling Pods. Must be greater than 0. Defaults to 16. |
leaderElection [Required] LeaderElectionConfiguration | LeaderElection defines the configuration of the leader election client. |
clientConnection [Required] ClientConnectionConfiguration | ClientConnection specifies the kubeconfig file and client connection settings for the proxy server to use when communicating with the apiserver. |
healthzBindAddress [Required] string | Note: Both HealthzBindAddress and MetricsBindAddress fields are deprecated. Only an empty address or port 0 is allowed. Anything else will fail validation. HealthzBindAddress is the IP address and port for the health check server to serve on. |
metricsBindAddress [Required] string | MetricsBindAddress is the IP address and port for the metrics server to serve on. |
DebuggingConfiguration [Required] DebuggingConfiguration | (Members of DebuggingConfiguration are embedded into this type.) DebuggingConfiguration holds configuration for Debugging related features. |
percentageOfNodesToScore [Required] int32 | PercentageOfNodesToScore is the percentage of all nodes that, once found feasible for running a pod, the scheduler stops its search for more feasible nodes in the cluster. This helps improve the scheduler's performance. The scheduler always tries to find at least "minFeasibleNodesToFind" feasible nodes no matter what the value of this flag is. Example: if the cluster size is 500 nodes and the value of this flag is 30, then the scheduler stops finding further feasible nodes once it finds 150 feasible ones. When the value is 0, a default percentage (5%--50%, based on the size of the cluster) of the nodes will be scored. |
podInitialBackoffSeconds [Required] int64 | PodInitialBackoffSeconds is the initial backoff for unschedulable pods. If specified, it must be greater than 0. If this value is null, the default value (1s) will be used. |
podMaxBackoffSeconds [Required] int64 | PodMaxBackoffSeconds is the max backoff for unschedulable pods. If specified, it must be greater than podInitialBackoffSeconds. If this value is null, the default value (10s) will be used. |
profiles [Required] []KubeSchedulerProfile | Profiles are scheduling profiles that kube-scheduler supports. Pods can choose to be scheduled under a particular profile by setting its associated scheduler name. Pods that don't specify any scheduler name are scheduled with the "default-scheduler" profile, if present here. |
extenders [Required] []Extender | Extenders are the list of scheduler extenders, each holding the values of how to communicate with the extender. These extenders are shared by all scheduler profiles. |
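For illustration, a scheduler configuration with a single profile that overrides one plugin argument might look like this sketch (the kubeconfig path and weight are illustrative; a profile's pluginConfig entries carry the *Args types documented below):

apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
clientConnection:
  kubeconfig: /etc/kubernetes/scheduler.conf   # illustrative path
profiles:
- schedulerName: default-scheduler
  pluginConfig:
  - name: InterPodAffinity
    args:
      apiVersion: kubescheduler.config.k8s.io/v1beta2
      kind: InterPodAffinityArgs
      hardPodAffinityWeight: 2                 # illustrative weight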
NodeAffinityArgs
NodeAffinityArgs holds arguments to configure the NodeAffinity plugin.
Field | Description |
---|---|
apiVersion string | kubescheduler.config.k8s.io/v1beta2 |
kind string | NodeAffinityArgs |
addedAffinity core/v1.NodeAffinity | AddedAffinity is applied to all Pods additionally to the NodeAffinity specified in the PodSpec. That is, Nodes need to satisfy AddedAffinity AND .spec.NodeAffinity. AddedAffinity is empty by default (all Nodes match). When AddedAffinity is used, some Pods with affinity requirements that match a specific Node (such as DaemonSet Pods) might remain unschedulable. |
NodeResourcesBalancedAllocationArgs
NodeResourcesBalancedAllocationArgs holds arguments used to configure NodeResourcesBalancedAllocation plugin.
Field | Description |
---|---|
apiVersion string | kubescheduler.config.k8s.io/v1beta2 |
kind string | NodeResourcesBalancedAllocationArgs |
resources [Required] []ResourceSpec | Resources to be managed; the default is "cpu" and "memory" if not specified. |
NodeResourcesFitArgs
NodeResourcesFitArgs holds arguments used to configure the NodeResourcesFit plugin.
Field | Description |
---|---|
apiVersion string | kubescheduler.config.k8s.io/v1beta2 |
kind string | NodeResourcesFitArgs |
ignoredResources [Required] []string | IgnoredResources is the list of resources that the NodeResources fit filter should ignore. This doesn't apply to scoring. |
ignoredResourceGroups [Required] []string | IgnoredResourceGroups defines the list of resource groups that the NodeResources fit filter should ignore, e.g. if the group is ["example.com"], it will ignore all resource names that begin with "example.com", such as "example.com/aaa" and "example.com/bbb". A resource group name can't contain '/'. This doesn't apply to scoring. |
scoringStrategy [Required] ScoringStrategy | ScoringStrategy selects the node resource scoring strategy. The default strategy is LeastAllocated with an equal "cpu" and "memory" weight. |
PodTopologySpreadArgs
PodTopologySpreadArgs holds arguments used to configure the PodTopologySpread plugin.
Field | Description |
---|---|
apiVersion string | kubescheduler.config.k8s.io/v1beta2 |
kind string | PodTopologySpreadArgs |
defaultConstraints []core/v1.TopologySpreadConstraint | DefaultConstraints defines topology spread constraints to be applied to Pods that don't define any in pod.spec.topologySpreadConstraints. |
defaultingType PodTopologySpreadConstraintsDefaulting | DefaultingType determines how .defaultConstraints are deduced. Can be one of "System" or "List". Defaults to "List" if the feature gate DefaultPodTopologySpread is disabled and to "System" if enabled. |
VolumeBindingArgs
VolumeBindingArgs holds arguments used to configure the VolumeBinding plugin.
Field | Description |
---|---|
apiVersion string | kubescheduler.config.k8s.io/v1beta2 |
kind string | VolumeBindingArgs |
bindTimeoutSeconds [Required] int64 | BindTimeoutSeconds is the timeout in seconds in the volume binding operation. Value must be a non-negative integer. The value zero indicates no waiting. If this value is nil, the default value (600) will be used. |
shape []UtilizationShapePoint | Shape specifies the points defining the score function shape, which is used to score nodes based on the utilization of statically provisioned PVs. The utilization is calculated by dividing the total requested storage of the pod by the total capacity of feasible PVs on each node. Each point contains utilization (ranges from 0 to 100) and its associated score (ranges from 0 to 10). You can tune the priority by specifying different scores for different utilization numbers. The default shape points give a score of 0 at 0% utilization and 10 at 100% utilization. |
Extender
Extender holds the parameters used to communicate with the extender. If a verb is unspecified/empty, it is assumed that the extender chose not to provide that extension.
Field | Description |
---|---|
urlPrefix [Required] string | URLPrefix at which the extender is available. |
filterVerb [Required] string | Verb for the filter call, empty if not supported. This verb is appended to the URLPrefix when issuing the filter call to the extender. |
preemptVerb [Required] string | Verb for the preempt call, empty if not supported. This verb is appended to the URLPrefix when issuing the preempt call to the extender. |
prioritizeVerb [Required] string | Verb for the prioritize call, empty if not supported. This verb is appended to the URLPrefix when issuing the prioritize call to the extender. |
weight [Required] int64 | The numeric multiplier for the node scores that the prioritize call generates. The weight should be a positive integer. |
bindVerb [Required] string | Verb for the bind call, empty if not supported. This verb is appended to the URLPrefix when issuing the bind call to the extender. If this method is implemented by the extender, it is the extender's responsibility to bind the pod to the apiserver. Only one extender can implement this function. |
enableHTTPS [Required] bool | EnableHTTPS specifies whether https should be used to communicate with the extender. |
tlsConfig [Required] ExtenderTLSConfig | TLSConfig specifies the transport layer security config. |
httpTimeout [Required] meta/v1.Duration | HTTPTimeout specifies the timeout duration for a call to the extender. A filter timeout fails the scheduling of the pod. A prioritize timeout is ignored; k8s/other extenders' priorities are used to select the node. |
nodeCacheCapable [Required] bool | NodeCacheCapable specifies that the extender is capable of caching node information, so the scheduler should only send minimal information about the eligible nodes, assuming that the extender already cached full details of all nodes in the cluster. |
managedResources []ExtenderManagedResource | ManagedResources is a list of extended resources that are managed by this extender. |
ignorable [Required] bool | Ignorable specifies if the extender is ignorable, i.e. scheduling should not fail when the extender returns an error or is not reachable. |
ExtenderManagedResource
Appears in:
ExtenderManagedResource describes the arguments of extended resources managed by an extender.
Field | Description |
---|---|
name [Required]string | Name is the extended resource name. |
ignoredByScheduler [Required]bool | IgnoredByScheduler indicates whether kube-scheduler should ignore this resource when applying predicates. |
ExtenderTLSConfig
Appears in:
ExtenderTLSConfig contains settings to enable TLS with the extender.
Field | Description |
---|---|
insecure [Required]bool | Server should be accessed without verifying the TLS certificate. For testing only. |
serverName [Required]string | ServerName is passed to the server for SNI and is used in the client to check server certificates against. If ServerName is empty, the hostname used to contact the server is used. |
certFile [Required]string | Client certificate file to present when the server requires TLS client certificate authentication. |
keyFile [Required]string | Client key file to use when the server requires TLS client certificate authentication. |
caFile [Required]string | Trusted root certificates for the server. |
certData [Required][]byte | CertData holds PEM-encoded bytes (typically read from a client certificate file). CertData takes precedence over CertFile. |
keyData [Required][]byte | KeyData holds PEM-encoded bytes (typically read from a client certificate key file). KeyData takes precedence over KeyFile. |
caData [Required][]byte | CAData holds PEM-encoded bytes (typically read from a root certificates bundle). CAData takes precedence over CAFile. |
KubeSchedulerProfile
Appears in:
KubeSchedulerProfile is a scheduling profile.
Field | Description |
---|---|
schedulerName [Required]string | SchedulerName is the name of the scheduler associated with this profile. If SchedulerName matches the pod's "spec.schedulerName", then the pod is scheduled with this profile. |
plugins [Required]Plugins | Plugins specify the set of plugins that should be enabled or disabled. Enabled plugins are the ones that should be enabled in addition to the default plugins. Disabled plugins are any of the default plugins that should be disabled. When no enabled or disabled plugin is specified for an extension point, default plugins for that extension point are used, if any. If a QueueSort plugin is specified, the same QueueSort Plugin and PluginConfig must be specified for all profiles. |
pluginConfig [Required][]PluginConfig | PluginConfig is an optional set of custom plugin arguments for each plugin. Omitting config args for a plugin is equivalent to using the default config for that plugin. |
Plugin
Appears in:
Plugin specifies a plugin name and its weight when applicable. Weight is used only for Score plugins.
Field | Description |
---|---|
name [Required]string | Name defines the name of the plugin. |
weight [Required]int32 | Weight defines the weight of the plugin; only used for Score plugins. |
PluginConfig
Appears in:
PluginConfig specifies arguments that should be passed to a plugin at the time of initialization. A plugin that is invoked at multiple extension points is initialized once. Args can have arbitrary structure. It is up to the plugin to process these Args.
Field | Description |
---|---|
name [Required]string | Name defines the name of the plugin being configured. |
args [Required]k8s.io/apimachinery/pkg/runtime.RawExtension | Args defines the arguments passed to the plugins at the time of initialization. Args can have arbitrary structure. |
PluginSet
Appears in:
PluginSet specifies enabled and disabled plugins for an extension point. If an array is empty, missing, or nil, default plugins at that extension point will be used.
Field | Description |
---|---|
enabled [Required][]Plugin | Enabled specifies plugins that should be enabled in addition to default plugins. If a default plugin is also configured in the scheduler config file, the weight of the plugin is overridden accordingly. These are called after default plugins and in the same order specified here. |
disabled [Required][]Plugin | Disabled specifies default plugins that should be disabled. When all default plugins need to be disabled, an array containing only one "*" should be provided. |
Plugins
Appears in:
Plugins include multiple extension points. When specified, the list of plugins for a particular extension point are the only ones enabled. If an extension point is omitted from the config, then the default set of plugins is used for that extension point. Enabled plugins are called in the order specified here, after default plugins. If they need to be invoked before default plugins, default plugins must be disabled and re-enabled here in the desired order.
Field | Description |
---|---|
queueSort [Required]PluginSet | QueueSort is a list of plugins that should be invoked when sorting pods in the scheduling queue. |
preFilter [Required]PluginSet | PreFilter is a list of plugins that should be invoked at the "PreFilter" extension point of the scheduling framework. |
filter [Required]PluginSet | Filter is a list of plugins that should be invoked when filtering out nodes that cannot run the Pod. |
postFilter [Required]PluginSet | PostFilter is a list of plugins that are invoked after the filtering phase, but only when no feasible nodes were found for the pod. |
preScore [Required]PluginSet | PreScore is a list of plugins that are invoked before scoring. |
score [Required]PluginSet | Score is a list of plugins that should be invoked when ranking nodes that have passed the filtering phase. |
reserve [Required]PluginSet | Reserve is a list of plugins invoked when reserving/unreserving resources after a node is assigned to run the pod. |
permit [Required]PluginSet | Permit is a list of plugins that control the binding of a Pod. These plugins can prevent or delay the binding of a Pod. |
preBind [Required]PluginSet | PreBind is a list of plugins that should be invoked before a pod is bound. |
bind [Required]PluginSet | Bind is a list of plugins that should be invoked at the "Bind" extension point of the scheduling framework. The scheduler calls these plugins in order, and skips the rest as soon as one returns success. |
postBind [Required]PluginSet | PostBind is a list of plugins that should be invoked after a pod is successfully bound. |
multiPoint [Required]PluginSet | MultiPoint is a simplified config section to enable plugins for all valid extension points. |
PodTopologySpreadConstraintsDefaulting
(Alias of string)
Appears in:
PodTopologySpreadConstraintsDefaulting defines how to set default constraints for the PodTopologySpread plugin.
RequestedToCapacityRatioParam
Appears in:
RequestedToCapacityRatioParam defines the parameters for the RequestedToCapacityRatio strategy.
Field | Description |
---|---|
shape [Required][]UtilizationShapePoint | Shape is a list of points defining the scoring function shape. |
ResourceSpec
Appears in:
ResourceSpec represents a single resource.
Field | Description |
---|---|
name [Required]string | Name of the resource. |
weight [Required]int64 | Weight of the resource. |
ScoringStrategy
Appears in:
ScoringStrategy defines a ScoringStrategyType for the node resource plugin.
Field | Description |
---|---|
type [Required]ScoringStrategyType | Type selects which strategy to run. |
resources [Required][]ResourceSpec | Resources to consider when scoring. The default resource set includes "cpu" and "memory" with an equal weight. Allowed weights go from 1 to 100. Weight defaults to 1 if not specified or explicitly set to 0. |
requestedToCapacityRatio [Required]RequestedToCapacityRatioParam | Arguments specific to the RequestedToCapacityRatio strategy. |
ScoringStrategyType
(Alias of string)
Appears in:
ScoringStrategyType is the type of scoring strategy used in the NodeResourcesFit plugin.
UtilizationShapePoint
Appears in:
UtilizationShapePoint represents a single point of the priority function shape.
Field | Description |
---|---|
utilization [Required]int32 | Utilization (x axis). Valid values are 0 to 100. A fully utilized node maps to 100. |
score [Required]int32 | Score assigned to the given utilization (y axis). Valid values are 0 to 10. |
6.12.9 - kube-scheduler Configuration (v1beta3)
Resource Types
- DefaultPreemptionArgs
- InterPodAffinityArgs
- KubeSchedulerConfiguration
- NodeAffinityArgs
- NodeResourcesBalancedAllocationArgs
- NodeResourcesFitArgs
- PodTopologySpreadArgs
- VolumeBindingArgs
DefaultPreemptionArgs
DefaultPreemptionArgs holds arguments used to configure the DefaultPreemption plugin.
Field | Description |
---|---|
apiVersion string | kubescheduler.config.k8s.io/v1beta3 |
kind string | DefaultPreemptionArgs |
minCandidateNodesPercentage [Required]int32 | MinCandidateNodesPercentage is the minimum number of candidates to shortlist when dry running preemption, as a percentage of the number of nodes. Must be in the range [0, 100]. Defaults to 10% of the cluster size if unspecified. |
minCandidateNodesAbsolute [Required]int32 | MinCandidateNodesAbsolute is the absolute minimum number of candidates to shortlist. The likely number of candidates enumerated for dry running preemption is given by the formula: numCandidates = max(numNodes * minCandidateNodesPercentage, minCandidateNodesAbsolute). We say "likely" because other factors, such as PDB violations, play a role in the number of candidates shortlisted. Must be at least 0 nodes. Defaults to 100 nodes if unspecified. |
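As a hedged illustration (not part of the reference itself), these two fields can be set through a profile's pluginConfig; the numbers below are arbitrary examples, not recommended values:
apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  pluginConfig:
  - name: DefaultPreemption
    args:
      minCandidateNodesPercentage: 20   # example value
      minCandidateNodesAbsolute: 100    # example value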
InterPodAffinityArgs
InterPodAffinityArgs holds arguments used to configure the InterPodAffinity plugin.
Field | Description |
---|---|
apiVersion string | kubescheduler.config.k8s.io/v1beta3 |
kind string | InterPodAffinityArgs |
hardPodAffinityWeight [Required]int32 | HardPodAffinityWeight is the scoring weight for existing pods with a matching hard affinity to the incoming pod. |
KubeSchedulerConfiguration
KubeSchedulerConfiguration configures a scheduler.
Field | Description |
---|---|
apiVersion string | kubescheduler.config.k8s.io/v1beta3 |
kind string | KubeSchedulerConfiguration |
parallelism [Required]int32 | Parallelism defines the amount of parallelism in algorithms for scheduling Pods. Must be greater than 0. Defaults to 16. |
leaderElection [Required]LeaderElectionConfiguration | LeaderElection defines the configuration of the leader election client. |
clientConnection [Required]ClientConnectionConfiguration | ClientConnection specifies the kubeconfig file and client connection settings to use when communicating with the apiserver. |
DebuggingConfiguration [Required]DebuggingConfiguration | (Members of DebuggingConfiguration are embedded into this type.) DebuggingConfiguration holds configuration for debugging related features. |
percentageOfNodesToScore [Required]int32 | PercentageOfNodesToScore is the percentage of all nodes that, once found feasible for running a pod, makes the scheduler stop searching for more feasible nodes in the cluster. This helps improve the scheduler's performance. The scheduler always tries to find at least "minFeasibleNodesToFind" feasible nodes no matter what the value of this flag is. Example: if the cluster size is 500 nodes and the value of this flag is 30, the scheduler stops looking for further feasible nodes once it finds 150. When the value is 0, a default percentage (5%--50%, based on the size of the cluster) of the nodes is scored. |
podInitialBackoffSeconds [Required]int64 | PodInitialBackoffSeconds is the initial backoff for unschedulable pods. If specified, it must be greater than 0. If this value is null, the default value (1s) is used. |
podMaxBackoffSeconds [Required]int64 | PodMaxBackoffSeconds is the max backoff for unschedulable pods. If specified, it must be greater than podInitialBackoffSeconds. If this value is null, the default value (10s) is used. |
profiles [Required][]KubeSchedulerProfile | Profiles are scheduling profiles that kube-scheduler supports. Pods can choose to be scheduled under a particular profile by setting its associated scheduler name. Pods that don't specify any scheduler name are scheduled with the "default-scheduler" profile, if present here. |
extenders [Required][]Extender | Extenders are the list of scheduler extenders, each holding the values for how to communicate with the extender. These extenders are shared by all scheduler profiles. |
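For orientation, a minimal configuration file could look like the sketch below; the kubeconfig path is an assumption, and all other fields fall back to the defaults described above:
apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration
clientConnection:
  kubeconfig: /etc/kubernetes/scheduler.conf   # assumed path
leaderElection:
  leaderElect: true
profiles:
- schedulerName: default-scheduler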
NodeAffinityArgs
NodeAffinityArgs holds arguments to configure the NodeAffinity plugin.
Field | Description |
---|---|
apiVersion string | kubescheduler.config.k8s.io/v1beta3 |
kind string | NodeAffinityArgs |
addedAffinity core/v1.NodeAffinity | AddedAffinity is applied to all Pods in addition to the NodeAffinity specified in the PodSpec. That is, Nodes need to satisfy AddedAffinity AND .spec.NodeAffinity. AddedAffinity is empty by default (all Nodes match). When AddedAffinity is used, some Pods with affinity requirements that match a specific Node (such as DaemonSet Pods) might remain unschedulable. |
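A sketch of how addedAffinity might be used to pin a whole profile to a subset of nodes; the label key and value are placeholders for illustration:
apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  pluginConfig:
  - name: NodeAffinity
    args:
      addedAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: topology.kubernetes.io/zone   # placeholder label key
              operator: In
              values:
              - zone-a                           # placeholder value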
NodeResourcesBalancedAllocationArgs
NodeResourcesBalancedAllocationArgs holds arguments used to configure the NodeResourcesBalancedAllocation plugin.
Field | Description |
---|---|
apiVersion string | kubescheduler.config.k8s.io/v1beta3 |
kind string | NodeResourcesBalancedAllocationArgs |
resources [Required][]ResourceSpec | Resources to be managed; the default is "cpu" and "memory" if not specified. |
NodeResourcesFitArgs
NodeResourcesFitArgs holds arguments used to configure the NodeResourcesFit plugin.
Field | Description |
---|---|
apiVersion string | kubescheduler.config.k8s.io/v1beta3 |
kind string | NodeResourcesFitArgs |
ignoredResources [Required][]string | IgnoredResources is the list of resources that the NodeResources fit filter should ignore. This doesn't apply to scoring. |
ignoredResourceGroups [Required][]string | IgnoredResourceGroups defines the list of resource groups that the NodeResources fit filter should ignore. For example, if the group is ["example.com"], all resource names that begin with "example.com" are ignored, such as "example.com/aaa" and "example.com/bbb". A resource group name can't contain '/'. This doesn't apply to scoring. |
scoringStrategy [Required]ScoringStrategy | ScoringStrategy selects the node resource scoring strategy. The default strategy is LeastAllocated with equal "cpu" and "memory" weights. |
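A sketch of selecting a non-default scoring strategy for NodeResourcesFit; MostAllocated is one of the strategy types, and the weights are illustrative:
apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: MostAllocated
        resources:
        - name: cpu
          weight: 1
        - name: memory
          weight: 1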
PodTopologySpreadArgs
PodTopologySpreadArgs holds arguments used to configure the PodTopologySpread plugin.
Field | Description |
---|---|
apiVersion string | kubescheduler.config.k8s.io/v1beta3 |
kind string | PodTopologySpreadArgs |
defaultConstraints []core/v1.TopologySpreadConstraint | DefaultConstraints defines topology spread constraints to be applied to Pods that don't define any in pod.spec.topologySpreadConstraints. |
defaultingType PodTopologySpreadConstraintsDefaulting | DefaultingType determines how .defaultConstraints are deduced. Can be one of "System" or "List". Defaults to "List" if the DefaultPodTopologySpread feature gate is disabled, and to "System" if enabled. |
VolumeBindingArgs
VolumeBindingArgs holds arguments used to configure the VolumeBinding plugin.
Field | Description |
---|---|
apiVersion string | kubescheduler.config.k8s.io/v1beta3 |
kind string | VolumeBindingArgs |
bindTimeoutSeconds [Required]int64 | BindTimeoutSeconds is the timeout in seconds for volume binding operations. The value must be a non-negative integer; zero indicates no waiting. If this value is nil, the default value (600) is used. |
shape []UtilizationShapePoint | Shape specifies the points defining the score function shape, which is used to score nodes based on the utilization of statically provisioned PVs. The utilization is calculated by dividing the total requested storage of the pod by the total capacity of feasible PVs on each node. Each point contains utilization (ranging from 0 to 100) and its associated score (ranging from 0 to 10). You can tune the priority by specifying different scores for different utilization numbers. The default shape gives score 0 at utilization 0 and score 10 at utilization 100. |
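A sketch of a custom shape matching the ranges described above; the two points are illustrative (they mirror a simple 0-to-10 ramp) rather than tuned values:
apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  pluginConfig:
  - name: VolumeBinding
    args:
      bindTimeoutSeconds: 600
      shape:
      - utilization: 0
        score: 0
      - utilization: 100
        score: 10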
Extender
Appears in:
Extender holds the parameters used to communicate with the extender. If a verb is unspecified or empty, it is assumed that the extender chose not to provide that extension.
Field | Description |
---|---|
urlPrefix [Required]string | URLPrefix is the URL prefix at which the extender is available. |
filterVerb [Required]string | Verb for the filter call, empty if not supported. This verb is appended to the URLPrefix when issuing the filter call to the extender. |
preemptVerb [Required]string | Verb for the preempt call, empty if not supported. This verb is appended to the URLPrefix when issuing the preempt call to the extender. |
prioritizeVerb [Required]string | Verb for the prioritize call, empty if not supported. This verb is appended to the URLPrefix when issuing the prioritize call to the extender. |
weight [Required]int64 | The numeric multiplier for the node scores that the prioritize call generates. The weight should be a positive integer. |
bindVerb [Required]string | Verb for the bind call, empty if not supported. This verb is appended to the URLPrefix when issuing the bind call to the extender. If this method is implemented by the extender, it is the extender's responsibility to bind the pod via the API server. Only one extender can implement this function. |
enableHTTPS [Required]bool | EnableHTTPS specifies whether HTTPS should be used to communicate with the extender. |
tlsConfig [Required]ExtenderTLSConfig | TLSConfig specifies the transport layer security config. |
httpTimeout [Required]meta/v1.Duration | HTTPTimeout specifies the timeout duration for a call to the extender. A filter timeout fails the scheduling of the pod. A prioritize timeout is ignored; Kubernetes and other extenders' priorities are used to select the node. |
nodeCacheCapable [Required]bool | NodeCacheCapable specifies that the extender is capable of caching node information, so the scheduler should only send minimal information about the eligible nodes, assuming that the extender already cached full details of all nodes in the cluster. |
managedResources []ExtenderManagedResource | ManagedResources is a list of extended resources that are managed by this extender. |
ignorable [Required]bool | Ignorable specifies whether the extender is ignorable, i.e. scheduling should not fail when the extender returns an error or is not reachable. |
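A hedged sketch of registering an HTTP extender; the URL and verb names are placeholders, and only the verbs your extender actually implements should be set:
apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration
extenders:
- urlPrefix: "https://extender.example.com/scheduler"   # placeholder URL
  filterVerb: filter
  prioritizeVerb: prioritize
  weight: 1
  enableHTTPS: true
  ignorable: true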
ExtenderManagedResource
Appears in:
ExtenderManagedResource describes the arguments of extended resources managed by an extender.
Field | Description |
---|---|
name [Required]string | Name is the extended resource name. |
ignoredByScheduler [Required]bool | IgnoredByScheduler indicates whether kube-scheduler should ignore this resource when applying predicates. |
ExtenderTLSConfig
Appears in:
ExtenderTLSConfig contains settings to enable TLS with the extender.
Field | Description |
---|---|
insecure [Required]bool | Server should be accessed without verifying the TLS certificate. For testing only. |
serverName [Required]string | ServerName is passed to the server for SNI and is used in the client to check server certificates against. If ServerName is empty, the hostname used to contact the server is used. |
certFile [Required]string | Client certificate file to present when the server requires TLS client certificate authentication. |
keyFile [Required]string | Client key file to use when the server requires TLS client certificate authentication. |
caFile [Required]string | Trusted root certificates for the server. |
certData [Required][]byte | CertData holds PEM-encoded bytes (typically read from a client certificate file). CertData takes precedence over CertFile. |
keyData [Required][]byte | KeyData holds PEM-encoded bytes (typically read from a client certificate key file). KeyData takes precedence over KeyFile. |
caData [Required][]byte | CAData holds PEM-encoded bytes (typically read from a root certificates bundle). CAData takes precedence over CAFile. |
KubeSchedulerProfile
Appears in:
KubeSchedulerProfile is a scheduling profile.
Field | Description |
---|---|
schedulerName [Required]string | SchedulerName is the name of the scheduler associated with this profile. If SchedulerName matches the pod's "spec.schedulerName", then the pod is scheduled with this profile. |
plugins [Required]Plugins | Plugins specify the set of plugins that should be enabled or disabled. Enabled plugins are the ones that should be enabled in addition to the default plugins. Disabled plugins are any of the default plugins that should be disabled. When no enabled or disabled plugin is specified for an extension point, default plugins for that extension point are used, if any. If a QueueSort plugin is specified, the same QueueSort Plugin and PluginConfig must be specified for all profiles. |
pluginConfig [Required][]PluginConfig | PluginConfig is an optional set of custom plugin arguments for each plugin. Omitting config args for a plugin is equivalent to using the default config for that plugin. |
Plugin
Appears in:
Plugin specifies a plugin name and its weight when applicable. Weight is used only for Score plugins.
Field | Description |
---|---|
name [Required]string | Name defines the name of the plugin. |
weight [Required]int32 | Weight defines the weight of the plugin; only used for Score plugins. |
PluginConfig
Appears in:
PluginConfig specifies arguments that should be passed to a plugin at the time of initialization. A plugin that is invoked at multiple extension points is initialized once. Args can have arbitrary structure. It is up to the plugin to process these Args.
Field | Description |
---|---|
name [Required]string | Name defines the name of the plugin being configured. |
args [Required]k8s.io/apimachinery/pkg/runtime.RawExtension | Args defines the arguments passed to the plugins at the time of initialization. Args can have arbitrary structure. |
PluginSet
Appears in:
PluginSet specifies enabled and disabled plugins for an extension point. If an array is empty, missing, or nil, default plugins at that extension point will be used.
Field | Description |
---|---|
enabled [Required][]Plugin | Enabled specifies plugins that should be enabled in addition to default plugins. If a default plugin is also configured in the scheduler config file, the weight of the plugin is overridden accordingly. These are called after default plugins and in the same order specified here. |
disabled [Required][]Plugin | Disabled specifies default plugins that should be disabled. When all default plugins need to be disabled, an array containing only one "*" should be provided. |
Plugins
Appears in:
Plugins include multiple extension points. When specified, the list of plugins for a particular extension point are the only ones enabled. If an extension point is omitted from the config, then the default set of plugins is used for that extension point. Enabled plugins are called in the order specified here, after default plugins. If they need to be invoked before default plugins, default plugins must be disabled and re-enabled here in the desired order.
Field | Description |
---|---|
queueSort [Required]PluginSet | QueueSort is a list of plugins that should be invoked when sorting pods in the scheduling queue. |
preFilter [Required]PluginSet | PreFilter is a list of plugins that should be invoked at the "PreFilter" extension point of the scheduling framework. |
filter [Required]PluginSet | Filter is a list of plugins that should be invoked when filtering out nodes that cannot run the Pod. |
postFilter [Required]PluginSet | PostFilter is a list of plugins that are invoked after the filtering phase, but only when no feasible nodes were found for the pod. |
preScore [Required]PluginSet | PreScore is a list of plugins that are invoked before scoring. |
score [Required]PluginSet | Score is a list of plugins that should be invoked when ranking nodes that have passed the filtering phase. |
reserve [Required]PluginSet | Reserve is a list of plugins invoked when reserving/unreserving resources after a node is assigned to run the pod. |
permit [Required]PluginSet | Permit is a list of plugins that control the binding of a Pod. These plugins can prevent or delay the binding of a Pod. |
preBind [Required]PluginSet | PreBind is a list of plugins that should be invoked before a pod is bound. |
bind [Required]PluginSet | Bind is a list of plugins that should be invoked at the "Bind" extension point of the scheduling framework. The scheduler calls these plugins in order, and skips the rest as soon as one returns success. |
postBind [Required]PluginSet | PostBind is a list of plugins that should be invoked after a pod is successfully bound. |
multiPoint [Required]PluginSet | MultiPoint is a simplified config section to enable plugins for all valid extension points. Plugins enabled through MultiPoint will automatically register for every individual extension point the plugin has implemented. Disabling a plugin through MultiPoint disables that behavior. The same is true for disabling "*" through MultiPoint (no default plugins will be automatically registered). Plugins can still be disabled through their individual extension points. In terms of precedence, plugin config follows this basic hierarchy: 1. specific extension points; 2. explicitly configured MultiPoint plugins; 3. the set of default plugins, as MultiPoint plugins. |
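A sketch combining multiPoint enablement with a per-extension-point override; MyCustomPlugin is a hypothetical out-of-tree plugin name used only for illustration:
apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  plugins:
    multiPoint:
      enabled:
      - name: MyCustomPlugin   # hypothetical plugin
    score:
      disabled:
      - name: NodeResourcesBalancedAllocation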
PodTopologySpreadConstraintsDefaulting
(Alias of string)
Appears in:
PodTopologySpreadConstraintsDefaulting defines how to set default constraints for the PodTopologySpread plugin.
RequestedToCapacityRatioParam
Appears in:
RequestedToCapacityRatioParam defines the parameters for the RequestedToCapacityRatio strategy.
Field | Description |
---|---|
shape [Required][]UtilizationShapePoint | Shape is a list of points defining the scoring function shape. |
ResourceSpec
Appears in:
ResourceSpec represents a single resource.
Field | Description |
---|---|
name [Required]string | Name of the resource. |
weight [Required]int64 | Weight of the resource. |
ScoringStrategy
Appears in:
ScoringStrategy defines a ScoringStrategyType for the node resource plugin.
Field | Description |
---|---|
type [Required]ScoringStrategyType | Type selects which strategy to run. |
resources [Required][]ResourceSpec | Resources to consider when scoring. The default resource set includes "cpu" and "memory" with an equal weight. Allowed weights go from 1 to 100. Weight defaults to 1 if not specified or explicitly set to 0. |
requestedToCapacityRatio [Required]RequestedToCapacityRatioParam | Arguments specific to the RequestedToCapacityRatio strategy. |
ScoringStrategyType
(Alias of string)
Appears in:
ScoringStrategyType is the type of scoring strategy used in the NodeResourcesFit plugin.
UtilizationShapePoint
Appears in:
UtilizationShapePoint represents a single point of the priority function shape.
Field | Description |
---|---|
utilization [Required]int32 | Utilization (x axis). Valid values are 0 to 100. A fully utilized node maps to 100. |
score [Required]int32 | Score assigned to the given utilization (y axis). Valid values are 0 to 10. |
ClientConnectionConfiguration
Appears in:
ClientConnectionConfiguration contains details for constructing a client.
Field | Description |
---|---|
kubeconfig [Required]string | kubeconfig is the path to a KubeConfig file. |
acceptContentTypes [Required]string | acceptContentTypes defines the Accept header sent by clients when connecting to a server, overriding the default value of 'application/json'. This field will control all connections to the server used by a particular client. |
contentType [Required]string | contentType is the content type used when sending data to the server from this client. |
qps [Required]float32 | qps controls the number of queries per second allowed for this connection. |
burst [Required]int32 | burst allows extra queries to accumulate when a client is exceeding its rate. |
DebuggingConfiguration
Appears in:
DebuggingConfiguration holds configuration for debugging related features.
Field | Description |
---|---|
enableProfiling [Required]bool | enableProfiling enables profiling via the web interface host:port/debug/pprof/. |
enableContentionProfiling [Required]bool | enableContentionProfiling enables lock contention profiling, if enableProfiling is true. |
FormatOptions
Appears in:
FormatOptions contains options for the different logging formats.
Field | Description |
---|---|
json [Required]JSONOptions | [Experimental] JSON contains options for logging format "json". |
JSONOptions
Appears in:
JSONOptions contains options for logging format "json".
Field | Description |
---|---|
splitStream [Required]bool | [Experimental] SplitStream redirects error messages to stderr while info messages go to stdout, with buffering. The default is to write both to stdout, without buffering. |
infoBufferSize [Required]k8s.io/apimachinery/pkg/api/resource.QuantityValue | [Experimental] InfoBufferSize sets the size of the info stream when using split streams. The default is zero, which disables buffering. |
LeaderElectionConfiguration
Appears in:
LeaderElectionConfiguration defines the configuration of leader election clients for components that can run with leader election enabled.
Field | Description |
---|---|
leaderElect [Required]bool | leaderElect enables a leader election client to gain leadership before executing the main loop. Enable this when running replicated components for high availability. |
leaseDuration [Required]meta/v1.Duration | leaseDuration is the duration that non-leader candidates will wait after observing a leadership renewal until attempting to acquire leadership of a led but unrenewed leader slot. This is effectively the maximum duration that a leader can be stopped before it is replaced by another candidate. This is only applicable if leader election is enabled. |
renewDeadline [Required]meta/v1.Duration | renewDeadline is the interval between attempts by the acting master to renew a leadership slot before it stops leading. This must be less than or equal to the lease duration. This is only applicable if leader election is enabled. |
retryPeriod [Required]meta/v1.Duration | retryPeriod is the duration the clients should wait between attempting acquisition and renewal of a leadership. This is only applicable if leader election is enabled. |
resourceLock [Required]string | resourceLock indicates the resource object type that will be used to lock during leader election cycles. |
resourceName [Required]string | resourceName indicates the name of the resource object that will be used to lock during leader election cycles. |
resourceNamespace [Required]string | resourceNamespace indicates the namespace of the resource object that will be used to lock during leader election cycles. |
LoggingConfiguration
Appears in:
LoggingConfiguration contains logging options. Refer to Logs Options for more information.
Field | Description |
---|---|
format [Required]string | Format specifies the structure of log messages. The default value of format is "text". |
flushFrequency [Required]time.Duration | Maximum number of seconds between log flushes. Ignored if the selected logging backend writes log messages without buffering. |
verbosity [Required]uint32 | Verbosity is the threshold that determines which log messages are logged. The default is zero, which logs only the most important messages. Higher values enable additional messages. Error messages are always logged. |
vmodule [Required]VModuleConfiguration | VModule overrides the verbosity threshold for individual files. Only supported for the "text" log format. |
sanitization [Required]bool | [Experimental] When enabled, prevents logging of fields tagged as sensitive (passwords, keys, tokens). Runtime log sanitization may introduce significant computation overhead and therefore should not be enabled in production. |
options [Required]FormatOptions | [Experimental] Options holds additional parameters that are specific to the different logging formats. Only the options for the selected format get used, but all of them get validated. |
VModuleConfiguration
(Alias of []k8s.io/component-base/config/v1alpha1.VModuleItem)
Appears in:
VModuleConfiguration is a collection of individual file names or patterns and the corresponding verbosity threshold.
6.12.10 - kubeadm Configuration (v1beta2)
Overview
Package v1beta2 defines the v1beta2 version of the kubeadm configuration file format. This version improves on the v1beta1 format by fixing some minor issues and adding a few new fields.
A list of changes since v1beta1:
- "certificateKey" field is added to InitConfiguration and JoinConfiguration.
- "ignorePreflightErrors" field is added to the NodeRegistrationOptions.
- The JSON "omitempty" tag is used in more places where appropriate.
- The JSON "omitempty" tag of the "taints" field (inside NodeRegistrationOptions) is removed.
See the Kubernetes 1.15 changelog for further details.
Migration from old kubeadm config versions
Please convert your v1beta1 configuration files to v1beta2 using the "kubeadm config migrate" command of kubeadm v1.15.x. Conversion from older releases of kubeadm config files requires older releases of kubeadm as well:
- kubeadm v1.11 should be used to migrate v1alpha1 to v1alpha2;
- kubeadm v1.12 should be used to translate v1alpha2 to v1alpha3;
- kubeadm v1.13 or v1.14 should be used to translate v1alpha3 to v1beta1.
Nevertheless, kubeadm v1.15.x will support reading from the v1beta1 version of the kubeadm config file format.
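For example, a conversion run could look like the following; the file names are placeholders:
kubeadm config migrate --old-config old-v1beta1.yaml --new-config new-v1beta2.yaml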
Basics
The preferred way to configure kubeadm is to pass a YAML configuration file with the --config option. Some of the configuration options defined in the kubeadm config file are also available as command line flags, but only the most common/simple use cases are supported with this approach.
A kubeadm config file can contain multiple configuration types separated using three dashes (---).
kubeadm supports the following configuration types:
apiVersion: kubeadm.k8s.io/v1beta2
kind: InitConfiguration

apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration

apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration

apiVersion: kubeadm.k8s.io/v1beta2
kind: JoinConfiguration

To print the defaults for the "init" and "join" actions use the following commands:
kubeadm config print init-defaults
kubeadm config print join-defaults
The list of configuration types that must be included in a configuration file depends on the action you are performing (init or join) and on the configuration options you are going to use (defaults or advanced customization).
If some configuration types are not provided, or provided only partially, kubeadm will use default values; defaults provided by kubeadm also include enforcing consistency of values across components when required (e.g. the --cluster-cidr flag on the controller manager and clusterCIDR on kube-proxy).
Users are always allowed to override default values, with the only exception of a small subset of settings with relevance for security (e.g. enforcing the authorization modes Node and RBAC on the API server).
If the user provides a configuration type that is not expected for the action being performed, kubeadm will ignore that type and print a warning.
Kubeadm init configuration types
When executing kubeadm init with the --config option, the following configuration types can be used: InitConfiguration, ClusterConfiguration, KubeProxyConfiguration, KubeletConfiguration, but only one between InitConfiguration and ClusterConfiguration is mandatory.
apiVersion: kubeadm.k8s.io/v1beta2
kind: InitConfiguration
bootstrapTokens:
  ...
nodeRegistration:
  ...
The InitConfiguration type should be used to configure runtime settings, which in the case of kubeadm init are the configuration of the bootstrap token and all the settings that are specific to the node where kubeadm is executed, including:
- nodeRegistration, which holds fields relating to registering the new node to the cluster; use it to customize the node name, the CRI socket to use, or any other settings that should apply to this node only (e.g. the node IP).
- localAPIEndpoint, which represents the endpoint of the instance of the API server to be deployed on this node; use it e.g. to customize the API server advertise address.
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
networking:
  ...
etcd:
  ...
apiServer:
  extraArgs:
    ...
  extraVolumes:
    ...
...
The ClusterConfiguration type should be used to configure cluster-wide settings, including settings for:
- networking, which holds configuration for the networking topology of the cluster; use it e.g. to customize the pod subnet or services subnet.
- etcd; use it e.g. to customize the local etcd or to configure the API server for using an external etcd cluster.
- kube-apiserver, kube-scheduler, kube-controller-manager configurations; use it to customize control-plane components by adding customized settings or overriding kubeadm default settings.
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
...
The KubeProxyConfiguration type should be used to change the configuration passed to kube-proxy instances deployed in the cluster. If this object is not provided or provided only partially, kubeadm applies defaults.
See https://kubernetes.io/docs/reference/command-line-tools-reference/kube-proxy/ or https://godoc.org/k8s.io/kube-proxy/config/v1alpha1#KubeProxyConfiguration for the official kube-proxy documentation.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
...
The KubeletConfiguration type should be used to change the configurations that will be passed to all kubelet instances deployed in the cluster. If this object is not provided or provided only partially, kubeadm applies defaults.
See https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/ or https://godoc.org/k8s.io/kubelet/config/v1beta1#KubeletConfiguration for the official kubelet documentation.
Here is a fully populated example of a single YAML file containing multiple configuration types to be used during a kubeadm init run.
apiVersion: kubeadm.k8s.io/v1beta2
kind: InitConfiguration
bootstrapTokens:
- token: "9a08jv.c0izixklcxtmnze7"
  description: "kubeadm bootstrap token"
  ttl: "24h"
- token: "783bde.3f89s0fje9f38fhf"
  description: "another bootstrap token"
  usages:
  - authentication
  - signing
  groups:
  - system:bootstrappers:kubeadm:default-node-token
nodeRegistration:
  name: "ec2-10-100-0-1"
  criSocket: "/var/run/dockershim.sock"
  taints:
  - key: "kubeadmNode"
    value: "master"
    effect: "NoSchedule"
  kubeletExtraArgs:
    v: 4
  ignorePreflightErrors:
  - IsPrivilegedUser
localAPIEndpoint:
  advertiseAddress: "10.100.0.1"
  bindPort: 6443
certificateKey: "e6a2eb8581237ab72a4f494f30285ec12a9694d750b9785706a83bfcbbbd2204"
---
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
etcd:
  # one of local or external
  local:
    imageRepository: "k8s.gcr.io"
    imageTag: "3.2.24"
    dataDir: "/var/lib/etcd"
    extraArgs:
      listen-client-urls: "http://10.100.0.1:2379"
    serverCertSANs:
    - "ec2-10-100-0-1.compute-1.amazonaws.com"
    peerCertSANs:
    - "10.100.0.1"
  # external:
  #   endpoints:
  #   - "10.100.0.1:2379"
  #   - "10.100.0.2:2379"
  #   caFile: "/etcd/kubernetes/pki/etcd/etcd-ca.crt"
  #   certFile: "/etcd/kubernetes/pki/etcd/etcd.crt"
  #   keyFile: "/etcd/kubernetes/pki/etcd/etcd.key"
networking:
  serviceSubnet: "10.96.0.0/16"
  podSubnet: "10.244.0.0/24"
  dnsDomain: "cluster.local"
kubernetesVersion: "v1.12.0"
controlPlaneEndpoint: "10.100.0.1:6443"
apiServer:
  extraArgs:
    authorization-mode: "Node,RBAC"
  extraVolumes:
  - name: "some-volume"
    hostPath: "/etc/some-path"
    mountPath: "/etc/some-pod-path"
    readOnly: false
    pathType: File
  certSANs:
  - "10.100.1.1"
  - "ec2-10-100-0-1.compute-1.amazonaws.com"
  timeoutForControlPlane: 4m0s
controllerManager:
  extraArgs:
    "node-cidr-mask-size": "20"
  extraVolumes:
  - name: "some-volume"
    hostPath: "/etc/some-path"
    mountPath: "/etc/some-pod-path"
    readOnly: false
    pathType: File
scheduler:
  extraArgs:
    address: "10.100.0.1"
  extraVolumes:
  - name: "some-volume"
    hostPath: "/etc/some-path"
    mountPath: "/etc/some-pod-path"
    readOnly: false
    pathType: File
certificatesDir: "/etc/kubernetes/pki"
imageRepository: "k8s.gcr.io"
useHyperKubeImage: false
clusterName: "example-cluster"
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# kubelet specific options here
---
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
# kube-proxy specific options here
Kubeadm join configuration types
When executing kubeadm join with the --config option, the JoinConfiguration type should be provided.
apiVersion: kubeadm.k8s.io/v1beta2
kind: JoinConfiguration
...
The JoinConfiguration type should be used to configure runtime settings, which in the case of kubeadm join are the discovery method used for accessing the cluster info and all the settings that are specific to the node where kubeadm is executed, including:
- NodeRegistration, which holds fields relating to registering the new node to the cluster; use it to customize the node name, the CRI socket to use, or any other settings that should apply to this node only (e.g. the node IP).
- APIEndpoint, which represents the endpoint of the instance of the API server to be eventually deployed on this node.
Resource Types
ClusterConfiguration
ClusterConfiguration contains cluster-wide configuration for a kubeadm cluster.
Field | Description |
---|---|
apiVersion string | kubeadm.k8s.io/v1beta2 |
kind string | ClusterConfiguration |
etcd [Required]Etcd | |
networking [Required]Networking | |
kubernetesVersion [Required]string | |
controlPlaneEndpoint [Required]string | |
apiServer [Required]APIServer | |
controllerManager [Required]ControlPlaneComponent | |
scheduler [Required]ControlPlaneComponent | |
dns [Required]DNS | |
certificatesDir [Required]string | |
imageRepository [Required]string | |
useHyperKubeImage [Required]bool | |
featureGates [Required]map[string]bool | |
clusterName [Required]string | The cluster name. |
ClusterStatus
ClusterStatus contains the cluster status. The ClusterStatus will be stored in the kubeadm-config ConfigMap in the cluster, and then updated by kubeadm when additional control plane instances join or leave the cluster.
Field | Description |
---|---|
apiVersion string | kubeadm.k8s.io/v1beta2 |
kind string | ClusterStatus |
apiEndpoints [Required]map[string]github.com/tengqm/kubeconfig/config/kubeadm/v1beta2.APIEndpoint | |
InitConfiguration
InitConfiguration contains a list of elements that is specific to "kubeadm init"-only runtime information.
Field | Description |
---|---|
apiVersion string | kubeadm.k8s.io/v1beta2 |
kind string | InitConfiguration |
bootstrapTokens [Required][]BootstrapToken | |
nodeRegistration [Required]NodeRegistrationOptions | |
localAPIEndpoint [Required]APIEndpoint | |
certificateKey [Required]string | |
JoinConfiguration
JoinConfiguration contains elements describing a particular node.
Field | Description |
---|---|
apiVersion string | kubeadm.k8s.io/v1beta2 |
kind string | JoinConfiguration |
nodeRegistration [Required]NodeRegistrationOptions | |
caCertPath [Required]string | |
discovery [Required]Discovery | |
controlPlane [Required]JoinControlPlane | |
APIEndpoint
Appears in:
APIEndpoint struct contains elements of the API server instance deployed on a node.
Field | Description |
---|---|
advertiseAddress [Required]string | |
bindPort [Required]int32 | |
APIServer
Appears in:
APIServer holds settings necessary for API server deployments in the cluster.
Field | Description |
---|---|
ControlPlaneComponent [Required]ControlPlaneComponent | (Members of ControlPlaneComponent are embedded into this type.) No description provided. |
certSANs [Required][]string | |
timeoutForControlPlane [Required]meta/v1.Duration | |
BootstrapToken
Appears in:
BootstrapToken describes one bootstrap token, stored as a Secret in the cluster.
Field | Description |
---|---|
token [Required]BootstrapTokenString | |
description [Required]string | |
ttl [Required]meta/v1.Duration | |
expires [Required]meta/v1.Time | |
usages [Required][]string | |
groups [Required][]string | |
BootstrapTokenDiscovery
Appears in:
BootstrapTokenDiscovery is used to set the options for bootstrap-token-based discovery.
Field | Description |
---|---|
token [Required]string | |
apiServerEndpoint [Required]string | |
caCertHashes [Required][]string | |
unsafeSkipCAVerification [Required]bool | |
BootstrapTokenString
Appears in:
BootstrapTokenString is a token of the format abcdef.abcdef0123456789 that is used both to validate, from a joining node's point of view, that the API server is practically reachable, and as an authentication method for the node in the bootstrap phase of "kubeadm join". This token is and should be short-lived.
Field | Description |
---|---|
- [Required]string | No description provided. |
- [Required]string | No description provided. |
ControlPlaneComponent
Appears in:
ControlPlaneComponent holds settings common to control plane components of the cluster.
Field | Description |
---|---|
extraArgs [Required]map[string]string | |
extraVolumes [Required][]HostPathMount | |
DNS
Appears in:
DNS defines the DNS addon that should be used in the cluster.
Field | Description |
---|---|
type [Required]DNSAddOnType | |
ImageMeta [Required]ImageMeta | (Members of ImageMeta are embedded into this type.) ImageMeta allows customizing the image used for the DNS component. |
DNSAddOnType
(Alias of string)
Appears in:
DNSAddOnType defines a string identifying DNS add-on types.
Discovery
Appears in:
Discovery specifies the options for the kubelet to use during the TLS bootstrap process.
Field | Description |
---|---|
bootstrapToken [Required]BootstrapTokenDiscovery | |
file [Required]FileDiscovery | |
tlsBootstrapToken [Required]string | |
timeout [Required]meta/v1.Duration | |
Etcd
Appears in:
Etcd contains elements describing the etcd configuration.
Field | Description |
---|---|
local [Required]LocalEtcd | |
external [Required]ExternalEtcd | |
ExternalEtcd
Appears in:
ExternalEtcd describes an external etcd cluster. Kubeadm has no knowledge of where certificate files live; they must be supplied.
Field | Description |
---|---|
endpoints [Required][]string | |
caFile [Required]string | |
certFile [Required]string | |
keyFile [Required]string | |
FileDiscovery
Appears in:
FileDiscovery is used to specify a file or URL to a kubeconfig file from which to load cluster information.
Field | Description |
---|---|
kubeConfigPath [Required]string | |
HostPathMount
Appears in:
HostPathMount contains elements describing volumes that are mounted from the host.
Field | Description |
---|---|
name [Required]string | |
hostPath [Required]string | |
mountPath [Required]string | |
readOnly [Required]bool | |
pathType [Required]core/v1.HostPathType | |
ImageMeta
Appears in:
ImageMeta allows customizing the image used for components that do not originate from the Kubernetes release process.
Field | Description |
---|---|
imageRepository [Required]string | |
imageTag [Required]string | |
JoinControlPlane
Appears in:
JoinControlPlane contains elements describing an additional control plane instance to be deployed on the joining node.
Field | Description |
---|---|
localAPIEndpoint [Required]APIEndpoint | |
certificateKey [Required]string | |
LocalEtcd
Appears in:
LocalEtcd describes that kubeadm should run an etcd cluster locally.
Field | Description |
---|---|
ImageMeta [Required]ImageMeta | (Members of ImageMeta are embedded into this type.) ImageMeta allows customizing the container used for etcd. |
dataDir [Required]string | |
extraArgs [Required]map[string]string | |
serverCertSANs [Required][]string | |
peerCertSANs [Required][]string | |
Networking
Appears in:
Networking contains elements describing the cluster's networking configuration.
Field | Description |
---|---|
serviceSubnet [Required]string | |
podSubnet [Required]string | |
dnsDomain [Required]string | |
NodeRegistrationOptions
Appears in:
NodeRegistrationOptions holds fields that relate to registering a new control-plane or node to the cluster, either via "kubeadm init" or "kubeadm join".
Field | Description |
---|---|
name [Required]string | |
criSocket [Required]string | criSocket is used to retrieve container runtime information. This information will be annotated to the Node API object, for later re-use. |
taints [Required][]core/v1.Taint | |
kubeletExtraArgs [Required]map[string]string | |
ignorePreflightErrors [Required][]string | |
6.12.11 - kubeadm Configuration (v1beta3)
Overview
Package v1beta3 defines the v1beta3 version of the kubeadm configuration file format. This version improves on the v1beta2 format by fixing some minor issues and adding a few new fields.
A list of changes since v1beta2:
- The deprecated "ClusterConfiguration.useHyperKubeImage" field has been removed. Kubeadm no longer supports the hyperkube image.
- The "ClusterConfiguration.DNS.Type" field has been removed since CoreDNS is the only DNS server type supported by kubeadm.
- Include "datapolicy" tags on the fields that hold secrets. This results in the field values being omitted when API structures are printed with klog.
- Add "InitConfiguration.SkipPhases" and "JoinConfiguration.SkipPhases" to allow skipping a list of phases during kubeadm init/join command execution.
- Add "InitConfiguration.NodeRegistration.ImagePullPolicy" and "JoinConfiguration.NodeRegistration.ImagePullPolicy" to allow specifying the image pull policy during kubeadm "init" and "join". The value must be one of "Always", "Never" or "IfNotPresent". "IfNotPresent" is the default, which matches the existing behavior prior to this addition.
- Add "InitConfiguration.Patches.Directory" and "JoinConfiguration.Patches.Directory" to allow the user to configure a directory from which to take patches for components deployed by kubeadm.
- Move the BootstrapToken* API and related utilities out of the "kubeadm" API group to a new group, "bootstraptoken". The kubeadm API version v1beta3 no longer contains the BootstrapToken* structures.
Migration from old kubeadm config versions
- kubeadm v1.15.x and newer can be used to migrate from v1beta1 to v1beta2.
- kubeadm v1.22.x and newer no longer support v1beta1 and older APIs, but can be used to migrate v1beta2 to v1beta3.
Basics
The preferred way to configure kubeadm is to pass a YAML configuration file with the --config option. Some of the configuration options defined in the kubeadm config file are also available as command line flags, but only the most common/simple use cases are supported with this approach.
A kubeadm config file can contain multiple configuration types separated using three dashes (---).
kubeadm supports the following configuration types:
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration

apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration

apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration

apiVersion: kubeadm.k8s.io/v1beta3
kind: JoinConfiguration

To print the defaults for the "init" and "join" actions use the following commands:
kubeadm config print init-defaults
kubeadm config print join-defaults
The list of configuration types that must be included in a configuration file depends on the action you are performing (init or join) and on the configuration options you are going to use (defaults or advanced customization).
If some configuration types are not provided, or provided only partially, kubeadm will use default values; defaults provided by kubeadm also include enforcing consistency of values across components when required (e.g. the --cluster-cidr flag on the controller manager and clusterCIDR on kube-proxy).
Users are always allowed to override default values, with the only exception of a small subset of settings with relevance for security (e.g. enforcing the authorization modes Node and RBAC on the API server).
If the user provides a configuration type that is not expected for the action being performed, kubeadm will ignore that type and print a warning.
Kubeadm init configuration types
When executing kubeadm init with the --config option, the following configuration types can be used: InitConfiguration, ClusterConfiguration, KubeProxyConfiguration, KubeletConfiguration, but only one between InitConfiguration and ClusterConfiguration is mandatory.
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
bootstrapTokens:
  ...
nodeRegistration:
  ...
The InitConfiguration type should be used to configure runtime settings, which in the case of kubeadm init are the configuration of the bootstrap token and all the settings that are specific to the node where kubeadm is executed, including:
- NodeRegistration, which holds fields relating to registering the new node to the cluster; use it to customize the node name, the CRI socket to use, or any other settings that should apply to this node only (e.g. the node IP).
- LocalAPIEndpoint, which represents the endpoint of the instance of the API server to be deployed on this node; use it e.g. to customize the API server advertise address.
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
networking:
  ...
etcd:
  ...
apiServer:
  extraArgs:
    ...
  extraVolumes:
    ...
...
The ClusterConfiguration type should be used to configure cluster-wide settings, including settings for:
- networking, which holds configuration for the networking topology of the cluster; use it e.g. to customize the Pod subnet or services subnet.
- etcd; use it e.g. to customize the local etcd or to configure the API server for using an external etcd cluster.
- kube-apiserver, kube-scheduler, kube-controller-manager configurations; use it to customize control-plane components by adding customized settings or overriding kubeadm default settings.
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
...
The KubeProxyConfiguration type should be used to change the configuration passed to kube-proxy instances deployed in the cluster. If this object is not provided or provided only partially, kubeadm applies defaults.
See https://kubernetes.io/docs/reference/command-line-tools-reference/kube-proxy/ or https://godoc.org/k8s.io/kube-proxy/config/v1alpha1#KubeProxyConfiguration for the official kube-proxy documentation.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
...
The KubeletConfiguration type should be used to change the configurations that will be passed to all kubelet instances deployed in the cluster. If this object is not provided or provided only partially, kubeadm applies defaults.
See https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/ or https://godoc.org/k8s.io/kubelet/config/v1beta1#KubeletConfiguration for the official kubelet documentation.
Here is a fully populated example of a single YAML file containing multiple configuration types to be used during a kubeadm init run.
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
bootstrapTokens:
- token: "9a08jv.c0izixklcxtmnze7"
  description: "kubeadm bootstrap token"
  ttl: "24h"
- token: "783bde.3f89s0fje9f38fhf"
  description: "another bootstrap token"
  usages:
  - authentication
  - signing
  groups:
  - system:bootstrappers:kubeadm:default-node-token
nodeRegistration:
  name: "ec2-10-100-0-1"
  criSocket: "/var/run/dockershim.sock"
  taints:
  - key: "kubeadmNode"
    value: "master"
    effect: "NoSchedule"
  kubeletExtraArgs:
    v: 4
  ignorePreflightErrors:
  - IsPrivilegedUser
  imagePullPolicy: "IfNotPresent"
localAPIEndpoint:
  advertiseAddress: "10.100.0.1"
  bindPort: 6443
certificateKey: "e6a2eb8581237ab72a4f494f30285ec12a9694d750b9785706a83bfcbbbd2204"
skipPhases:
- addon/kube-proxy
---
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
etcd:
  # one of local or external
  local:
    imageRepository: "k8s.gcr.io"
    imageTag: "3.2.24"
    dataDir: "/var/lib/etcd"
    extraArgs:
      listen-client-urls: "http://10.100.0.1:2379"
    serverCertSANs:
    - "ec2-10-100-0-1.compute-1.amazonaws.com"
    peerCertSANs:
    - "10.100.0.1"
  # external:
  #   endpoints:
  #   - "10.100.0.1:2379"
  #   - "10.100.0.2:2379"
  #   caFile: "/etcd/kubernetes/pki/etcd/etcd-ca.crt"
  #   certFile: "/etcd/kubernetes/pki/etcd/etcd.crt"
  #   keyFile: "/etcd/kubernetes/pki/etcd/etcd.key"
networking:
  serviceSubnet: "10.96.0.0/16"
  podSubnet: "10.244.0.0/24"
  dnsDomain: "cluster.local"
kubernetesVersion: "v1.21.0"
controlPlaneEndpoint: "10.100.0.1:6443"
apiServer:
  extraArgs:
    authorization-mode: "Node,RBAC"
  extraVolumes:
  - name: "some-volume"
    hostPath: "/etc/some-path"
    mountPath: "/etc/some-pod-path"
    readOnly: false
    pathType: File
  certSANs:
  - "10.100.1.1"
  - "ec2-10-100-0-1.compute-1.amazonaws.com"
  timeoutForControlPlane: 4m0s
controllerManager:
  extraArgs:
    "node-cidr-mask-size": "20"
  extraVolumes:
  - name: "some-volume"
    hostPath: "/etc/some-path"
    mountPath: "/etc/some-pod-path"
    readOnly: false
    pathType: File
scheduler:
  extraArgs:
    address: "10.100.0.1"
  extraVolumes:
  - name: "some-volume"
    hostPath: "/etc/some-path"
    mountPath: "/etc/some-pod-path"
    readOnly: false
    pathType: File
certificatesDir: "/etc/kubernetes/pki"
imageRepository: "k8s.gcr.io"
clusterName: "example-cluster"
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# kubelet specific options here
---
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
# kube-proxy specific options here
Kubeadm join configuration types
When executing kubeadm join with the --config option, the JoinConfiguration type should be provided.
```yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: JoinConfiguration
...
```
The JoinConfiguration type should be used to configure runtime settings, which in the case of kubeadm join are the discovery method used for accessing the cluster info and all the settings that are specific to the node where kubeadm is executed (see the minimal sketch after this list), including:

- nodeRegistration, which holds fields that relate to registering the new node to the cluster; use it to customize the node name, the CRI socket to use, or any other settings that should apply to this node only (e.g. the node IP).
- apiEndpoint, which represents the endpoint of the instance of the API server to be eventually deployed on this node.
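As a minimal sketch (all values below are illustrative placeholders), a JoinConfiguration might combine token-based discovery with node-specific registration settings:

```yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: JoinConfiguration
discovery:                                # required: how to discover the cluster
  bootstrapToken:
    token: "abcdef.0123456789abcdef"      # illustrative bootstrap token
    apiServerEndpoint: "10.100.0.1:6443"  # illustrative control-plane endpoint
    unsafeSkipCAVerification: true        # fine for a sketch; pin caCertHashes in practice
nodeRegistration:
  name: "worker-0"                        # illustrative node name
  kubeletExtraArgs:
    node-ip: "10.100.0.2"                 # illustrative node IP
```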
Resource Types
BootstrapToken
Appears in: InitConfiguration

BootstrapToken describes one bootstrap token, stored as a Secret in the cluster.

Field | Description |
---|---|
token [Required]BootstrapTokenString | No description provided. |
description string | No description provided. |
ttl meta/v1.Duration | No description provided. |
expires meta/v1.Time | No description provided. |
usages []string | No description provided. |
groups []string | No description provided. |
BootstrapTokenString
Appears in: BootstrapToken

BootstrapTokenString is a token of the format abcdef.abcdef0123456789 that is used both to validate the API server from a joining node's point of view and as an authentication method for the node in the bootstrap phase of kubeadm join. This token is, and should be, short-lived.
Field | Description |
---|---|
- [Required]string | No description provided. |
- [Required]string | No description provided. |
ClusterConfiguration
ClusterConfiguration contains cluster-wide configuration for a kubeadm cluster.

Field | Description |
---|---|
apiVersion string | kubeadm.k8s.io/v1beta3 |
kind string | ClusterConfiguration |
etcd Etcd | No description provided. |
networking Networking | No description provided. |
kubernetesVersion string | No description provided. |
controlPlaneEndpoint string | No description provided. |
apiServer APIServer | No description provided. |
controllerManager ControlPlaneComponent | No description provided. |
scheduler ControlPlaneComponent | No description provided. |
dns DNS | No description provided. |
certificatesDir string | No description provided. |
imageRepository string | No description provided. |
featureGates map[string]bool | No description provided. |
clusterName string | The cluster name. |
InitConfiguration
InitConfiguration contains "kubeadm init"-only runtime information. These fields are solely used the first time kubeadm init runs. After that, the information in the fields is NOT uploaded to the kubeadm-config ConfigMap that is used, for instance, by kubeadm upgrade. These fields must be omitempty.

Field | Description |
---|---|
apiVersion string | kubeadm.k8s.io/v1beta3 |
kind string | InitConfiguration |
bootstrapTokens []BootstrapToken | No description provided. |
nodeRegistration NodeRegistrationOptions | No description provided. |
localAPIEndpoint APIEndpoint | No description provided. |
certificateKey string | No description provided. |
skipPhases []string | No description provided. |
patches Patches | No description provided. |
JoinConfiguration
JoinConfiguration contains elements describing a particular node.
Field | Description |
---|---|
apiVersion string | kubeadm.k8s.io/v1beta3 |
kind string | JoinConfiguration |
nodeRegistration NodeRegistrationOptions | No description provided. |
caCertPath string | No description provided. |
discovery [Required]Discovery | No description provided. |
controlPlane JoinControlPlane | No description provided. |
skipPhases []string | No description provided. |
patches Patches | No description provided. |
APIEndpoint
Appears in: InitConfiguration, JoinControlPlane

APIEndpoint contains the elements of an API server instance deployed on a node.

Field | Description |
---|---|
advertiseAddress string | No description provided. |
bindPort int32 | No description provided. |
APIServer
Appears in: ClusterConfiguration

APIServer holds settings necessary for API server deployments in the cluster.

Field | Description |
---|---|
ControlPlaneComponent [Required]ControlPlaneComponent | (Members of ControlPlaneComponent are embedded into this type.) No description provided. |
certSANs []string | No description provided. |
timeoutForControlPlane meta/v1.Duration | No description provided. |
BootstrapTokenDiscovery
Appears in: Discovery

BootstrapTokenDiscovery is used to set the options for bootstrap-token-based discovery.

Field | Description |
---|---|
token [Required]string | No description provided. |
apiServerEndpoint string | No description provided. |
caCertHashes []string | No description provided. |
unsafeSkipCAVerification bool | No description provided. |
ControlPlaneComponent
Appears in: APIServer, ClusterConfiguration

ControlPlaneComponent holds settings common to control plane components of the cluster.

Field | Description |
---|---|
extraArgs map[string]string | No description provided. |
extraVolumes []HostPathMount | No description provided. |
DNS
Appears in: ClusterConfiguration

DNS defines the DNS addon that should be used in the cluster.

Field | Description |
---|---|
ImageMeta [Required]ImageMeta | (Members of ImageMeta are embedded into this type.) No description provided. |
Discovery
Appears in: JoinConfiguration

Discovery specifies the options for the kubelet to use during the TLS Bootstrap process.

Field | Description |
---|---|
bootstrapToken BootstrapTokenDiscovery | No description provided. |
file FileDiscovery | No description provided. |
tlsBootstrapToken string | No description provided. |
timeout meta/v1.Duration | No description provided. |
Etcd
Appears in: ClusterConfiguration

Etcd contains elements describing the etcd configuration.

Field | Description |
---|---|
local LocalEtcd | No description provided. |
external ExternalEtcd | No description provided. |
ExternalEtcd
Appears in: Etcd

ExternalEtcd describes an external etcd cluster. Kubeadm has no knowledge of where the certificate files live; they must be supplied.

Field | Description |
---|---|
endpoints [Required][]string | No description provided. |
caFile [Required]string | No description provided. |
certFile [Required]string | No description provided. |
keyFile [Required]string | No description provided. |
FileDiscovery
Appears in: Discovery

FileDiscovery is used to specify a file or URL to a kubeconfig file from which to load cluster information.

Field | Description |
---|---|
kubeConfigPath [Required]string | No description provided. |
HostPathMount
Appears in: ControlPlaneComponent

HostPathMount contains elements describing volumes that are mounted from the host.

Field | Description |
---|---|
name [Required]string | No description provided. |
hostPath [Required]string | No description provided. |
mountPath [Required]string | No description provided. |
readOnly bool | No description provided. |
pathType core/v1.HostPathType | No description provided. |
ImageMeta
Appears in: DNS, LocalEtcd

ImageMeta allows customizing the image used for components that do not originate from the kubernetes/kubernetes release process.

Field | Description |
---|---|
imageRepository string | No description provided. |
imageTag string | No description provided. |
JoinControlPlane
Appears in: JoinConfiguration

JoinControlPlane contains elements describing an additional control plane instance to be deployed on the joining node.

Field | Description |
---|---|
localAPIEndpoint APIEndpoint | No description provided. |
certificateKey string | No description provided. |
LocalEtcd
Appears in: Etcd

LocalEtcd describes that kubeadm should run an etcd cluster locally.

Field | Description |
---|---|
ImageMeta [Required]ImageMeta | (Members of ImageMeta are embedded into this type.) ImageMeta allows customizing the container used for etcd. |
dataDir [Required]string | No description provided. |
extraArgs map[string]string | No description provided. |
serverCertSANs []string | No description provided. |
peerCertSANs []string | No description provided. |
Networking
Appears in: ClusterConfiguration

Networking contains elements describing the cluster's networking configuration.

Field | Description |
---|---|
serviceSubnet string | No description provided. |
podSubnet string | No description provided. |
dnsDomain string | No description provided. |
NodeRegistrationOptions
Appears in: InitConfiguration, JoinConfiguration

NodeRegistrationOptions holds fields that relate to registering a new control-plane or node to the cluster, either via "kubeadm init" or "kubeadm join".

Field | Description |
---|---|
name string | No description provided. |
criSocket string | No description provided. |
taints [Required][]core/v1.Taint | No description provided. |
kubeletExtraArgs map[string]string | No description provided. |
ignorePreflightErrors []string | No description provided. |
imagePullPolicy core/v1.PullPolicy | No description provided. |
Patches
Appears in: InitConfiguration, JoinConfiguration

Patches contains options related to applying patches to components deployed by kubeadm.

Field | Description |
---|---|
directory string | No description provided. |
6.12.12 - Kubelet Configuration (v1alpha1)
Resource Types
CredentialProviderConfig
CredentialProviderConfig is the configuration containing information about each exec credential provider. Kubelet reads this configuration from disk and enables each provider as specified by the CredentialProvider type.
Field | Description |
---|---|
apiVersion string | kubelet.config.k8s.io/v1alpha1 |
kind string | CredentialProviderConfig |
providers [Required][]CredentialProvider | providers is a list of credential provider plugins that will be enabled by the kubelet. Multiple providers may match against a single image, in which case credentials from all providers will be returned to the kubelet. If multiple providers are called for a single image, the results are combined. If providers return overlapping auth keys, the value from the provider earlier in this list is used. |
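As a minimal sketch (the provider name, image pattern, and argument are illustrative placeholders, not a specific plugin's real interface), a CredentialProviderConfig enabling one exec plugin might look like this:

```yaml
apiVersion: kubelet.config.k8s.io/v1alpha1
kind: CredentialProviderConfig
providers:
  - name: example-credential-provider   # must match the plugin binary name (illustrative)
    matchImages:
      - "*.registry.example.com"        # illustrative registry pattern
    defaultCacheDuration: "12h"
    apiVersion: credentialprovider.kubelet.k8s.io/v1alpha1
    args:
      - get-credentials                 # illustrative plugin argument
```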
CredentialProvider
Appears in: CredentialProviderConfig

CredentialProvider represents an exec plugin to be invoked by the kubelet. The plugin is only invoked when an image being pulled matches the images handled by the plugin (see matchImages).

Field | Description |
---|---|
name [Required]string | name is the required name of the credential provider. It must match the name of the provider executable as seen by the kubelet. The executable must be in the kubelet's bin directory (set by the --image-credential-provider-bin-dir flag). |
matchImages [Required][]string | matchImages is a required list of strings used to match against images in order to determine if this provider should be invoked. If one of the strings matches the requested image from the kubelet, the plugin will be invoked and given a chance to provide credentials. Images are expected to contain the registry domain and URL path. Each entry in matchImages is a pattern which can optionally contain a port and a path. Globs can be used in the domain, but not in the port or the path. Globs are supported as subdomains like '*.k8s.io' or 'k8s.*.io', and top-level domains such as 'k8s.*'. Matching partial subdomains like 'app*.k8s.io' is also supported. Each glob can only match a single subdomain segment, so *.io does not match *.k8s.io. A match exists between an image and a matchImages entry when all of the below are true: both contain the same number of domain parts and each part matches; the URL path of the entry must be a prefix of the target image URL path; if the entry contains a port, then the port must match in the image as well. Example values of matchImages: 123456789.dkr.ecr.us-east-1.amazonaws.com, *.azurecr.io, gcr.io, *.*.registry.io, registry.io:8080/path. |
defaultCacheDuration [Required]meta/v1.Duration | defaultCacheDuration is the default duration the plugin will cache credentials in-memory if a cache duration is not provided in the plugin response. This field is required. |
apiVersion [Required]string | Required input version of the exec CredentialProviderRequest. The returned CredentialProviderResponse MUST use the same encoding version as the input. Currently supported values are: credentialprovider.kubelet.k8s.io/v1alpha1. |
args []string | Arguments to pass to the command when executing it. |
env []ExecEnvVar | Env defines additional environment variables to expose to the process. These are unioned with the host's environment, as well as variables client-go uses to pass arguments to the plugin. |
ExecEnvVar
Appears in: CredentialProvider

ExecEnvVar is used for setting environment variables when executing an exec-based credential plugin.

Field | Description |
---|---|
name [Required]string | No description provided. |
value [Required]string | No description provided. |
FormatOptions
Appears in: LoggingConfiguration

FormatOptions contains options for the different logging formats.

Field | Description |
---|---|
json [Required]JSONOptions | [Experimental] JSON contains options for logging format "json". |
JSONOptions
Appears in: FormatOptions

JSONOptions contains options for logging format "json".

Field | Description |
---|---|
splitStream [Required]bool | [Experimental] SplitStream redirects error messages to stderr while info messages go to stdout, with buffering. The default is to write both to stdout, without buffering. |
infoBufferSize [Required]k8s.io/apimachinery/pkg/api/resource.QuantityValue | [Experimental] InfoBufferSize sets the size of the info stream when using split streams. The default is zero, which disables buffering. |
VModuleConfiguration
(Alias of []k8s.io/component-base/config/v1alpha1.VModuleItem)

Appears in: LoggingConfiguration
VModuleConfiguration is a collection of individual file names or patterns and the corresponding verbosity threshold.
6.12.13 - Kubelet Configuration (v1beta1)
Resource Types
KubeletConfiguration
KubeletConfiguration contains the configuration for the Kubelet
Field | Description |
---|---|
apiVersion string | kubelet.config.k8s.io/v1beta1 |
kind string | KubeletConfiguration |
enableServer [Required]bool | enableServer enables Kubelet's secured server. Note: Kubelet's insecure port is controlled by the readOnlyPort option. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that it may disrupt components that interact with the Kubelet server. Default: true |
staticPodPath string | staticPodPath is the path to the directory containing local (static) pods to run, or the path to a single static pod file. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that the set of static pods specified at the new path may be different than the ones the Kubelet initially started with, and this may disrupt your node. Default: "" |
syncFrequency meta/v1.Duration | syncFrequency is the max period between synchronizing running containers and config. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that shortening this duration may have a negative performance impact, especially as the number of Pods on the node increases. Alternatively, increasing this duration will result in longer refresh times for ConfigMaps and Secrets. Default: "1m" |
fileCheckFrequency meta/v1.Duration | fileCheckFrequency is the duration between checking config files for new data. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that shortening the duration will cause the Kubelet to reload local Static Pod configurations more frequently, which may have a negative performance impact. Default: "20s" |
httpCheckFrequency meta/v1.Duration | httpCheckFrequency is the duration between checking http for new data. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that shortening the duration will cause the Kubelet to poll staticPodURL more frequently, which may have a negative performance impact. Default: "20s" |
staticPodURL string | staticPodURL is the URL for accessing static pods to run. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that the set of static pods specified at the new URL may be different than the ones the Kubelet initially started with, and this may disrupt your node. Default: "" |
staticPodURLHeader map[string][]string | staticPodURLHeader is a map of slices with HTTP headers to use when accessing the podURL. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that it may disrupt the ability to read the latest set of static pods from StaticPodURL. Default: nil |
address string | address is the IP address for the Kubelet to serve on (set to 0.0.0.0 for all interfaces). If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that it may disrupt components that interact with the Kubelet server. Default: "0.0.0.0" |
port int32 | port is the port for the Kubelet to serve on. The port number must be between 1 and 65535, inclusive. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that it may disrupt components that interact with the Kubelet server. Default: 10250 |
readOnlyPort int32 | readOnlyPort is the read-only port for the Kubelet to serve on with no authentication/authorization. The port number must be between 1 and 65535, inclusive. Setting this field to 0 disables the read-only service. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that it may disrupt components that interact with the Kubelet server. Default: 0 (disabled) |
tlsCertFile string | tlsCertFile is the file containing x509 Certificate for HTTPS (CA cert, if any, concatenated after server cert). If tlsCertFile and tlsPrivateKeyFile are not provided, a self-signed certificate and key are generated for the public address and saved to the directory passed to the Kubelet's --cert-dir flag. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that it may disrupt components that interact with the Kubelet server. Default: "" |
tlsPrivateKeyFile string | tlsPrivateKeyFile is the file containing the x509 private key matching tlsCertFile. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that it may disrupt components that interact with the Kubelet server. Default: "" |
tlsCipherSuites []string | tlsCipherSuites is the list of allowed cipher suites for the server. Values are from tls package constants (https://golang.org/pkg/crypto/tls/#pkg-constants). If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that it may disrupt components that interact with the Kubelet server. Default: nil |
tlsMinVersion string | tlsMinVersion is the minimum TLS version supported. Values are from tls package constants (https://golang.org/pkg/crypto/tls/#pkg-constants). If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that it may disrupt components that interact with the Kubelet server. Default: "" |
rotateCertificates bool | rotateCertificates enables client certificate rotation. The Kubelet will request a new certificate from the certificates.k8s.io API. This requires an approver to approve the certificate signing requests. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that disabling it may disrupt the Kubelet's ability to authenticate with the API server after the current certificate expires. Default: false |
serverTLSBootstrap bool | serverTLSBootstrap enables server certificate bootstrap. Instead of self signing a serving certificate, the Kubelet will request a certificate from the 'certificates.k8s.io' API. This requires an approver to approve the certificate signing requests (CSR). The RotateKubeletServerCertificate feature must be enabled when setting this field. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that disabling it will stop the renewal of Kubelet server certificates, which can disrupt components that interact with the Kubelet server in the long term, due to certificate expiration. Default: false |
authentication KubeletAuthentication | authentication specifies how requests to the Kubelet's server are authenticated. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that it may disrupt components that interact with the Kubelet server. Defaults: anonymous: enabled: false webhook: enabled: true cacheTTL: "2m" |
authorization KubeletAuthorization | authorization specifies how requests to the Kubelet's server are authorized. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that it may disrupt components that interact with the Kubelet server. Defaults: mode: Webhook webhook: cacheAuthorizedTTL: "5m" cacheUnauthorizedTTL: "30s" |
registryPullQPS int32 | registryPullQPS is the limit of registry pulls per second. The value must not be a negative number. Setting it to 0 means no limit. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that it may impact scalability by changing the amount of traffic produced by image pulls. Default: 5 |
registryBurst int32 | registryBurst is the maximum size of bursty pulls, temporarily allowing pulls to burst to this number, while still not exceeding registryPullQPS. The value must not be a negative number. Only used if registryPullQPS is greater than 0. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that it may impact scalability by changing the amount of traffic produced by image pulls. Default: 10 |
eventRecordQPS int32 | eventRecordQPS is the maximum event creations per second. If 0, there is no limit enforced. The value cannot be a negative number. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that it may impact scalability by changing the amount of traffic produced by event creations. Default: 5 |
eventBurst int32 | eventBurst is the maximum size of a burst of event creations, temporarily allowing event creations to burst to this number, while still not exceeding eventRecordQPS. This field cannot be a negative number and it is only used when eventRecordQPS > 0. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that it may impact scalability by changing the amount of traffic produced by event creations. Default: 10 |
enableDebuggingHandlers bool | enableDebuggingHandlers enables server endpoints for log access and local running of containers and commands, including the exec, attach, logs, and portforward features. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that disabling it may disrupt components that interact with the Kubelet server. Default: true |
enableContentionProfiling bool | enableContentionProfiling enables lock contention profiling, if enableDebuggingHandlers is true. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that enabling it may carry a performance impact. Default: false |
healthzPort int32 | healthzPort is the port of the localhost healthz endpoint (set to 0 to disable). A valid number is between 1 and 65535. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that it may disrupt components that monitor Kubelet health. Default: 10248 |
healthzBindAddress string | healthzBindAddress is the IP address for the healthz server to serve on. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that it may disrupt components that monitor Kubelet health. Default: "127.0.0.1" |
oomScoreAdj int32 | oomScoreAdj is the oom-score-adj value for the kubelet process. Values must be within the range [-1000, 1000]. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that it may impact the stability of nodes under memory pressure. Default: -999 |
clusterDomain string | clusterDomain is the DNS domain for this cluster. If set, kubelet will configure all containers to search this domain in addition to the host's search domains. Dynamic Kubelet Config (deprecated): Dynamically updating this field is not recommended, as it should be kept in sync with the rest of the cluster. Default: "" |
clusterDNS []string | clusterDNS is a list of IP addresses for the cluster DNS server. If set, kubelet will configure all containers to use this for DNS resolution instead of the host's DNS servers. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that changes will only take effect on Pods created after the update. Draining the node is recommended before changing this field. Default: nil |
streamingConnectionIdleTimeout meta/v1.Duration | streamingConnectionIdleTimeout is the maximum time a streaming connection can be idle before the connection is automatically closed. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that it may impact components that rely on infrequent updates over streaming connections to the Kubelet server. Default: "4h" |
nodeStatusUpdateFrequency meta/v1.Duration | nodeStatusUpdateFrequency is the frequency that kubelet computes node status. If the node lease feature is not enabled, it is also the frequency that kubelet posts node status to the control plane. Note: When the node lease feature is not enabled, be cautious when changing this constant; it must work with nodeMonitorGracePeriod in the node controller. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that it may impact node scalability, and also that the node controller's nodeMonitorGracePeriod must be set to N*NodeStatusUpdateFrequency, where N is the number of retries before the node controller marks the node unhealthy. Default: "10s" |
nodeStatusReportFrequency meta/v1.Duration | nodeStatusReportFrequency is the frequency that kubelet posts node status to the control plane if node status does not change. Kubelet will ignore this frequency and post node status immediately if any change is detected. It is only used when the node lease feature is enabled. nodeStatusReportFrequency's default value is 5m. But if nodeStatusUpdateFrequency is set explicitly, nodeStatusReportFrequency's default value will be set to nodeStatusUpdateFrequency for backward compatibility. Default: "5m" |
nodeLeaseDurationSeconds int32 | nodeLeaseDurationSeconds is the duration the Kubelet will set on its corresponding Lease. NodeLease provides an indicator of node health by having the Kubelet create and periodically renew a lease, named after the node, in the kube-node-lease namespace. If the lease expires, the node can be considered unhealthy. The lease is currently renewed every 10s, per KEP-0009. In the future, the lease renewal interval may be set based on the lease duration. The field value must be greater than 0. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that decreasing the duration may reduce tolerance for issues that temporarily prevent the Kubelet from renewing the lease (e.g. a short-lived network issue). Default: 40 |
imageMinimumGCAge meta/v1.Duration | imageMinimumGCAge is the minimum age for an unused image before it is garbage collected. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that it may trigger or delay garbage collection, and may change the image overhead on the node. Default: "2m" |
imageGCHighThresholdPercent int32 | imageGCHighThresholdPercent is the percent of disk usage after which image garbage collection is always run. The percent is calculated by dividing this field value by 100, so this field must be between 0 and 100, inclusive. When specified, the value must be greater than imageGCLowThresholdPercent. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that it may trigger or delay garbage collection, and may change the image overhead on the node. Default: 85 |
imageGCLowThresholdPercent int32 | imageGCLowThresholdPercent is the percent of disk usage before which image garbage collection is never run. Lowest disk usage to garbage collect to. The percent is calculated by dividing this field value by 100, so the field value must be between 0 and 100, inclusive. When specified, the value must be less than imageGCHighThresholdPercent. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that it may trigger or delay garbage collection, and may change the image overhead on the node. Default: 80 |
volumeStatsAggPeriod meta/v1.Duration | volumeStatsAggPeriod is the frequency for calculating and caching volume disk usage for all pods. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that shortening the period may carry a performance impact. Default: "1m" |
kubeletCgroups string | kubeletCgroups is the absolute name of the cgroups to isolate the kubelet in. Dynamic Kubelet Config (deprecated): This field should not be updated without a full node reboot. It is safest to keep this value the same as the local config. Default: "" |
systemCgroups string | systemCgroups is the absolute name of the cgroups in which to place all non-kernel processes that are not already in a container. Empty for no container. Rolling back the flag requires a reboot. The cgroupRoot must be specified if this field is not empty. Dynamic Kubelet Config (deprecated): This field should not be updated without a full node reboot. It is safest to keep this value the same as the local config. Default: "" |
cgroupRoot string | cgroupRoot is the root cgroup to use for pods. This is handled by the container runtime on a best effort basis. Dynamic Kubelet Config (deprecated): This field should not be updated without a full node reboot. It is safest to keep this value the same as the local config. Default: "" |
cgroupsPerQOS bool | cgroupsPerQOS enables a QoS-based CGroup hierarchy: top-level CGroups for QoS classes, and all Burstable and BestEffort Pods are brought up under their specific top-level QoS CGroup. Dynamic Kubelet Config (deprecated): This field should not be updated without a full node reboot. It is safest to keep this value the same as the local config. Default: true |
cgroupDriver string | cgroupDriver is the driver kubelet uses to manipulate CGroups on the host (cgroupfs or systemd). Dynamic Kubelet Config (deprecated): This field should not be updated without a full node reboot. It is safest to keep this value the same as the local config. Default: "cgroupfs" |
cpuManagerPolicy string | cpuManagerPolicy is the name of the policy to use. Requires the CPUManager feature gate to be enabled. Dynamic Kubelet Config (deprecated): This field should not be updated without a full node reboot. It is safest to keep this value the same as the local config. Default: "None" |
cpuManagerPolicyOptions map[string]string | cpuManagerPolicyOptions is a set of key=value pairs that allows setting extra options to fine-tune the behaviour of the CPU manager policies. Requires both the "CPUManager" and "CPUManagerPolicyOptions" feature gates to be enabled. Dynamic Kubelet Config (beta): This field should not be updated without a full node reboot. It is safest to keep this value the same as the local config. Default: nil |
cpuManagerReconcilePeriod meta/v1.Duration | cpuManagerReconcilePeriod is the reconciliation period for the CPU Manager. Requires the CPUManager feature gate to be enabled. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that shortening the period may carry a performance impact. Default: "10s" |
memoryManagerPolicy string | memoryManagerPolicy is the name of the policy to use by the memory manager. Requires the MemoryManager feature gate to be enabled. Dynamic Kubelet Config (deprecated): This field should not be updated without a full node reboot. It is safest to keep this value the same as the local config. Default: "none" |
topologyManagerPolicy string | topologyManagerPolicy is the name of the topology manager policy to use. Valid values include: "none", "best-effort", "restricted", and "single-numa-node". Policies other than "none" require the TopologyManager feature gate to be enabled. Dynamic Kubelet Config (deprecated): This field should not be updated without a full node reboot. It is safest to keep this value the same as the local config. Default: "none" |
topologyManagerScope string | topologyManagerScope represents the scope of topology hint generation that topology manager requests and hint providers generate. Valid values include: "container" and "pod". The "pod" scope requires the TopologyManager feature gate to be enabled. Default: "container" |
qosReserved map[string]string | qosReserved is a set of resource name to percentage pairs that specify the minimum percentage of a resource reserved for exclusive use by the guaranteed QoS tier. Currently supported resources: "memory". Requires the QOSReserved feature gate to be enabled. Dynamic Kubelet Config (deprecated): This field should not be updated without a full node reboot. It is safest to keep this value the same as the local config. Default: nil |
runtimeRequestTimeout meta/v1.Duration | runtimeRequestTimeout is the timeout for all runtime requests except long-running requests: pull, logs, exec, and attach. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that it may disrupt components that interact with the Kubelet server. Default: "2m" |
hairpinMode string | hairpinMode specifies how the Kubelet should configure the container bridge for hairpin packets. Setting this flag allows endpoints in a Service to load-balance back to themselves if they should try to access their own Service. Values: "promiscuous-bridge" (make the container bridge promiscuous), "hairpin-veth" (set the hairpin flag on container veth interfaces), and "none" (do nothing). Generally, one must set --hairpin-mode=hairpin-veth to achieve hairpin NAT, because promiscuous-bridge assumes the existence of a container bridge named cbr0. Default: "promiscuous-bridge" |
maxPods int32 | maxPods is the maximum number of Pods that can run on this Kubelet. The value must be a non-negative integer. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that changes may cause Pods to fail admission on Kubelet restart, and may change the value reported in Node.Status.Capacity[v1.ResourcePods], thus affecting future scheduling decisions. Increasing this value may also decrease performance, as more Pods can be packed into a single node. Default: 110 |
podCIDR string | podCIDR is the CIDR to use for pod IP addresses, only used in standalone mode. In cluster mode, this is obtained from the control plane. Dynamic Kubelet Config (deprecated): This field should always be set to the empty default. It should only be set for standalone Kubelets, which cannot use Dynamic Kubelet Config. Default: "" |
podPidsLimit int64 | podPidsLimit is the maximum number of PIDs in any pod. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that lowering it may prevent container processes from forking after the change. Default: -1 |
resolvConf string | resolvConf is the resolver configuration file used as the basis for the container DNS resolution configuration. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that changes will only take effect on Pods created after the update. Draining the node is recommended before changing this field. If set to the empty string, it will override the default and effectively disable DNS lookups. Default: "/etc/resolv.conf" |
runOnce bool | runOnce causes the Kubelet to check the API server once for pods, run those in addition to the pods specified by static pod files, and exit. Default: false |
cpuCFSQuota bool | cpuCFSQuota enables CPU CFS quota enforcement for containers that specify CPU limits. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that disabling it may reduce node stability. Default: true |
cpuCFSQuotaPeriod meta/v1.Duration | cpuCFSQuotaPeriod is the CPU CFS quota period value, cpu.cfs_period_us. Requires the CustomCPUCFSQuotaPeriod feature gate to be enabled. Default: "100ms" |
nodeStatusMaxImages int32 | nodeStatusMaxImages caps the number of images reported in Node.status.images. The value must be greater than -2. Note: If -1 is specified, no cap will be applied. If 0 is specified, no image is returned. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that different values can be reported on node status. Default: 50 |
maxOpenFiles int64 | maxOpenFiles is the number of files that can be opened by the Kubelet process. The value must be a non-negative number. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that it may impact the ability of the Kubelet to interact with the node's filesystem. Default: 1000000 |
contentType string | contentType is the contentType of requests sent to the API server. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that it may impact the ability for the Kubelet to communicate with the API server. If the Kubelet loses contact with the API server due to a change to this field, the change cannot be reverted via dynamic Kubelet config. Default: "application/vnd.kubernetes.protobuf" |
kubeAPIQPS int32 | kubeAPIQPS is the QPS to use while talking with the kubernetes API server. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that it may impact scalability by changing the amount of traffic the Kubelet sends to the API server. Default: 5 |
kubeAPIBurst int32 | kubeAPIBurst is the burst to allow while talking with the kubernetes API server. This field cannot be a negative number. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that it may impact scalability by changing the amount of traffic the Kubelet sends to the API server. Default: 10 |
serializeImagePulls bool | serializeImagePulls, when enabled, tells the Kubelet to pull images one at a time. We recommend not changing the default value on nodes that run a docker daemon with version < 1.9 or an Aufs storage backend. Issue #10959 has more details. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that it may impact the performance of image pulls. Default: true |
evictionHard map[string]string | evictionHard is a map of signal names to quantities that defines hard eviction thresholds. For example: memory.available: "300Mi" (see the configuration sketch after this table for a fuller example). Default: memory.available: "100Mi", nodefs.available: "10%", nodefs.inodesFree: "5%", imagefs.available: "15%" |
evictionSoft map[string]string | evictionSoft is a map of signal names to quantities that defines soft eviction thresholds. For example: memory.available: "300Mi". Default: nil |
evictionSoftGracePeriod map[string]string | evictionSoftGracePeriod is a map of signal names to quantities that defines grace periods for each soft eviction signal. For example: memory.available: "30s". Default: nil |
evictionPressureTransitionPeriod meta/v1.Duration | evictionPressureTransitionPeriod is the duration for which the kubelet has to wait before transitioning out of an eviction pressure condition. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that lowering it may decrease the stability of the node when the node is overcommitted. Default: "5m" |
evictionMaxPodGracePeriod int32 | evictionMaxPodGracePeriod is the maximum allowed grace period (in seconds) to use when terminating pods in response to a soft eviction threshold being met. This value effectively caps the Pod's terminationGracePeriodSeconds value during soft evictions. Note: Due to issue #64530, the behavior has a bug where this value currently just overrides the grace period during soft eviction, which can increase the grace period from what is set on the Pod. This bug will be fixed in a future release. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that lowering it decreases the amount of time Pods will have to gracefully clean up before being killed during a soft eviction. Default: 0 |
evictionMinimumReclaim map[string]string | evictionMinimumReclaim is a map of signal names to quantities that defines minimum reclaims, which describe the minimum amount of a given resource the kubelet will reclaim when performing a pod eviction while that resource is under pressure. For example: imagefs.available: "2Gi". Default: nil |
podsPerCore int32 | podsPerCore is the maximum number of pods per core. Cannot exceed maxPods. The value must be a non-negative integer. If 0, there is no limit on the number of Pods. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that changes may cause Pods to fail admission on Kubelet restart, and may change the value reported in Node.Status.Capacity[v1.ResourcePods], thus affecting future scheduling decisions. Default: 0 |
enableControllerAttachDetach bool | enableControllerAttachDetach enables the Attach/Detach controller to manage attachment/detachment of volumes scheduled to this node, and disables the kubelet from executing any attach/detach operations. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that changing which component is responsible for volume management on a live node may result in volumes refusing to detach if the node is not drained prior to the update, and if Pods are scheduled to the node before the volumes.kubernetes.io/controller-managed-attach-detach annotation is updated by the Kubelet. In general, it is safest to leave this value set the same as the local config. Default: true |
protectKernelDefaults bool | protectKernelDefaults, if true, causes the Kubelet to error if kernel flags are not as it expects. Otherwise, the Kubelet will attempt to modify kernel flags to match its expectation. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that enabling it may cause the Kubelet to crash-loop if the kernel is not configured as the Kubelet expects. Default: false |
makeIPTablesUtilChains bool | makeIPTablesUtilChains, if true, causes the Kubelet to ensure that a set of iptables rules is present on the host. These rules will serve as utility rules for various components, e.g. kube-proxy. The rules will be created based on iptablesMasqueradeBit and iptablesDropBit. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that disabling it will prevent the Kubelet from healing locally misconfigured iptables rules. Default: true |
iptablesMasqueradeBit int32 | iptablesMasqueradeBit is the bit of the iptables fwmark space to mark for SNAT. Values must be within the range [0, 31]. Must be different from other mark bits. Warning: Please match the value of the corresponding parameter in kube-proxy. TODO: clean up IPTablesMasqueradeBit in kube-proxy. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that it needs to be coordinated with other components, like kube-proxy, and the update will only be effective if MakeIPTablesUtilChains is enabled. Default: 14 |
iptablesDropBit int32 | iptablesDropBit is the bit of the iptables fwmark space to mark for dropping packets. Values must be within the range [0, 31]. Must be different from other mark bits. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that it needs to be coordinated with other components, like kube-proxy, and the update will only be effective if MakeIPTablesUtilChains is enabled. Default: 15 |
featureGates map[string]bool | featureGates is a map of feature names to bools that enable or disable experimental features. This field modifies piecemeal the built-in default values from "k8s.io/kubernetes/pkg/features/kube_features.go". If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider the documentation for the features you are enabling or disabling. While we encourage feature developers to make it possible to dynamically enable and disable features, some changes may require node reboots, and some features may require careful coordination to retroactively disable. Default: nil |
failSwapOn bool | failSwapOn tells the Kubelet to fail to start if swap is enabled on the node. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that setting it to true will cause the Kubelet to crash-loop if swap is enabled. Default: true |
memorySwap MemorySwapConfiguration | memorySwap configures swap memory available to container workloads. |
containerLogMaxSize string | containerLogMaxSize is a quantity defining the maximum size of the container log file before it is rotated. For example: "5Mi" or "256Ki". If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that it may trigger log rotation. Default: "10Mi" |
containerLogMaxFiles int32 | containerLogMaxFiles specifies the maximum number of container log files that can be present for a container. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that lowering it may cause log files to be deleted. Default: 5 |
configMapAndSecretChangeDetectionStrategy ResourceChangeDetectionStrategy | configMapAndSecretChangeDetectionStrategy is a mode in which the ConfigMap and Secret managers are running. Valid values include: "Get" (kubelet fetches necessary objects directly from the API server), "Cache" (kubelet uses a TTL cache for objects fetched directly from the API server), and "Watch" (kubelet uses watches to observe changes to objects that are in its interest). Default: "Watch" |
systemReserved map[string]string | systemReserved is a set of ResourceName=ResourceQuantity (e.g. cpu=200m,memory=150G) pairs that describe resources reserved for non-kubernetes components. Currently only cpu and memory are supported. See http://kubernetes.io/docs/user-guide/compute-resources for more detail. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that it may not be possible to increase the reserved resources, because this requires resizing cgroups. Always look for a NodeAllocatableEnforced event after updating this field to ensure that the update was successful. Default: nil |
kubeReserved map[string]string | kubeReserved is a set of ResourceName=ResourceQuantity (e.g. cpu=200m,memory=150G) pairs that describe resources reserved for kubernetes system components. Currently cpu, memory, and local storage for the root file system are supported. See https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/ for more details. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that it may not be possible to increase the reserved resources, because this requires resizing cgroups. Always look for a NodeAllocatableEnforced event after updating this field to ensure that the update was successful. Default: nil |
reservedSystemCPUs [Required]string | The reservedSystemCPUs option specifies the CPU list reserved for host-level system threads and kubernetes-related threads. This provides a "static" CPU list rather than the "dynamic" list computed from systemReserved and kubeReserved. This option does not support systemReservedCgroup or kubeReservedCgroup. |
showHiddenMetricsForVersion string | showHiddenMetricsForVersion is the previous version for which you want to show hidden metrics. Only the previous minor version is meaningful; other values will not be allowed. The format is <major>.<minor>, e.g. "1.16". Default: "" |
systemReservedCgroup string | systemReservedCgroup helps the kubelet identify the absolute name of the top-level CGroup used to enforce the systemReserved compute resource reservation for OS system daemons. Refer to the Node Allocatable documentation for more information. Default: "" |
kubeReservedCgroup string | kubeReservedCgroup helps the kubelet identify the absolute name of the top-level CGroup used to enforce the kubeReserved compute resource reservation for Kubernetes node system daemons. Refer to the Node Allocatable documentation for more information. Default: "" |
enforceNodeAllocatable []string | This flag specifies the various Node Allocatable enforcements that Kubelet needs to perform. This flag accepts a list of options. Acceptable options are none, pods, system-reserved, and kube-reserved. If system-reserved or kube-reserved is specified, systemReservedCgroup or kubeReservedCgroup must also be set, respectively. If none is specified, no other options may be specified. Default: ["pods"] |
allowedUnsafeSysctls []string | A comma-separated whitelist of unsafe sysctls or sysctl patterns (ending in *). Unsafe sysctl groups are kernel.shm*, kernel.msg*, kernel.sem, fs.mqueue.*, and net.*. For example: "kernel.msg*,net.ipv4.route.min_pmtu". Default: [] |
volumePluginDir string | volumePluginDir is the full path of the directory in which to search for additional third party volume plugins. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that changing the volumePluginDir may disrupt workloads relying on third party volume plugins. Default: "/usr/libexec/kubernetes/kubelet-plugins/volume/exec/" |
providerID string | providerID, if set, sets the unique ID of the instance that an external provider (i.e. a cloudprovider) can use to identify a specific node. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that it may impact the ability of the Kubelet to interact with cloud providers. Default: "" |
kernelMemcgNotification bool | kernelMemcgNotification, if set, instructs the kubelet to integrate with the kernel memcg notification for determining if memory eviction thresholds are exceeded rather than polling. If DynamicKubeletConfig (deprecated; default off) is on, when dynamically updating this field, consider that it may impact the way the Kubelet interacts with the kernel. Default: false |
logging [Required]LoggingConfiguration | logging specifies the options of logging. Refer to the Logs Options documentation for more information. Default: Format: text |
enableSystemLogHandler bool | enableSystemLogHandler enables system logs via the web interface host:port/logs/. Default: true |
shutdownGracePeriod meta/v1.Duration | shutdownGracePeriod specifies the total duration that the node should delay the shutdown and total grace period for pod termination during a node shutdown. Default: "0s" |
shutdownGracePeriodCriticalPods meta/v1.Duration | shutdownGracePeriodCriticalPods specifies the duration used to terminate critical pods during a node shutdown. This should be less than shutdownGracePeriod. For example, if shutdownGracePeriod=30s and shutdownGracePeriodCriticalPods=10s, during a node shutdown the first 20 seconds would be reserved for gracefully terminating normal pods, and the last 10 seconds would be reserved for terminating critical pods. Default: "0s" |
shutdownGracePeriodByPodPriority []ShutdownGracePeriodByPodPriority | shutdownGracePeriodByPodPriority specifies the shutdown grace period for Pods based on their associated priority class value. When a shutdown request is received, the Kubelet will initiate shutdown on all pods running on the node with a grace period that depends on the priority of the pod, and then wait for all pods to exit. Each entry in the array represents the graceful shutdown time a pod with a priority class value that lies in the range of that value and the next higher entry in the list when the node is shutting down. For example, to allow critical pods 10s to shut down, priority>=10000 pods 20s to shut down, and all remaining pods 30s to shut down: shutdownGracePeriodByPodPriority: [{priority: 2000000000, shutdownGracePeriodSeconds: 10}, {priority: 10000, shutdownGracePeriodSeconds: 20}, {priority: 0, shutdownGracePeriodSeconds: 30}]. The time the Kubelet will wait before exiting will at most be the maximum of all shutdownGracePeriodSeconds for each priority class range represented on the node. When all pods have exited or reached their grace periods, the Kubelet will release the shutdown inhibit lock. Requires the GracefulNodeShutdown feature gate to be enabled. This configuration must be empty if either ShutdownGracePeriod or ShutdownGracePeriodCriticalPods is set. Default: nil |
reservedMemory []MemoryReservation | reservedMemory specifies a comma-separated list of memory reservations for NUMA nodes. The parameter makes sense only in the context of the memory manager feature. The memory manager will not allocate reserved memory for container workloads. For example, if you have a NUMA0 with 10Gi of memory and reservedMemory was specified to reserve 1Gi of memory at NUMA0, the memory manager will assume that only 9Gi is available for allocation. You can specify reservations for different NUMA nodes and memory types. You can omit this parameter entirely, but you should be aware that the amount of reserved memory from all NUMA nodes should be equal to the amount of memory specified by the node allocatable. If at least one node allocatable parameter has a non-zero value, you will need to specify at least one NUMA node. Also, avoid specifying: duplicates (the same NUMA node and memory type, but with a different value); zero limits for any memory type; NUMA node IDs that do not exist on the machine; memory types other than memory and hugepages-<size>. Default: nil |
enableProfilingHandler bool | enableProfilingHandler enables profiling via the web interface host:port/debug/pprof/. Default: true |
enableDebugFlagsHandler bool | enableDebugFlagsHandler enables the flags endpoint via the web interface host:port/debug/flags/v. Default: true |
seccompDefault bool | seccompDefault enables the use of RuntimeDefault as the default seccomp profile for all workloads. Requires the corresponding SeccompDefault feature gate to be enabled. Default: false |
memoryThrottlingFactor float64 | memoryThrottlingFactor specifies the factor multiplied by the memory limit or node allocatable memory when setting the cgroupv2 memory.high value to enforce MemoryQoS. Decreasing this factor sets a lower high limit for container cgroups and applies heavier reclaim pressure, while increasing it applies less reclaim pressure. See http://kep.k8s.io/2570 for more details. Default: 0.8 |
registerWithTaints []core/v1.Taint | registerWithTaints is a list of taints to add to the Node object when the kubelet registers itself. This only takes effect when registerNode is true, and only upon the initial registration of the node. Default: nil |
registerNode bool | registerNode enables automatic registration with the API server. Default: true |
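Pulling a few of the fields above together, here is a minimal KubeletConfiguration sketch (all quantities and priority cut-offs are illustrative, not recommended values):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Hard eviction thresholds (illustrative quantities).
evictionHard:
  memory.available: "300Mi"
  nodefs.available: "10%"
# A soft threshold with its matching grace period.
evictionSoft:
  memory.available: "500Mi"
evictionSoftGracePeriod:
  memory.available: "1m"
evictionMinimumReclaim:
  imagefs.available: "2Gi"
# Graceful node shutdown by pod priority
# (requires the GracefulNodeShutdown feature gate).
shutdownGracePeriodByPodPriority:
  - priority: 2000000000      # critical pods get 10s
    shutdownGracePeriodSeconds: 10
  - priority: 10000
    shutdownGracePeriodSeconds: 20
  - priority: 0               # everything else gets 30s
    shutdownGracePeriodSeconds: 30
```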
SerializedNodeConfigSource
SerializedNodeConfigSource allows us to serialize v1.NodeConfigSource. This type is used internally by the Kubelet for tracking checkpointed dynamic configs. It exists in the kubeletconfig API group because it is classified as a versioned input to the Kubelet.
Field | Description |
---|---|
apiVersion string | kubelet.config.k8s.io/v1beta1 |
kind string | SerializedNodeConfigSource |
source core/v1.NodeConfigSource | source is the source that we are serializing. |
KubeletAnonymousAuthentication
Appears in: KubeletAuthentication

Field | Description |
---|---|
enabled bool | enabled allows anonymous requests to the kubelet server. Requests that are not rejected by another authentication method are treated as anonymous requests. Anonymous requests have a username of system:anonymous, and a group name of system:unauthenticated. |
KubeletAuthentication
Appears in: KubeletConfiguration

Field | Description |
---|---|
x509 KubeletX509Authentication | x509 contains settings related to x509 client certificate authentication. |
webhook KubeletWebhookAuthentication | webhook contains settings related to webhook bearer token authentication. |
anonymous KubeletAnonymousAuthentication | anonymous contains settings related to anonymous authentication. |
KubeletAuthorization
Appears in: KubeletConfiguration

Field | Description |
---|---|
mode KubeletAuthorizationMode | mode is the authorization mode to apply to requests to the kubelet server. Valid values are AlwaysAllow and Webhook. Webhook mode uses the SubjectAccessReview API to determine authorization. |
webhook KubeletWebhookAuthorization | webhook contains settings related to Webhook authorization. |
KubeletAuthorizationMode
(Alias of string)

Appears in: KubeletAuthorization
KubeletWebhookAuthentication
Appears in: KubeletAuthentication

Field | Description |
---|---|
enabled bool | enabled allows bearer token authentication backed by the tokenreviews.authentication.k8s.io API. |
cacheTTL meta/v1.Duration | cacheTTL enables caching of authentication results. |
KubeletWebhookAuthorization
Appears in: KubeletAuthorization

Field | Description |
---|---|
cacheAuthorizedTTL meta/v1.Duration | cacheAuthorizedTTL is the duration to cache 'authorized' responses from the webhook authorizer. |
cacheUnauthorizedTTL meta/v1.Duration | cacheUnauthorizedTTL is the duration to cache 'unauthorized' responses from the webhook authorizer. |
KubeletX509Authentication
Appears in: KubeletAuthentication

Field | Description |
---|---|
clientCAFile string | clientCAFile is the path to a PEM-encoded certificate bundle. If set, any request presenting a client certificate signed by one of the authorities in the bundle is authenticated with a username corresponding to the CommonName, and groups corresponding to the Organization in the client certificate. |
MemoryReservation
Appears in: KubeletConfiguration

MemoryReservation specifies the memory reservation of different types for each NUMA node.

Field | Description |
---|---|
numaNode [Required]int32 | No description provided. |
limits [Required]core/v1.ResourceList | No description provided. |
MemorySwapConfiguration
Appears in: KubeletConfiguration

Field | Description |
---|---|
swapBehavior string | swapBehavior configures swap memory available to container workloads. May be one of: "" (default), "LimitedSwap" (workload combined memory and swap usage cannot exceed the pod memory limit), or "UnlimitedSwap" (workloads can use unlimited swap, up to the allocatable limit). |
ResourceChangeDetectionStrategy
(Alias of string)

Appears in: KubeletConfiguration
ResourceChangeDetectionStrategy denotes a mode in which internal managers (secret, configmap) are discovering object changes.
ShutdownGracePeriodByPodPriority
Appears in: KubeletConfiguration

ShutdownGracePeriodByPodPriority specifies the shutdown grace period for Pods based on their associated priority class value.

Field | Description |
---|---|
priority [Required]int32 | priority is the priority value associated with the shutdown grace period. |
shutdownGracePeriodSeconds [Required]int64 | shutdownGracePeriodSeconds is the shutdown grace period in seconds. |
FormatOptions
Appears in: LoggingConfiguration

FormatOptions contains options for the different logging formats.

Field | Description |
---|---|
json [Required]JSONOptions | [Experimental] JSON contains options for logging format "json". |
JSONOptions
Appears in: FormatOptions

JSONOptions contains options for logging format "json".

Field | Description |
---|---|
splitStream [Required]bool | [Experimental] SplitStream redirects error messages to stderr while info messages go to stdout, with buffering. The default is to write both to stdout, without buffering. |
infoBufferSize [Required]k8s.io/apimachinery/pkg/api/resource.QuantityValue | [Experimental] InfoBufferSize sets the size of the info stream when using split streams. The default is zero, which disables buffering. |
LoggingConfiguration
Appears in:
LoggingConfiguration contains logging options Refer Logs Options for more information.
Field | Description |
---|---|
format [Required] string | Format specifies the structure of log messages. The default value of format is "text". |
flushFrequency [Required] time.Duration | Maximum number of seconds between log flushes. Ignored if the selected logging backend writes log messages without buffering. |
verbosity [Required] uint32 | Verbosity is the threshold that determines which log messages are logged. Default is zero which logs only the most important messages. Higher values enable additional messages. Error messages are always logged. |
vmodule [Required] VModuleConfiguration | VModule overrides the verbosity threshold for individual files. Only supported for "text" log format. |
sanitization [Required] bool | [Experimental] When enabled prevents logging of fields tagged as sensitive (passwords, keys, tokens). Runtime log sanitization may introduce significant computation overhead and therefore should not be enabled in production. |
options [Required] FormatOptions | [Experimental] Options holds additional parameters that are specific to the different logging formats. Only the options for the selected format get used, but all of them get validated. |
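Putting these fields together, a hedged example of a kubelet configuration that switches to structured JSON logs at a higher verbosity would be:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
logging:
  # Use the "json" structured format instead of the default "text".
  format: json
  verbosity: 3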
VModuleConfiguration
(Alias of []k8s.io/component-base/config/v1alpha1.VModuleItem)
Appears in:
VModuleConfiguration is a collection of individual file names or patterns and the corresponding verbosity threshold.
6.12.14 - Kubelet CredentialProvider (v1alpha1)
Resource Types
CredentialProviderRequest
CredentialProviderRequest includes the image that the kubelet requires authentication for. Kubelet will pass this request object to the plugin via stdin. In general, plugins should prefer responding with the same apiVersion they were sent.
Field | Description |
---|---|
apiVersion string | credentialprovider.kubelet.k8s.io/v1alpha1 |
kind string | CredentialProviderRequest |
image [Required] string | image is the container image that is being pulled as part of the credential provider plugin request. Plugins may optionally parse the image to extract any information required to fetch credentials. |
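As an illustration, the serialized request a plugin reads from stdin could look like the following; the image name is a hypothetical example:
{
  "apiVersion": "credentialprovider.kubelet.k8s.io/v1alpha1",
  "kind": "CredentialProviderRequest",
  "image": "registry.example.com/myapp:v1.2"
}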
CredentialProviderResponse
CredentialProviderResponse holds credentials that the kubelet should use for the specified image provided in the original request. Kubelet will read the response from the plugin via stdout. This response should be set to the same apiVersion as CredentialProviderRequest.
Field | Description |
---|---|
apiVersion string | credentialprovider.kubelet.k8s.io/v1alpha1 |
kind string | CredentialProviderResponse |
cacheKeyType [Required] PluginCacheKeyType | cacheKeyType indicates the type of caching key to use based on the image provided in the request. There are three valid values for the cache key type: Image, Registry, and Global. If an invalid value is specified, the response will NOT be used by the kubelet. |
cacheDuration meta/v1.Duration | cacheDuration indicates the duration the provided credentials should be cached for. The kubelet will use this field to set the in-memory cache duration for credentials in the AuthConfig. If null, the kubelet will use defaultCacheDuration provided in CredentialProviderConfig. If set to 0, the kubelet will not cache the provided AuthConfig. |
auth map[string]k8s.io/kubelet/pkg/apis/credentialprovider/v1alpha1.AuthConfig | auth is a map containing authentication information passed into the kubelet. Each key is a match image string (more on this below). The corresponding authConfig value should be valid for all images that match against this key. A plugin should set this field to null if no valid credentials can be returned for the requested image. Each key in the map is a pattern which can optionally contain a port and a path. Globs can be used in the domain, but not in the port or the path. Globs are supported as subdomains like '*.k8s.io' or 'k8s.*.io', and top-level domains such as 'k8s.*'. Matching partial subdomains like 'app*.k8s.io' is also supported. Each glob can only match a single subdomain segment, so *.io does not match *.k8s.io. The kubelet will match images against the key when all of the below are true: both contain the same number of domain parts and each part matches; the URL path of the key must be a prefix of the target image URL path; and if the key contains a port, the port must match in the image as well. When multiple keys are returned, the kubelet will traverse all keys in reverse order so that longer keys come before shorter keys with the same prefix, and non-wildcard keys come before wildcard keys with the same prefix. For any given match, the kubelet will attempt an image pull with the provided credentials, stopping after the first successfully authenticated pull. Example keys: 123456789.dkr.ecr.us-east-1.amazonaws.com, *.azurecr.io, gcr.io, *.*.registry.io, registry.io:8080/path. |
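A sketch of a matching response written to stdout, assuming registry-level caching for six hours and a hypothetical wildcard match key:
{
  "apiVersion": "credentialprovider.kubelet.k8s.io/v1alpha1",
  "kind": "CredentialProviderResponse",
  "cacheKeyType": "Registry",
  "cacheDuration": "6h",
  "auth": {
    "*.example.com": {
      "username": "ci-puller",
      "password": "<token>"
    }
  }
}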
AuthConfig
Appears in:
AuthConfig contains authentication information for a container registry. Only username/password based authentication is supported today, but more authentication mechanisms may be added in the future.
Field | Description |
---|---|
username [Required] string | username is the username used for authenticating to the container registry. An empty username is valid. |
password [Required] string | password is the password used for authenticating to the container registry. An empty password is valid. |
PluginCacheKeyType
(Alias of string)
Appears in:
6.12.15 - WebhookAdmission Configuration (v1)
Package v1 is the v1 version of the API.
Resource Types
WebhookAdmission
WebhookAdmission provides configuration for the webhook admission controller.
Field | Description |
---|---|
apiVersion string | apiserver.config.k8s.io/v1 |
kind string | WebhookAdmission |
kubeConfigFile [Required] string | KubeConfigFile is the path to the kubeconfig file. |
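For illustration, the contents of such a configuration file might look like the following sketch; the kubeconfig path is an assumption and must point at credentials the API server can use to reach your webhook:
apiVersion: apiserver.config.k8s.io/v1
kind: WebhookAdmission
# Assumed path; adjust to your deployment.
kubeConfigFile: /etc/kubernetes/admission-webhook.kubeconfig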
6.13 - Scheduling
6.13.1 - Scheduler Configuration
Kubernetes v1.19 [beta]
You can customize the behavior of the kube-scheduler
by writing a configuration
file and passing its path as a command line argument.
You can specify scheduling profiles by running kube-scheduler --config <filename>, using the KubeSchedulerConfiguration (v1beta2 or v1beta3) struct.
A minimal configuration looks as follows:
apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
clientConnection:
kubeconfig: /etc/srv/kubernetes/kube-scheduler/kubeconfig
Profiles
A scheduling Profile allows you to configure the different stages of scheduling in the kube-scheduler. Each stage is exposed in an extension point. Plugins provide scheduling behaviors by implementing one or more of these extension points.
You can configure a single instance of kube-scheduler
to run
multiple profiles.
Extension points
Scheduling happens in a series of stages that are exposed through the following extension points:
- queueSort: These plugins provide an ordering function that is used to sort pending Pods in the scheduling queue. Exactly one queue sort plugin may be enabled at a time.
- preFilter: These plugins are used to pre-process or check information about a Pod or the cluster before filtering. They can mark a pod as unschedulable.
- filter: These plugins are the equivalent of Predicates in a scheduling Policy and are used to filter out nodes that cannot run the Pod. Filters are called in the configured order. A pod is marked as unschedulable if no nodes pass all the filters.
- postFilter: These plugins are called in their configured order when no feasible nodes were found for the pod. If any postFilter plugin marks the Pod schedulable, the remaining plugins are not called.
- preScore: This is an informational extension point that can be used for doing pre-scoring work.
- score: These plugins provide a score to each node that has passed the filtering phase. The scheduler will then select the node with the highest weighted scores sum.
- reserve: This is an informational extension point that notifies plugins when resources have been reserved for a given Pod. Plugins also implement an Unreserve call that gets called in the case of failure during or after Reserve.
- permit: These plugins can prevent or delay the binding of a Pod.
- preBind: These plugins perform any work required before a Pod is bound.
- bind: These plugins bind a Pod to a Node. bind plugins are called in order and once one has done the binding, the remaining plugins are skipped. At least one bind plugin is required.
- postBind: This is an informational extension point that is called after a Pod has been bound.
- multiPoint: This is a config-only field that allows plugins to be enabled or disabled for all of their applicable extension points simultaneously.
For each extension point, you could disable specific default plugins or enable your own. For example:
apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
profiles:
- plugins:
score:
disabled:
- name: PodTopologySpread
enabled:
- name: MyCustomPluginA
weight: 2
- name: MyCustomPluginB
weight: 1
You can use *
as name in the disabled array to disable all default plugins
for that extension point. This can also be used to rearrange plugins order, if
desired.
Scheduling plugins
The following plugins, enabled by default, implement one or more of these extension points:
- ImageLocality: Favors nodes that already have the container images that the Pod runs. Extension points: score.
- TaintToleration: Implements taints and tolerations. Extension points: filter, preScore, score.
- NodeName: Checks if a Pod spec node name matches the current node. Extension points: filter.
- NodePorts: Checks if a node has free ports for the requested Pod ports. Extension points: preFilter, filter.
- NodeAffinity: Implements node selectors and node affinity. Extension points: filter, score.
- PodTopologySpread: Implements Pod topology spread. Extension points: preFilter, filter, preScore, score.
- NodeUnschedulable: Filters out nodes that have .spec.unschedulable set to true. Extension points: filter.
- NodeResourcesFit: Checks if the node has all the resources that the Pod is requesting. The score can use one of three strategies: LeastAllocated (default), MostAllocated and RequestedToCapacityRatio. Extension points: preFilter, filter, score.
- NodeResourcesBalancedAllocation: Favors nodes that would obtain a more balanced resource usage if the Pod is scheduled there. Extension points: score.
- VolumeBinding: Checks if the node has or if it can bind the requested volumes. Extension points: preFilter, filter, reserve, preBind, score. Note: the score extension point is enabled when the VolumeCapacityPriority feature is enabled. It prioritizes the smallest PVs that can fit the requested volume size.
- VolumeRestrictions: Checks that volumes mounted in the node satisfy restrictions that are specific to the volume provider. Extension points: filter.
- VolumeZone: Checks that volumes requested satisfy any zone requirements they might have. Extension points: filter.
- NodeVolumeLimits: Checks that CSI volume limits can be satisfied for the node. Extension points: filter.
- EBSLimits: Checks that AWS EBS volume limits can be satisfied for the node. Extension points: filter.
- GCEPDLimits: Checks that GCP-PD volume limits can be satisfied for the node. Extension points: filter.
- AzureDiskLimits: Checks that Azure disk volume limits can be satisfied for the node. Extension points: filter.
- InterPodAffinity: Implements inter-Pod affinity and anti-affinity. Extension points: preFilter, filter, preScore, score.
- PrioritySort: Provides the default priority based sorting. Extension points: queueSort.
- DefaultBinder: Provides the default binding mechanism. Extension points: bind.
- DefaultPreemption: Provides the default preemption mechanism. Extension points: postFilter.
You can also enable the following plugins, through the component config APIs, that are not enabled by default:
- SelectorSpread: Favors spreading across nodes for Pods that belong to Services, ReplicaSets and StatefulSets. Extension points: preScore, score.
- CinderLimits: Checks that OpenStack Cinder volume limits can be satisfied for the node. Extension points: filter.
Multiple profiles
You can configure kube-scheduler
to run more than one profile.
Each profile has an associated scheduler name and can have a different set of
plugins configured in its extension points.
With the following sample configuration, the scheduler will run with two profiles: one with the default plugins and one with all scoring plugins disabled.
apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
- schedulerName: no-scoring-scheduler
plugins:
preScore:
disabled:
- name: '*'
score:
disabled:
- name: '*'
Pods that want to be scheduled according to a specific profile can include the corresponding scheduler name in their .spec.schedulerName.
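For example, a Pod that should be handled by the no-scoring-scheduler profile from the sample above could be declared like this; the Pod name and image are illustrative:
apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
spec:
  # Must match the schedulerName of an existing profile.
  schedulerName: no-scoring-scheduler
  containers:
  - name: app
    image: nginx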
By default, one profile with the scheduler name default-scheduler
is created.
This profile includes the default plugins described above. When declaring more
than one profile, a unique scheduler name for each of them is required.
If a Pod doesn't specify a scheduler name, kube-apiserver will set it to
default-scheduler
. Therefore, a profile with this scheduler name should exist
to get those pods scheduled.
Note: Pod scheduling events use .spec.schedulerName as the ReportingController. Events for leader election use the scheduler name of the first profile in the list.
Note: All profiles must use the same plugin in the queueSort extension point and have the same configuration parameters (if applicable). This is because the scheduler only has one pending pods queue.
Plugins that apply to multiple extension points
Starting from kubescheduler.config.k8s.io/v1beta3
, there is an additional field in the
profile config, multiPoint
, which allows for easily enabling or disabling a plugin
across several extension points. The intent of multiPoint
config is to simplify the
configuration needed for users and administrators when using custom profiles.
Consider a plugin, MyPlugin, which implements the preScore, score, preFilter, and filter extension points. To enable MyPlugin for all its available extension points, the profile config looks like:
apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: multipoint-scheduler
plugins:
multiPoint:
enabled:
- name: MyPlugin
This would equate to manually enabling MyPlugin
for all of its extension
points, like so:
apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: non-multipoint-scheduler
plugins:
preScore:
enabled:
- name: MyPlugin
score:
enabled:
- name: MyPlugin
preFilter:
enabled:
- name: MyPlugin
filter:
enabled:
- name: MyPlugin
One benefit of using multiPoint
here is that if MyPlugin
implements another
extension point in the future, the multiPoint
config will automatically enable it
for the new extension.
Specific extension points can be excluded from MultiPoint
expansion using
the disabled
field for that extension point. This works with disabling default
plugins, non-default plugins, or with the wildcard ('*'
) to disable all plugins.
An example of this, disabling Score
and PreScore
, would be:
apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: non-multipoint-scheduler
plugins:
multiPoint:
enabled:
- name: 'MyPlugin'
preScore:
disabled:
- name: '*'
score:
disabled:
- name: '*'
In v1beta3
, all default plugins are enabled internally through MultiPoint
.
However, individual extension points are still available to allow flexible
reconfiguration of the default values (such as ordering and Score weights). For
example, consider two Score plugins DefaultScore1
and DefaultScore2
, each with
a weight of 1
. They can be reordered with different weights like so:
apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: multipoint-scheduler
plugins:
score:
enabled:
- name: 'DefaultScore2'
weight: 5
In this example, it's unnecessary to specify the plugins in MultiPoint
explicitly
because they are default plugins. And the only plugin specified in Score
is DefaultScore2
.
This is because plugins set through specific extension points will always take precedence
over MultiPoint
plugins. So, this snippet essentially re-orders the two plugins
without needing to specify both of them.
The general hierarchy for precedence when configuring MultiPoint
plugins is as follows:
- Specific extension points run first, and their settings override whatever is set elsewhere
- Plugins manually configured through MultiPoint and their settings
- Default plugins and their default settings
To demonstrate the above hierarchy, the following example is based on these plugins:
Plugin | Extension Points |
---|---|
DefaultQueueSort | QueueSort |
CustomQueueSort | QueueSort |
DefaultPlugin1 | Score, Filter |
DefaultPlugin2 | Score |
CustomPlugin1 | Score, Filter |
CustomPlugin2 | Score, Filter |
A valid sample configuration for these plugins would be:
apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: multipoint-scheduler
plugins:
multiPoint:
enabled:
- name: 'CustomQueueSort'
- name: 'CustomPlugin1'
weight: 3
- name: 'CustomPlugin2'
disabled:
- name: 'DefaultQueueSort'
filter:
disabled:
- name: 'DefaultPlugin1'
score:
enabled:
- name: 'DefaultPlugin2'
Note that there is no error for re-declaring a MultiPoint
plugin in a specific
extension point. The re-declaration is ignored (and logged), as specific extension points
take precedence.
Besides keeping most of the config in one spot, this sample does a few things:
- Enables the custom queueSort plugin and disables the default one
- Enables CustomPlugin1 and CustomPlugin2, which will run first for all of their extension points
- Disables DefaultPlugin1, but only for filter
- Reorders DefaultPlugin2 to run first in score (even before the custom plugins)
In versions of the config before v1beta3
, without multiPoint
, the above snippet would equate to this:
apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: multipoint-scheduler
plugins:
# Disable the default QueueSort plugin
queueSort:
enabled:
- name: 'CustomQueueSort'
disabled:
- name: 'DefaultQueueSort'
# Enable custom Filter plugins
filter:
enabled:
- name: 'CustomPlugin1'
- name: 'CustomPlugin2'
- name: 'DefaultPlugin2'
disabled:
- name: 'DefaultPlugin1'
# Enable and reorder custom score plugins
score:
enabled:
- name: 'DefaultPlugin2'
weight: 1
- name: 'DefaultPlugin1'
weight: 3
While this is a complicated example, it demonstrates the flexibility of MultiPoint
config
as well as its seamless integration with the existing methods for configuring extension points.
Scheduler configuration migrations
- With the v1beta2 configuration version, you can use a new score extension for the NodeResourcesFit plugin. The new extension combines the functionalities of the NodeResourcesLeastAllocated, NodeResourcesMostAllocated and RequestedToCapacityRatio plugins. For example, if you previously used the NodeResourcesMostAllocated plugin, you would instead use NodeResourcesFit (enabled by default) and add a pluginConfig with a scoreStrategy that is similar to:
apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
profiles:
- pluginConfig:
  - args:
      scoringStrategy:
        resources:
        - name: cpu
          weight: 1
        type: MostAllocated
    name: NodeResourcesFit
- The scheduler plugin NodeLabel is deprecated; instead, use the NodeAffinity plugin (enabled by default) to achieve similar behavior.
- The scheduler plugin ServiceAffinity is deprecated; instead, use the InterPodAffinity plugin (enabled by default) to achieve similar behavior.
- The scheduler plugin NodePreferAvoidPods is deprecated; instead, use node taints to achieve similar behavior.
- A plugin enabled in a v1beta2 configuration file takes precedence over the default configuration for that plugin.
- Invalid host or port configured for the scheduler healthz and metrics bind address will cause a validation failure.
- The weights of three plugins are increased by default:
  - InterPodAffinity from 1 to 2
  - NodeAffinity from 1 to 2
  - TaintToleration from 1 to 3
What's next
- Read the kube-scheduler reference
- Learn about scheduling
- Read the kube-scheduler configuration (v1beta2) reference
- Read the kube-scheduler configuration (v1beta3) reference
6.13.2 - Scheduling Policies
In Kubernetes versions before v1.23, you could use a scheduling policy to specify the predicates and priorities process. For example, you could set a scheduling policy by running kube-scheduler --policy-config-file <filename> or kube-scheduler --policy-configmap <ConfigMap>.
This scheduling policy is no longer supported since Kubernetes v1.23. The associated flags policy-config-file, policy-configmap, policy-configmap-namespace and use-legacy-policy-config are also not supported. Instead, use the Scheduler Configuration to achieve similar behavior.
What's next
- Learn about scheduling
- Learn about kube-scheduler Configuration
- Read the kube-scheduler configuration reference (v1beta3)
6.14 - Other Tools
Kubernetes contains several tools to help you work with the Kubernetes system.
crictl
crictl
is a command-line
interface for inspecting and debugging CRI-compatible
container runtimes.
Dashboard
Dashboard
, the web-based user interface of Kubernetes, allows you to deploy containerized applications
to a Kubernetes cluster, troubleshoot them, and manage the cluster and its
resources itself.
Helm
Helm is a third-party tool for managing packages of pre-configured Kubernetes resources. These packages are known as Helm charts.
Use Helm to:
- Find and use popular software packaged as Kubernetes charts
- Share your own applications as Kubernetes charts
- Create reproducible builds of your Kubernetes applications
- Intelligently manage your Kubernetes manifest files
- Manage releases of Helm packages
Kompose
Kompose
is a tool to help Docker Compose users move to Kubernetes.
Use Kompose to:
- Translate a Docker Compose file into Kubernetes objects
- Go from local Docker development to managing your application via Kubernetes
- Convert v1 or v2 Docker Compose
yaml
files or Distributed Application Bundles
Kui
Kui is a GUI tool that takes your normal kubectl command line requests and responds with graphics. Instead of ASCII tables, Kui provides a GUI rendering with tables that you can sort.
Kui lets you:
- Directly click on long, auto-generated resource names instead of copying and pasting
- Type in
kubectl
commands and see them execute, even sometimes faster thankubectl
itself - Query a Job and see its execution rendered as a waterfall diagram
- Click through resources in your cluster using a tabbed UI
Minikube
minikube
is a tool that
runs a single-node Kubernetes cluster locally on your workstation for
development and testing purposes.
6.14.1 - Mapping from dockercli to crictl
crictl
is a command-line interface for CRI-compatible container runtimes.
You can use it to inspect and debug container runtimes and applications on a
Kubernetes node. crictl
and its source are hosted in the
cri-tools repository.
This page provides a reference for mapping common commands for the docker
command-line tool into the equivalent commands for crictl
.
Mapping from docker CLI to crictl
The exact versions for the mapping table are for docker
CLI v1.40 and crictl
v1.19.0. This list is not exhaustive. For example, it doesn't include
experimental docker
CLI commands.
crictl
is similar to docker
CLI, despite some missing
columns for some CLI. Make sure to check output for the specific command if your
command output is being parsed programmatically.
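As a quick illustration of the mapping, the following near-equivalent commands list containers on a node; note that crictl typically requires root privileges and a configured CRI runtime endpoint:
# With the docker CLI
docker ps
# With crictl, talking to the node's CRI-compatible runtime
crictl ps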
Retrieve debugging information
docker cli | crictl | Description | Unsupported Features |
---|---|---|---|
attach | attach | Attach to a running container | --detach-keys, --sig-proxy |
exec | exec | Run a command in a running container | --privileged, --user, --detach-keys |
images | images | List images | |
info | info | Display system-wide information | |
inspect | inspect, inspecti | Return low-level information on a container, image or task | |
logs | logs | Fetch the logs of a container | --details |
ps | ps | List containers | |
stats | stats | Display a live stream of container(s) resource usage statistics | Column: NET/BLOCK I/O, PIDs |
version | version | Show the runtime (Docker, containerd, or others) version information | |
Perform Changes
docker cli | crictl | Description | Unsupported Features |
---|---|---|---|
create | create | Create a new container | |
kill | stop (timeout = 0) | Kill one or more running containers | --signal |
pull | pull | Pull an image or a repository from a registry | --all-tags, --disable-content-trust |
rm | rm | Remove one or more containers | |
rmi | rmi | Remove one or more images | |
run | run | Run a command in a new container | |
start | start | Start one or more stopped containers | --detach-keys |
stop | stop | Stop one or more running containers | |
update | update | Update configuration of one or more containers | --restart, --blkio-weight, and some other resource limits not supported by CRI |
Supported only in crictl
crictl | Description |
---|---|
imagefsinfo | Return image filesystem info |
inspectp | Display the status of one or more pods |
port-forward | Forward local port to a pod |
pods | List pods |
runp | Run a new pod |
rmp | Remove one or more pods |
stopp | Stop one or more running pods |
7 - Contribute to K8s docs
Kubernetes welcomes improvements from all contributors, new and experienced!
This website is maintained by Kubernetes SIG Docs.
Kubernetes documentation contributors:
- Improve existing content
- Create new content
- Translate the documentation
- Manage and publish the documentation parts of the Kubernetes release cycle
Getting started
Anyone can open an issue about documentation, or contribute a change with a
pull request (PR) to the
kubernetes/website
GitHub repository.
You need to be comfortable with
git and
GitHub
to work effectively in the Kubernetes community.
To get involved with documentation:
- Sign the CNCF Contributor License Agreement.
- Familiarize yourself with the documentation repository and the website's static site generator.
- Make sure you understand the basic processes for opening a pull request and reviewing changes.
Figure 1. Getting started for a new contributor.
Figure 1 outlines a roadmap for new contributors. You can follow some or all of the steps for Sign up and Review. After that, you are ready to open PRs that achieve your contribution objectives, with some examples listed under Open PR. Again, questions are always welcome!
Some tasks require more trust and more access in the Kubernetes organization. See Participating in SIG Docs for more details about roles and permissions.
Your first contribution
You can prepare for your first contribution by reviewing several steps beforehand. Figure 2 outlines the steps and the details follow.
Figure 2. Preparation for your first contribution.
- Read the Contribution overview to learn about the different ways you can contribute.
- Check
kubernetes/website
issues list for issues that make good entry points. - Open a pull request using GitHub to existing documentation and learn more about filing issues in GitHub.
- Review pull requests from other Kubernetes community members for accuracy and language.
- Read the Kubernetes content and style guides so you can leave informed comments.
- Learn about page content types and Hugo shortcodes.
Next steps
-
Learn to work from a local clone of the repository.
-
Document features in a release.
-
Participate in SIG Docs, and become a member or reviewer.
-
Start or help with a localization.
Get involved with SIG Docs
SIG Docs is the group of contributors who publish and maintain Kubernetes documentation and the website. Getting involved with SIG Docs is a great way for Kubernetes contributors (feature development or otherwise) to have a large impact on the Kubernetes project.
SIG Docs communicates with different methods:
- Join
#sig-docs
on the Kubernetes Slack instance. Make sure to introduce yourself! - Join the
kubernetes-sig-docs
mailing list, where broader discussions take place and official decisions are recorded. - Join the SIG Docs video meeting held every two weeks. Meetings are always announced on
#sig-docs
and added to the Kubernetes community meetings calendar. You'll need to download the Zoom client or dial in using a phone. - Join the SIG Docs async Slack standup meeting on those weeks when the in-person Zoom video meeting does not take place. Meetings are always announced on
#sig-docs
. You can contribute to any one of the threads up to 24 hours after meeting announcement.
Other ways to contribute
- Visit the Kubernetes community site. Participate on Twitter or Stack Overflow, learn about local Kubernetes meetups and events, and more.
- Read the contributor cheatsheet to get involved with Kubernetes feature development.
- Visit the contributor site to learn more about Kubernetes Contributors and additional contributor resources.
- Submit a blog post or case study.
7.1 - Suggesting content improvements
If you notice an issue with Kubernetes documentation or have an idea for new content, then open an issue. All you need is a GitHub account and a web browser.
In most cases, new work on Kubernetes documentation begins with an issue in GitHub. Kubernetes contributors then review, categorize and tag issues as needed. Next, you or another member of the Kubernetes community open a pull request with changes to resolve the issue.
Opening an issue
If you want to suggest improvements to existing content or notice an error, then open an issue.
- Click the Create an issue link on the right sidebar. This redirects you to a GitHub issue page pre-populated with some headers.
- Describe the issue or suggestion for improvement. Provide as many details as you can.
- Click Submit new issue.
After submitting, check in on your issue occasionally or turn on GitHub notifications. Reviewers and other community members might ask questions before they can take action on your issue.
Suggesting new content
If you have an idea for new content, but you aren't sure where it should go, you can still file an issue. Either:
- Choose an existing page in the section you think the content belongs in and click Create an issue.
- Go to GitHub and file the issue directly.
How to file great issues
Keep the following in mind when filing an issue:
- Provide a clear issue description. Describe what specifically is missing, out of date, wrong, or needs improvement.
- Explain the specific impact the issue has on users.
- Limit the scope of a given issue to a reasonable unit of work. For problems with a large scope, break them down into smaller issues. For example, "Fix the security docs" is too broad, but "Add details to the 'Restricting network access' topic" is specific enough to be actionable.
- Search the existing issues to see if there's anything related or similar to the new issue.
- If the new issue relates to another issue or pull request, refer to it
either by its full URL or by the issue or pull request number prefixed
with a
#
character. For example,Introduced by #987654
. - Follow the Code of Conduct. Respect your fellow contributors. For example, "The docs are terrible" is not helpful or polite feedback.
7.2 - Contributing new content
This section contains information you should know before contributing new content.
Figure - Contributing new content preparation
The figure above depicts the information you should know prior to submitting new content. The information details follow.
Contributing basics
- Write Kubernetes documentation in Markdown and build the Kubernetes site using Hugo.
- Kubernetes documentation uses CommonMark as its flavor of Markdown.
- The source is in GitHub. You can find
Kubernetes documentation at
/content/en/docs/
. Some of the reference documentation is automatically generated from scripts in theupdate-imported-docs/
directory. - Page content types describe the presentation of documentation content in Hugo.
- You can use Docsy shortcodes or custom Hugo shortcodes to contribute to Kubernetes documentation.
- In addition to the standard Hugo shortcodes, we use a number of custom Hugo shortcodes in our documentation to control the presentation of content.
- Documentation source is available in multiple languages in
/content/
. Each language has its own folder with a two-letter code determined by the ISO 639-1 standard . For example, English documentation source is stored in/content/en/docs/
. - For more information about contributing to documentation in multiple languages or starting a new translation, see localization.
Before you begin
Sign the CNCF CLA
All Kubernetes contributors must read the Contributor guide and sign the Contributor License Agreement (CLA) .
Pull requests from contributors who haven't signed the CLA fail the automated
tests. The name and email you provide must match those found in
your git config
, and your git name and email must match those used for the
CNCF CLA.
Choose which Git branch to use
When opening a pull request, you need to know in advance which branch to base your work on.
Scenario | Branch |
---|---|
Existing or new English language content for the current release | main |
Content for a feature change release | The branch which corresponds to the major and minor version the feature change is in, using the pattern dev-<version> . For example, if a feature changes in the v1.24 release, then add documentation changes to the dev-1.24 branch. |
Content in other languages (localizations) | Use the localization's convention. See the Localization branching strategy for more information. |
If you're still not sure which branch to choose, ask in #sig-docs
on Slack.
Languages per PR
Limit pull requests to one language per PR. If you need to make an identical change to the same code sample in multiple languages, open a separate PR for each language.
Tools for contributors
The doc contributors tools
directory in the kubernetes/website
repository contains tools to help your
contribution journey go more smoothly.
7.2.1 - Opening a pull request
To contribute new content pages or improve existing content pages, open a pull request (PR). Make sure you follow all the requirements in the Before you begin section.
If your change is small, or you're unfamiliar with git, read Changes using GitHub to learn how to edit a page.
If your changes are large, read Work from a local fork to learn how to make changes locally on your computer.
Changes using GitHub
If you're less experienced with git workflows, here's an easier method of opening a pull request. The figure below outlines the steps and the details follow.
Figure - Steps for opening a PR using GitHub
-
On the page where you see the issue, select the pencil icon at the top right. You can also scroll to the bottom of the page and select Edit this page.
-
Make your changes in the GitHub markdown editor.
-
Below the editor, fill in the Propose file change form. In the first field, give your commit message a title. In the second field, provide a description.
Note: Do not use any GitHub Keywords in your commit message. You can add those to the pull request description later.
-
Select Propose file change.
-
Select Create pull request.
-
The Open a pull request screen appears. Fill in the form:
- The Subject field of the pull request defaults to the commit summary. You can change it if needed.
- The Body contains your extended commit message, if you have one, and some template text. Add the details the template text asks for, then delete the extra template text.
- Leave the Allow edits from maintainers checkbox selected.
Note: PR descriptions are a great way to help reviewers understand your change. For more information, see Opening a PR.
-
Select Create pull request.
Addressing feedback in GitHub
Before merging a pull request, Kubernetes community members review and
approve it. The k8s-ci-robot
suggests reviewers based on the nearest
owner mentioned in the pages. If you have someone specific in mind,
leave a comment with their GitHub username in it.
If a reviewer asks you to make changes:
- Go to the Files changed tab.
- Select the pencil (edit) icon on any files changed by the pull request.
- Make the changes requested.
- Commit the changes.
If you are waiting on a reviewer, reach out once every 7 days. You can also post a message in the #sig-docs
Slack channel.
When your review is complete, a reviewer merges your PR and your changes go live a few minutes later.
Work from a local fork
If you're more experienced with git, or if your changes are larger than a few lines, work from a local fork.
Make sure you have git installed on your computer. You can also use a git UI application.
The figure below shows the steps to follow when you work from a local fork. The details for each step follow.
Figure - Working from a local fork to make your changes
Fork the kubernetes/website repository
- Navigate to the
kubernetes/website
repository. - Select Fork.
Create a local clone and set the upstream
-
In a terminal window, clone your fork and update the Docsy Hugo theme:
git clone git@github.com:<github_username>/website
cd website
git submodule update --init --recursive --depth 1
-
Navigate to the new
website
directory. Set thekubernetes/website
repository as theupstream
remote:
cd website
git remote add upstream https://github.com/kubernetes/website.git
-
Confirm your
origin
andupstream
repositories:git remote -v
Output is similar to:
origin	git@github.com:<github_username>/website.git (fetch)
origin	git@github.com:<github_username>/website.git (push)
upstream	https://github.com/kubernetes/website.git (fetch)
upstream	https://github.com/kubernetes/website.git (push)
-
Fetch commits from your fork's
origin/main
andkubernetes/website
'supstream/main
:
git fetch origin
git fetch upstream
This makes sure your local repository is up to date before you start making changes.
Note: This workflow is different than the Kubernetes Community GitHub Workflow. You do not need to merge your local copy of main with upstream/main before pushing updates to your fork.
Create a branch
- Decide which branch to base your work on:
-
For improvements to existing content, use
upstream/main
. -
For new content about existing features, use
upstream/main
. -
For localized content, use the localization's conventions. For more information, see localizing Kubernetes documentation.
-
For new features in an upcoming Kubernetes release, use the feature branch. For more information, see documenting for a release.
-
For long-running efforts that multiple SIG Docs contributors collaborate on, like content reorganization, use a specific feature branch created for that effort.
If you need help choosing a branch, ask in the
#sig-docs
Slack channel.
-
Create a new branch based on the branch identified in step 1. This example assumes the base branch is
upstream/main
:git checkout -b <my_new_branch> upstream/main
-
Make your changes using a text editor.
At any time, use the git status
command to see what files you've changed.
Commit your changes
When you are ready to submit a pull request, commit your changes.
-
In your local repository, check which files you need to commit:
git status
Output is similar to:
On branch <my_new_branch>
Your branch is up to date with 'origin/<my_new_branch>'.

Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git checkout -- <file>..." to discard changes in working directory)

modified: content/en/docs/contribute/new-content/contributing-content.md

no changes added to commit (use "git add" and/or "git commit -a")
-
Add the files listed under Changes not staged for commit to the commit:
git add <your_file_name>
Repeat this for each file.
-
After adding all the files, create a commit:
git commit -m "Your commit message"
Note: Do not use any GitHub Keywords in your commit message. You can add those to the pull request description later. -
Push your local branch and its new commit to your remote fork:
git push origin <my_new_branch>
Preview your changes locally
It's a good idea to preview your changes locally before pushing them or opening a pull request. A preview lets you catch build errors or markdown formatting problems.
You can either build the website's container image or run Hugo locally. Building the container image is slower but displays Hugo shortcodes, which can be useful for debugging.
Note: The commands below use Docker as the default container engine. Set the CONTAINER_ENGINE environment variable to override this behaviour.
-
Build the image locally:
# Use docker (default)
make container-image

### OR ###

# Use podman
CONTAINER_ENGINE=podman make container-image
-
After building the
kubernetes-hugo
image locally, build and serve the site:
# Use docker (default)
make container-serve

### OR ###

# Use podman
CONTAINER_ENGINE=podman make container-serve
-
In a web browser, navigate to
https://localhost:1313
. Hugo watches the changes and rebuilds the site as needed. -
To stop the local Hugo instance, go back to the terminal and type
Ctrl+C
, or close the terminal window.
Alternately, install and use the hugo
command on your computer:
-
Install the Hugo version specified in
website/netlify.toml
. -
If you have not updated your website repository, the
website/themes/docsy
directory is empty. The site cannot build without a local copy of the theme. To update the website theme, run:
git submodule update --init --recursive --depth 1
-
In a terminal, go to your Kubernetes website repository and start the Hugo server:
cd <path_to_your_repo>/website
hugo server --buildFuture
-
In a web browser, navigate to
https://localhost:1313
. Hugo watches the changes and rebuilds the site as needed. -
To stop the local Hugo instance, go back to the terminal and type
Ctrl+C
, or close the terminal window.
Open a pull request from your fork to kubernetes/website
The figure below shows the steps to open a PR from your fork to the K8s/website. The details follow.
Figure - Steps to open a PR from your fork to the K8s/website
-
In a web browser, go to the
kubernetes/website
repository. -
Select New Pull Request.
-
Select compare across forks.
-
From the head repository drop-down menu, select your fork.
-
From the compare drop-down menu, select your branch.
-
Select Create Pull Request.
-
Add a description for your pull request:
- Title (50 characters or less): Summarize the intent of the change.
- Description: Describe the change in more detail.
- If there is a related GitHub issue, include
Fixes #12345
orCloses #12345
in the description. GitHub's automation closes the mentioned issue after merging the PR if used. If there are other related PRs, link those as well. - If you want advice on something specific, include any questions you'd like reviewers to think about in your description.
- If there is a related GitHub issue, include
-
Select the Create pull request button.
Congratulations! Your pull request is available in Pull requests.
After opening a PR, GitHub runs automated tests and tries to deploy a preview using Netlify.
- If the Netlify build fails, select Details for more information.
- If the Netlify build succeeds, selecting Details opens a staged version of the Kubernetes website with your changes applied. This is how reviewers check your changes.
GitHub also automatically assigns labels to a PR, to help reviewers. You can add them too, if needed. For more information, see Adding and removing issue labels.
Addressing feedback locally
-
After making your changes, amend your previous commit:
git commit -a --amend
-a
: commits all changes--amend
: amends the previous commit, rather than creating a new one
-
Update your commit message if needed.
-
Use
git push origin <my_new_branch>
to push your changes and re-run the Netlify tests.
Note: If you use git commit -m
instead of amending, you must squash your commits before merging.
Changes from reviewers
Sometimes reviewers commit to your pull request. Before making any other changes, fetch those commits.
-
Fetch commits from your remote fork and rebase your working branch:
git fetch origin
git rebase origin/<your-branch-name>
-
After rebasing, force-push new changes to your fork:
git push --force-with-lease origin <your-branch-name>
Merge conflicts and rebasing
Note: If you run into trouble resolving merge conflicts or rebasing, ask in the #sig-docs Slack channel for help.
If another contributor commits changes to the same file in another PR, it can create a merge conflict. You must resolve all merge conflicts in your PR.
-
Update your fork and rebase your local branch:
git fetch origin
git rebase origin/<your-branch-name>
Then force-push the changes to your fork:
git push --force-with-lease origin <your-branch-name>
-
Fetch changes from
kubernetes/website
'supstream/main
and rebase your branch:
git fetch upstream
git rebase upstream/main
-
Inspect the results of the rebase:
git status
This results in a number of files marked as conflicted.
-
Open each conflicted file and look for the conflict markers:
>>>, <<<, and ===. Resolve the conflict and delete the conflict markers.
Note: For more information, see How conflicts are presented.
-
Add the files to the changeset:
git add <filename>
-
Continue the rebase:
git rebase --continue
-
Repeat steps 2 to 5 as needed.
After applying all commits, the
git status
command shows that the rebase is complete. -
Force-push the branch to your fork:
git push --force-with-lease origin <your-branch-name>
The pull request no longer shows any conflicts.
Squashing commits
Note: If you are not sure how to squash your commits, ask in the #sig-docs Slack channel for help.
If your PR has multiple commits, you must squash them into a single commit before merging your PR. You can check the number of commits on your PR's Commits tab or by running the git log
command locally.
Note: The following steps assume you use vim as the command line text editor.
-
Start an interactive rebase:
git rebase -i HEAD~<number_of_commits_in_branch>
Squashing commits is a form of rebasing. The -i switch tells git you want to rebase interactively. HEAD~<number_of_commits_in_branch> indicates how many commits to look at for the rebase.
Output is similar to:
pick d875112ca Original commit
pick 4fa167b80 Address feedback 1
pick 7d54e15ee Address feedback 2

# Rebase 3d18sf680..7d54e15ee onto 3d183f680 (3 commands)

...

# These lines can be re-ordered; they are executed from top to bottom.
The first section of the output lists the commits in the rebase. The second section lists the options for each commit. Changing the word
pick
changes the status of the commit once the rebase is complete. For the purposes of rebasing, focus on squash and pick.
Note: For more information, see Interactive Mode.
-
Start editing the file.
Change the original text:
pick d875112ca Original commit
pick 4fa167b80 Address feedback 1
pick 7d54e15ee Address feedback 2
To:
pick d875112ca Original commit
squash 4fa167b80 Address feedback 1
squash 7d54e15ee Address feedback 2
This squashes commits
4fa167b80 Address feedback 1
and7d54e15ee Address feedback 2
intod875112ca Original commit
, leaving onlyd875112ca Original commit
as a part of the timeline. -
Save and exit your file.
-
Push your squashed commit:
git push --force-with-lease origin <branch_name>
Contribute to other repos
The Kubernetes project contains 50+ repositories. Many of these repositories contain documentation: user-facing help text, error messages, API references or code comments.
If you see text you'd like to improve, use GitHub to search all repositories in the Kubernetes organization. This can help you figure out where to submit your issue or PR.
Each repository has its own processes and procedures. Before you file an
issue or submit a PR, read that repository's README.md
, CONTRIBUTING.md
, and
code-of-conduct.md
, if they exist.
Most repositories use issue and PR templates. Have a look through some open issues and PRs to get a feel for that team's processes. Make sure to fill out the templates with as much detail as possible when you file issues or PRs.
What's next
- Read Reviewing to learn more about the review process.
7.2.2 - Documenting a feature for a release
Each major Kubernetes release introduces new features that require documentation. New releases also bring updates to existing features and documentation (such as upgrading a feature from alpha to beta).
Generally, the SIG responsible for a feature submits draft documentation of the
feature as a pull request to the appropriate development branch of the
kubernetes/website
repository, and someone on the SIG Docs team provides
editorial feedback or edits the draft directly. This section covers the branching
conventions and process used during a release by both groups.
For documentation contributors
In general, documentation contributors don't write content from scratch for a release. Instead, they work with the SIG creating a new feature to refine the draft documentation and make it release ready.
After you've chosen a feature to document or assist, ask about it in the #sig-docs
Slack channel, in a weekly SIG Docs meeting, or directly on the PR filed by the
feature SIG. If you're given the go-ahead, you can edit into the PR using one of
the techniques described in
Commit into another person's PR.
Find out about upcoming features
To find out about upcoming features, attend the weekly SIG Release meeting (see the community page for upcoming meetings) and monitor the release-specific documentation in the kubernetes/sig-release repository. Each release has a sub-directory in the /sig-release/tree/master/releases/ directory. The sub-directory contains a release schedule, a draft of the release notes, and a document listing each person on the release team.
The release schedule contains links to all other documents, meetings, meeting minutes, and milestones relating to the release. It also contains information about the goals and timeline of the release, and any special processes in place for this release. Near the bottom of the document, several release-related terms are defined.
This document also contains a link to the Feature tracking sheet, which is the official way to find out about all new features scheduled to go into the release.
The release team document lists who is responsible for each release role. If it's not clear who to talk to about a specific feature or question you have, either attend the release meeting to ask your question, or contact the release lead so that they can redirect you.
The release notes draft is a good place to find out about specific features, changes, deprecations, and more about the release. The content is not finalized until late in the release cycle, so use caution.
Feature tracking sheet
The feature tracking sheet for a given Kubernetes release lists each feature that is planned for a release. Each line item includes the name of the feature, a link to the feature's main GitHub issue, its stability level (Alpha, Beta, or Stable), the SIG and individual responsible for implementing it, whether it needs docs, a draft release note for the feature, and whether it has been merged. Keep the following in mind:
- Beta and Stable features are generally a higher documentation priority than Alpha features.
- It's hard to test (and therefore to document) a feature that hasn't been merged, or is at least considered feature-complete in its PR.
- Determining whether a feature needs documentation is a manual process. Even if a feature is not marked as needing docs, you may need to document the feature.
For developers or other SIG members
This section is information for members of other Kubernetes SIGs documenting new features for a release.
If you are a member of a SIG developing a new feature for Kubernetes, you need
to work with SIG Docs to be sure your feature is documented in time for the
release. Check the
feature tracking spreadsheet
or check in the #sig-release
Kubernetes Slack channel to verify scheduling details and
deadlines.
Open a placeholder PR
- Open a draft pull request against the
dev-1.24
branch in thekubernetes/website
repository, with a small commit that you will amend later. To create a draft pull request, use the Create Pull Request drop-down and select Create Draft Pull Request, then click Draft Pull Request. - Edit the pull request description to include links to kubernetes/kubernetes PR(s) and kubernetes/enhancements issue(s).
- Leave a comment on the related kubernetes/enhancements issue with a link to the PR to notify the docs person managing this release that the feature docs are coming and should be tracked for the release.
If your feature does not need
any documentation changes, make sure the sig-release team knows this, by
mentioning it in the #sig-release
Slack channel. If the feature does need
documentation but the PR is not created, the feature may be removed from the
milestone.
PR ready for review
When ready, populate your placeholder PR with feature documentation and change the state of the PR from draft to ready for review. To mark a pull request as ready for review, navigate to the merge box and click Ready for review.
Do your best to describe your feature and how to use it. If you need help structuring your documentation, ask in the #sig-docs
slack channel.
When you complete your content, the documentation person assigned to your feature reviews it. To ensure technical accuracy, the content may also require a technical review from corresponding SIG(s). Use their suggestions to get the content to a release ready state.
If your feature is an Alpha or Beta feature and is behind a feature gate, make sure you add it to Alpha/Beta Feature gates table as part of your pull request. With new feature gates, a description of the feature gate is also required. If your feature is GA'ed or deprecated, make sure to move it from that table to Feature gates for graduated or deprecated features table with Alpha and Beta history intact.
If your feature needs documentation and the first draft content is not received, the feature may be removed from the milestone.
All PRs reviewed and ready to merge
If your PR has not yet been merged into the dev-1.24
branch by the release deadline, work with the
docs person managing the release to get it in by the deadline. If your feature needs
documentation and the docs are not ready, the feature may be removed from the
milestone.
7.2.3 - Submitting blog posts and case studies
Anyone can write a blog post and submit it for review. Case studies require extensive review before they're approved.
The Kubernetes Blog
The Kubernetes blog is used by the project to communicate new features, community reports, and any news that might be relevant to the Kubernetes community. This includes end users and developers. Most of the blog's content is about things happening in the core project, but we encourage you to submit about things happening elsewhere in the ecosystem too!
Anyone can write a blog post and submit it for review.
Submit a Post
Blog posts should not be commercial in nature and should consist of original content that applies broadly to the Kubernetes community. Appropriate blog content includes:
- New Kubernetes capabilities
- Kubernetes projects updates
- Updates from Special Interest Groups
- Tutorials and walkthroughs
- Thought leadership around Kubernetes
- Kubernetes Partner OSS integration
- Original content only
Unsuitable content includes:
- Vendor product pitches
- Partner updates without an integration and customer story
- Syndicated posts (language translations ok)
To submit a blog post, follow these steps:
- Sign the CLA if you have not yet done so.
- Have a look at the Markdown format for existing blog posts in the website repository.
- Write out your blog post in a text editor of your choice.
- On the same link from step 2, click the Create new file button. Paste your content into the editor. Name the file to match the proposed title of the blog post, but don’t put the date in the file name. The blog reviewers will work with you on the final file name and the date the blog will be published.
- When you save the file, GitHub will walk you through the pull request process.
- A blog post reviewer will review your submission and work with you on feedback and final details. When the blog post is approved, the blog will be scheduled for publication.
Guidelines and expectations
- Blog posts should not be vendor pitches.
- Articles must contain content that applies broadly to the Kubernetes community. For example, a submission should focus on upstream Kubernetes as opposed to vendor-specific configurations. Check the Documentation style guide for what is typically allowed on Kubernetes properties.
- Links should primarily be to the official Kubernetes documentation. When using external references, links should be diverse - For example a submission shouldn't contain only links back to a single company's blog.
- Sometimes this is a delicate balance. The blog team is there to give guidance on whether a post is appropriate for the Kubernetes blog, so don't hesitate to reach out.
- Blog posts are not published on specific dates.
- Articles are reviewed by community volunteers. We'll try our best to accommodate specific timing, but we make no guarantees.
- Many core parts of the Kubernetes projects submit blog posts during release windows, delaying publication times. Consider submitting during a quieter period of the release cycle.
- If you are looking for greater coordination on post release dates, coordinating with CNCF marketing is a more appropriate choice than submitting a blog post.
- Sometimes reviews can get backed up. If you feel your review isn't getting the attention it needs, you can reach out to the blog team via this slack channel to ask in real time.
- Blog posts should be relevant to Kubernetes users.
- Topics related to participation in or results of Kubernetes SIGs activities are always on topic (see the work in the Upstream Marketing Team for support on these posts).
- The components of Kubernetes are purposely modular, so tools that use existing integration points like CNI and CSI are on topic.
- Posts about other CNCF projects may or may not be on topic. We recommend asking the blog team before submitting a draft.
- Many CNCF projects have their own blog, and these are often a better choice for posts. However, there are times when a major feature or milestone for a CNCF project is something users would be interested in reading about on the Kubernetes blog.
- Blog posts about contributing to the Kubernetes project should go on the Kubernetes Contributors site.
- Blog posts should be original content
- The official blog is not for repurposing existing content from a third party as new content.
- The license for the blog allows commercial use of the content, but not the other way around.
- Blog posts should aim to be future proof
- Given the development velocity of the project, we want evergreen content that won't require updates to stay accurate for the reader.
- It can be a better choice to add a tutorial or update official documentation than to write a high level overview as a blog post.
- Consider putting the long technical content behind a call to action in the blog post, and focus the post itself on the problem space or why readers should care.
Technical Considerations for submitting a blog post
Submissions need to be in Markdown format to be used by the Hugo generator for the blog. There are many resources available on how to use this technology stack.
We recognize that this requirement makes the process more difficult for less-familiar folks to submit, and we're constantly looking at solutions to lower this bar. If you have ideas on how to lower the barrier, please volunteer to help out.
The SIG Docs blog subproject manages the review process for blog posts. For more information, see Submit a post.
To submit a blog post follow these directions:

- Open a pull request with a new blog post. New blog posts go under the `content/en/blog/_posts` directory.

- Ensure that your blog post follows the correct naming conventions and the following frontmatter (metadata) information:

  - The Markdown file name must follow the format `YYYY-MM-DD-Your-Title-Here.md`. For example, `2020-02-07-Deploying-External-OpenStack-Cloud-Provider-With-Kubeadm.md`.
  - Do not include dots in the filename. A name like `2020-01-01-whats-new-in-1.19.md` causes failures during a build.
  - The front matter must include the following:

    ```yaml
    ---
    layout: blog
    title: "Your Title Here"
    date: YYYY-MM-DD
    slug: text-for-URL-link-here-no-spaces
    ---
    ```

- The first or initial commit message should be a short summary of the work being done and should stand alone as a description of the blog post. Please note that subsequent edits to your blog will be squashed into this main commit, so it should be as useful as possible.

  - Examples of a good commit message:
    - Add blog post on the foo kubernetes feature
    - blog: foobar announcement
  - Examples of a bad commit message:
    - Add blog post
    - .
    - initial commit
    - draft post

- The blog team will then review your PR and give you comments on things you might need to fix. After that the bot will merge your PR and your blog post will be published.

- If the blog post contains only content that is not expected to require updates to stay accurate for the reader, it can be marked as evergreen and exempted from the automatic warning about outdated content added to blog posts older than one year.

  To mark a blog post as evergreen, add this to the front matter:

  ```yaml
  evergreen: true
  ```

  Examples of content that should not be marked evergreen:

  - Tutorials that only apply to specific releases or versions and not all future versions
  - References to pre-GA APIs or features
Submit a case study
Case studies highlight how organizations are using Kubernetes to solve real-world problems. The Kubernetes marketing team and members of the CNCF collaborate with you on all case studies.
Have a look at the source for the existing case studies.
Refer to the case study guidelines and submit your request as outlined in the guidelines.
7.3 - Reviewing changes
This section describes how to review content.
7.3.1 - Reviewing pull requests
Anyone can review a documentation pull request. Visit the pull requests section in the Kubernetes website repository to see open pull requests.
Reviewing documentation pull requests is a great way to introduce yourself to the Kubernetes community. It helps you learn the code base and build trust with other contributors.
Before reviewing, it's a good idea to:
- Read the content guide and style guide so you can leave informed comments.
- Understand the different roles and responsibilities in the Kubernetes documentation community.
Before you begin
Before you start a review:
- Read the CNCF Code of Conduct and ensure that you abide by it at all times.
- Be polite, considerate, and helpful.
- Comment on positive aspects of PRs as well as changes.
- Be empathetic and mindful of how your review may be received.
- Assume good intent and ask clarifying questions.
- Experienced contributors, consider pairing with new contributors whose work requires extensive changes.
Review process
In general, review pull requests for content and style in English. Figure 1 outlines the steps for the review process. The details for each step follow.
Figure 1. Review process steps.
- Go to https://github.com/kubernetes/website/pulls. You see a list of every open pull request against the Kubernetes website and docs.

- Filter the open PRs using one or all of the following labels:

  - `cncf-cla: yes` (Recommended): PRs submitted by contributors who have not signed the CLA cannot be merged. See Sign the CLA for more information.
  - `language/en` (Recommended): Filters for English language PRs only.
  - `size/<size>`: Filters for PRs of a certain size. If you're new, start with smaller PRs.

  (An example combined filter appears after these steps.)

  Additionally, ensure the PR isn't marked as a work in progress. PRs using the `work in progress` label are not ready for review yet.
- Once you've selected a PR to review, understand the change by:

  - Reading the PR description to understand the changes made, and reading any linked issues
  - Reading any comments by other reviewers
  - Clicking the Files changed tab to see the files and lines changed
  - Previewing the changes in the Netlify preview build by scrolling to the PR's build check section at the bottom of the Conversation tab. To open the preview, click on the Details link of the deploy/netlify line in the list of checks. (If you're reviewing on a tablet or smartphone, the GitHub web UI is slightly different.)

- Go to the Files changed tab to start your review.

  - Click on the `+` symbol beside the line you want to comment on.
  - Fill in any comments you have about the line and click either Add single comment (if you have only one comment to make) or Start a review (if you have multiple comments to make).
  - When finished, click Review changes at the top of the page. Here, you can add a summary of your review (and leave some positive comments for the contributor!), approve the PR, comment, or request changes as needed. New contributors should always choose Comment.
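For example, you can combine the recommended labels in the search box above the PR list (the size label shown is illustrative; adjust it as you work through PRs):

```
is:pr is:open label:"cncf-cla: yes" label:language/en label:size/XS
```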
Reviewing checklist
When reviewing, use the following as a starting point.
Language and grammar
- Are there any obvious errors in language or grammar? Is there a better way to phrase something?
- Are there any complicated or archaic words which could be replaced with a simpler word?
- Are there any words, terms or phrases in use which could be replaced with a non-discriminatory alternative?
- Does the word choice and its capitalization follow the style guide?
- Are there long sentences which could be shorter or less complex?
- Are there any long paragraphs which might work better as a list or table?
Content
- Does similar content exist elsewhere on the Kubernetes site?
- Does the content excessively link to off-site, individual vendor or non-open source documentation?
Website
- Did this PR change or remove a page title, slug/alias or anchor link? If so, are there broken links as a result of this PR? Is there another option, like changing the page title without changing the slug?
- Does the PR introduce a new page? If so:
- Is the page using the right page content type and associated Hugo shortcodes?
- Does the page appear correctly in the section's side navigation (or at all)?
- Should the page appear on the Docs Home listing?
- Do the changes show up in the Netlify preview? Be particularly vigilant about lists, code blocks, tables, notes and images.
Other
For small issues with a PR, like typos or whitespace, prefix your comments with `nit:`. This lets the author know the issue is non-critical.
7.3.2 - Reviewing for approvers and reviewers
SIG Docs Reviewers and Approvers do a few extra things when reviewing a change.
Every week a specific docs approver volunteers to triage and review pull requests. This person is the "PR Wrangler" for the week. See the PR Wrangler scheduler for more information. To become a PR Wrangler, attend the weekly SIG Docs meeting and volunteer. Even if you are not on the schedule for the current week, you can still review pull requests (PRs) that are not already under active review.
In addition to the rotation, a bot assigns reviewers and approvers for the PR based on the owners for the affected files.
Reviewing a PR
Kubernetes documentation follows the Kubernetes code review process.
Everything described in Reviewing a pull request applies, but Reviewers and Approvers should also do the following:
- Use the `/assign` Prow command to assign a specific reviewer to a PR as needed. This is extra important when it comes to requesting technical review from code contributors.

  Note: Look at the `reviewers` field in the front-matter at the top of a Markdown file to see who can provide technical review.

- Make sure the PR follows the Content and Style guides; link the author to the relevant part of the guide(s) if it doesn't.

- Use the GitHub Request Changes option when applicable to suggest changes to the PR author.

- Change your review status in GitHub using the `/approve` or `/lgtm` Prow commands, if your suggestions are implemented.
Commit into another person's PR
Leaving PR comments is helpful, but there might be times when you need to commit into another person's PR instead.
Do not "take over" for another person unless they explicitly ask you to, or you want to resurrect a long-abandoned PR. While it may be faster in the short term, it deprives the person of the chance to contribute.
The process you use depends on whether you need to edit a file that is already in the scope of the PR, or a file that the PR has not yet touched.
You can't commit into someone else's PR if either of the following things is true:

- The PR author pushed their branch directly to the https://github.com/kubernetes/website/ repository. Only a reviewer with push access can commit to another user's PR.

  Note: Encourage the author to push their branch to their fork before opening the PR next time.

- The PR author explicitly disallows edits from approvers.
Prow commands for reviewing
Prow is
the Kubernetes-based CI/CD system that runs jobs against pull requests (PRs). Prow
enables chatbot-style commands to handle GitHub actions across the Kubernetes
organization, like adding and removing labels, closing issues, and assigning an approver. Enter Prow commands as GitHub comments using the /<command-name>
format.
The most common Prow commands reviewers and approvers use are:

Prow Command | Role Restrictions | Description |
---|---|---|
`/lgtm` | Organization members | Signals that you've finished reviewing a PR and are satisfied with the changes. |
`/approve` | Approvers | Approves a PR for merging. |
`/assign` | Reviewers or Approvers | Assigns a person to review or approve a PR. |
`/close` | Reviewers or Approvers | Closes an issue or PR. |
`/hold` | Anyone | Adds the `do-not-merge/hold` label, indicating the PR cannot be automatically merged. |
`/hold cancel` | Anyone | Removes the `do-not-merge/hold` label. |
See the Prow command reference for the full list of commands you can use in a PR.
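For example, a reviewer who is satisfied with a change and wants to hand it to a specific approver might leave a single GitHub comment containing (the username is illustrative):

```
/lgtm
/assign @example-approver
```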
Triage and categorize issues
In general, SIG Docs follows the Kubernetes issue triage process and uses the same labels.
This GitHub Issue filter finds issues that might need triage.
Triaging an issue
- Validate the issue
  - Make sure the issue is about website documentation. Some issues can be closed quickly by answering a question or pointing the reporter to a resource. See the Support requests or code bug reports section for details.
  - Assess whether the issue has merit.
  - Add the `triage/needs-information` label if the issue doesn't have enough detail to be actionable or the template is not filled out adequately.
  - Close the issue if it has both the `lifecycle/stale` and `triage/needs-information` labels.
- Add a priority label (the Issue Triage Guidelines define priority labels in detail)

Label | Description |
---|---|
`priority/critical-urgent` | Do this right now. |
`priority/important-soon` | Do this within 3 months. |
`priority/important-longterm` | Do this within 6 months. |
`priority/backlog` | Deferrable indefinitely. Do when resources are available. |
`priority/awaiting-more-evidence` | Placeholder for a potentially good issue so it doesn't get lost. |
`help` or `good first issue` | Suitable for someone with very little Kubernetes or SIG Docs experience. See Help Wanted and Good First Issue Labels for more information. |
At your discretion, take ownership of an issue and submit a PR for it (especially if it's quick or relates to work you're already doing).
If you have questions about triaging an issue, ask in `#sig-docs` on Slack or the kubernetes-sig-docs mailing list.
Adding and removing issue labels
To add a label, leave a comment in one of the following formats:

- `/<label-to-add>` (for example, `/good-first-issue`)
- `/<label-category> <label-to-add>` (for example, `/triage needs-information` or `/language ja`)

To remove a label, leave a comment in one of the following formats:

- `/remove-<label-to-remove>` (for example, `/remove-help`)
- `/remove-<label-category> <label-to-remove>` (for example, `/remove-triage needs-information`)

In both cases, the label must already exist. If you try to add a label that does not exist, the command is silently ignored.

For a list of all labels, see the website repository's Labels section. Not all labels are used by SIG Docs.
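If you use the GitHub CLI, one way to browse the available labels from a terminal is the `gh label list` command (a convenience sketch, assuming a recent `gh` release; the web UI works just as well):

```shell
gh label list --repo kubernetes/website --limit 200
```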
Issue lifecycle labels
Issues are generally opened and closed quickly. However, sometimes an issue is inactive after it's opened. Other times, an issue may need to remain open for longer than 90 days.
Label | Description |
---|---|
`lifecycle/stale` | After 90 days with no activity, an issue is automatically labeled as stale. The issue will be automatically closed if the lifecycle is not manually reverted using the `/remove-lifecycle stale` command. |
`lifecycle/frozen` | An issue with this label will not become stale after 90 days of inactivity. A user manually adds this label to issues that need to remain open for much longer than 90 days, such as those with a `priority/important-longterm` label. |
Handling special issue types
SIG Docs encounters the following types of issues often enough to document how to handle them.
Duplicate issues
If a single problem has one or more issues open for it, combine them into a single issue.
You should decide which issue to keep open (or open a new issue), then move over all relevant information and link related issues. Finally, label all other issues that describe the same problem with `triage/duplicate` and close them. Only having a single issue to work on reduces confusion and avoids duplicate work on the same problem.
Dead link issues
If the dead link issue is in the API or `kubectl` documentation, assign it `/priority critical-urgent` until the problem is fully understood. Assign all other dead link issues `/priority important-longterm`, as they must be manually fixed.
Blog issues
We expect Kubernetes Blog entries to become outdated over time. Therefore, we only maintain blog entries less than a year old. If an issue is related to a blog entry that is more than one year old, close the issue without fixing.
Support requests or code bug reports
Some docs issues are actually issues with the underlying code, or requests for assistance when something, for example a tutorial, doesn't work. For issues unrelated to docs, close the issue with the `kind/support` label and a comment directing the requester to support venues (Slack, Stack Overflow) and, if relevant, the repository to file an issue for bugs with features (`kubernetes/kubernetes` is a great place to start).
Sample response to a request for support:
```
This issue sounds more like a request for support and less
like an issue specifically for docs. I encourage you to bring
your question to the `#kubernetes-users` channel in
[Kubernetes slack](https://slack.k8s.io/). You can also search
resources like
[Stack Overflow](https://stackoverflow.com/questions/tagged/kubernetes)
for answers to similar questions.

You can also open issues for Kubernetes functionality in
https://github.com/kubernetes/kubernetes.

If this is a documentation issue, please re-open this issue.
```
Sample code bug report response:
```
This sounds more like an issue with the code than an issue with
the documentation. Please open an issue at
https://github.com/kubernetes/kubernetes/issues.

If this is a documentation issue, please re-open this issue.
```
7.4 - Localizing Kubernetes documentation
This page shows you how to localize the docs for a different language.
Contribute to an existing localization
You can help add or improve content to an existing localization. In Kubernetes Slack you'll find a channel for each localization. There is also a general SIG Docs Localizations Slack channel where you can say hello.
Find your two-letter language code
First, consult the ISO 639-1 standard to find your localization's two-letter language code. For example, the two-letter code for Korean is `ko`.
Fork and clone the repo
First, create your own fork of the kubernetes/website repository.
Then, clone your fork and `cd` into it:

```shell
git clone https://github.com/<username>/website
cd website
```
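If you plan to contribute regularly, it can also help to track the upstream repository so you can keep your fork current (a minimal sketch; the remote name `upstream` is a common convention, not a requirement):

```shell
git remote add upstream https://github.com/kubernetes/website.git
git fetch upstream
```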
The website content directory includes sub-directories for each language. The localization you want to help out with is inside `content/<two-letter-code>`.
Suggest changes
Create or update your chosen localized page based on the English original. See translating content for more details.
If you notice a technical inaccuracy or other problem with the upstream (English) documentation, you should fix the upstream documentation first and then repeat the equivalent fix by updating the localization you're working on.
Please limit pull requests to a single localization, since pull requests that change content in multiple localizations could be difficult to review.
Follow Suggesting Content Improvements to propose changes to that localization. The process is very similar to proposing changes to the upstream (English) content.
Start a new localization
If you want the Kubernetes documentation localized into a new language, here's what you need to do.
Because contributors can't approve their own pull requests, you need at least two contributors to begin a localization.
All localization teams must be self-sustaining. The Kubernetes website is happy to host your work, but it's up to you to translate it and keep existing localized content current.
You'll need to know the two-letter language code for your language. Consult the ISO 639-1 standard to find your localization's two-letter language code. For example, the two-letter code for Korean is `ko`.
When you start a new localization, you must localize all the minimum required content before the Kubernetes project can publish your changes to the live website.
SIG Docs can help you work on a separate branch so that you can incrementally work towards that goal.
Find community
Let Kubernetes SIG Docs know you're interested in creating a localization! Join the SIG Docs Slack channel and the SIG Docs Localizations Slack channel. Other localization teams are happy to help you get started and answer any questions you have.
Please also consider participating in the SIG Docs Localization Subgroup meeting. The mission of the SIG Docs localization subgroup is to work across the SIG Docs localization teams to collaborate on defining and documenting the processes for creating localized contribution guides. In addition, the SIG Docs localization subgroup will look for opportunities for the creation and sharing of common tools across localization teams and also serve to identify new requirements to the SIG Docs Leadership team. If you have questions about this meeting, please inquire on the SIG Docs Localizations Slack channel.
You can also create a Slack channel for your localization in the `kubernetes/community` repository. For an example of adding a Slack channel, see the PR for adding a channel for Persian.
Join the Kubernetes GitHub organization
Once you've opened a localization PR, you can become a member of the Kubernetes GitHub organization. Each person on the team needs to create their own Organization Membership Request in the `kubernetes/org` repository.
Add your localization team in GitHub
Next, add your Kubernetes localization team to `sig-docs/teams.yaml`. For an example of adding a localization team, see the PR to add the Spanish localization team.

Members of `@kubernetes/sig-docs-**-owners` can approve PRs that change content within (and only within) your localization directory: `/content/**/`.

For each localization, the `@kubernetes/sig-docs-**-reviews` team automates review assignment for new PRs.

Members of `@kubernetes/website-maintainers` can create new localization branches to coordinate translation efforts.

Members of `@kubernetes/website-milestone-maintainers` can use the `/milestone` Prow command to assign a milestone to issues or PRs.
Configure the workflow
Next, add a GitHub label for your localization in the `kubernetes/test-infra` repository. A label lets you filter issues and pull requests for your specific language.
For an example of adding a label, see the PR for adding the Italian language label.
Modify the site configuration
The Kubernetes website uses Hugo as its web framework. The website's Hugo configuration resides in the `config.toml` file. To support a new localization, you'll need to modify `config.toml`.

Add a configuration block for the new language to `config.toml`, under the existing `[languages]` block. The German block, for example, looks like:
```toml
[languages.de]
title = "Kubernetes"
description = "Produktionsreife Container-Verwaltung"
languageName = "Deutsch"
contentDir = "content/de"
weight = 3
```
When assigning a `weight` parameter for your block, find the language block with the highest weight and add 1 to that value.
For more information about Hugo's multilingual support, see "Multilingual Mode".
Add a new localization directory
Add a language-specific subdirectory to the `content` folder in the repository. For example, the two-letter code for German is `de`:

```shell
mkdir content/de
```
You also need to create a directory inside `data/i18n` for localized strings; look at existing localizations for an example. To use these new strings, you must also create a symbolic link from `i18n/<localization>.toml` to the actual string configuration in `data/i18n/<localization>/<localization>.toml` (remember to commit the symbolic link).

For example, for German the strings live in `data/i18n/de/de.toml`, and `i18n/de.toml` is a symbolic link to `data/i18n/de/de.toml`.
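A minimal sketch of creating and committing that symbolic link for German, run from the repository root (the strings file `data/i18n/de/de.toml` must exist first; see Site strings in i18n below):

```shell
# the link target is resolved relative to the i18n/ directory
ln -s ../data/i18n/de/de.toml i18n/de.toml
git add i18n/de.toml   # remember to commit the symbolic link itself
```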
Localize the community code of conduct
Open a PR against the `cncf/foundation` repository to add the code of conduct in your language.
Setting up the OWNERS files
To set the roles of each user contributing to the localization, create an `OWNERS` file inside the language-specific subdirectory with:

- reviewers: A list of Kubernetes teams with reviewer roles, in this case, the `sig-docs-**-reviews` team created in Add your localization team in GitHub.
- approvers: A list of Kubernetes teams with approver roles, in this case, the `sig-docs-**-owners` team created in Add your localization team in GitHub.
- labels: A list of GitHub labels to automatically apply to a PR, in this case, the language label created in Configure the workflow.
More information about the `OWNERS` file can be found at go.k8s.io/owners.

The Spanish OWNERS file, with language code `es`, looks like:
```yaml
# See the OWNERS docs at https://go.k8s.io/owners
# This is the localization project for Spanish.
# Teams and members are visible at https://github.com/orgs/kubernetes/teams.

reviewers:
- sig-docs-es-reviews

approvers:
- sig-docs-es-owners

labels:
- language/es
```
After adding the language-specific `OWNERS` file, update the root `OWNERS_ALIASES` file with the new Kubernetes teams for the localization, `sig-docs-**-owners` and `sig-docs-**-reviews`.
For each team, add the list of GitHub users requested in Add your localization team in GitHub, in alphabetical order.
```diff
--- a/OWNERS_ALIASES
+++ b/OWNERS_ALIASES
@@ -48,6 +48,14 @@ aliases:
     - stewart-yu
     - xiangpengzhao
     - zhangxiaoyu-zidif
+  sig-docs-es-owners: # Admins for Spanish content
+    - alexbrand
+    - raelga
+  sig-docs-es-reviews: # PR reviews for Spanish content
+    - alexbrand
+    - electrocucaracha
+    - glo-pena
+    - raelga
   sig-docs-fr-owners: # Admins for French content
     - perriea
     - remyleone
```
Open a pull request
Next, open a pull request (PR) to add a localization to the `kubernetes/website` repository.
The PR must include all of the minimum required content before it can be approved.
For an example of adding a new localization, see the PR to enable docs in French.
Add a localized README file
To guide other localization contributors, add a new `README-**.md` to the top level of k/website, where `**` is the two-letter language code. For example, a German README file would be `README-de.md`.

Provide guidance to localization contributors in the localized `README-**.md` file. Include the same information contained in `README.md` as well as:
- A point of contact for the localization project
- Any information specific to the localization
After you create the localized README, add a link to the file from the main English `README.md`, and include contact information in English. You can provide a GitHub ID, email address, Slack channel, or other method of contact. You must also provide a link to your localized Community Code of Conduct.
Launching your new localization
Once a localization meets requirements for workflow and minimum output, SIG Docs will:
- Enable language selection on the website
- Publicize the localization's availability through Cloud Native Computing Foundation (CNCF) channels, including the Kubernetes blog.
Translating content
Localizing all of the Kubernetes documentation is an enormous task. It's okay to start small and expand over time.
Minimum required content
At a minimum, all localizations must include:
Description | URLs |
---|---|
Home | All heading and subheading URLs |
Setup | All heading and subheading URLs |
Tutorials | Kubernetes Basics, Hello Minikube |
Site strings | All site strings in a new localized TOML file |
Releases | All heading and subheading URLs |
Translated documents must reside in their own `content/**/` subdirectory, but otherwise follow the same URL path as the English source. For example, to prepare the Kubernetes Basics tutorial for translation into German, create a subfolder under the `content/de/` folder and copy the English source:

```shell
mkdir -p content/de/docs/tutorials
cp content/en/docs/tutorials/kubernetes-basics.md content/de/docs/tutorials/kubernetes-basics.md
```
Translation tools can speed up the translation process. For example, some editors offer plugins to quickly translate text.
To ensure accuracy in grammar and meaning, members of your localization team should carefully review all machine-generated translations before publishing.
Source files
Localizations must be based on the English files from a specific release targeted by the localization team. Each localization team can decide which release to target, which is referred to as the target version below.
To find source files for your target version:
- Navigate to the Kubernetes website repository at https://github.com/kubernetes/website.
- Select a branch for your target version from the following table:

Target version | Branch |
---|---|
Latest version | `main` |
Previous version | `release-1.22` |
Next version | `dev-1.24` |

The `main` branch holds content for the current release `v1.23`. The release team will create a `release-1.23` branch before the next release: v1.24.
Site strings in i18n
Localizations must include the contents of `data/i18n/en/en.toml` in a new language-specific file. Using German as an example: `data/i18n/de/de.toml`.

Add a new localization directory and file to `data/i18n/`. For example, with German (`de`):

```shell
mkdir -p data/i18n/de
cp data/i18n/en/en.toml data/i18n/de/de.toml
```
Revise the comments at the top of the file to suit your localization, then translate the value of each string. For example, this is the German-language placeholder text for the search form:
```toml
[ui_search_placeholder]
other = "Suchen"
```
Localizing site strings lets you customize site-wide text and features: for example, the legal copyright text in the footer on each page.
Language specific style guide and glossary
Some language teams have their own language-specific style guide and glossary. For example, see the Korean Localization Guide.
Language specific Zoom meetings
If the localization project needs a separate meeting time, contact a SIG Docs Co-Chair or Tech Lead to create a new recurring Zoom meeting and calendar invite. This is only needed when the team is large enough to sustain and require a separate meeting.
Per CNCF policy, the localization teams must upload their meetings to the SIG Docs YouTube playlist. A SIG Docs Co-Chair or Tech Lead can help with the process until SIG Docs automates it.
Branching strategy
Because localization projects are highly collaborative efforts, we encourage teams to work in shared localization branches - especially when starting out and the localization is not yet live.
To collaborate on a localization branch:
1. A team member of @kubernetes/website-maintainers opens a localization branch from a source branch on https://github.com/kubernetes/website.

   Your team approvers joined the `@kubernetes/website-maintainers` team when you added your localization team to the `kubernetes/org` repository.

   We recommend the following branch naming scheme:

   `dev-<source version>-<language code>.<team milestone>`

   For example, an approver on a German localization team opens the localization branch `dev-1.12-de.1` directly against the k/website repository, based on the source branch for Kubernetes v1.12.

2. Individual contributors open feature branches based on the localization branch.

   For example, a German contributor opens a pull request with changes to `kubernetes:dev-1.12-de.1` from `username:local-branch-name`.

3. Approvers review and merge feature branches into the localization branch.

4. Periodically, an approver merges the localization branch to its source branch by opening and approving a new pull request. Be sure to squash the commits before approving the pull request.

Repeat steps 1-4 as needed until the localization is complete. For example, subsequent German localization branches would be: `dev-1.12-de.2`, `dev-1.12-de.3`, etc. (A sketch of steps 1 and 2 in git commands follows below.)
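A minimal sketch of steps 1 and 2 as git commands, assuming a German team targeting v1.12 and a remote named `upstream` that points to kubernetes/website (branch and remote names are illustrative):

```shell
# Step 1: a website maintainer opens the localization branch on k/website
git fetch upstream
git checkout -b dev-1.12-de.1 upstream/release-1.12
git push upstream dev-1.12-de.1

# Step 2: a contributor bases a feature branch on the localization branch
git checkout -b local-branch-name upstream/dev-1.12-de.1
# ...translate and commit, then open a PR against kubernetes:dev-1.12-de.1
```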
Teams must merge localized content into the same branch from which the content was sourced.
For example:
- a localization branch sourced from `main` must be merged into `main`.
- a localization branch sourced from `release-1.22` must be merged into `release-1.22`.
Note: If your localization branch was created from the `main` branch but it is not merged into `main` before the new release branch `release-1.23` is created, merge it into both `main` and the new release branch `release-1.23`. To merge your localization branch into the new release branch `release-1.23`, you need to switch the upstream branch of your localization branch to `release-1.23`.
At the beginning of every team milestone, it's helpful to open an issue comparing upstream changes between the previous localization branch and the current localization branch. There are two scripts for comparing upstream changes. `upstream_changes.py` is useful for checking the changes made to a specific file, and `diff_l10n_branches.py` is useful for creating a list of outdated files for a specific localization branch.
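Both scripts live in the kubernetes/website repository; their exact location and flags can change, so check the built-in help before relying on them (a hedged sketch, assuming they sit under `scripts/` and support `--help`):

```shell
python3 scripts/upstream_changes.py --help
python3 scripts/diff_l10n_branches.py --help
```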
While only approvers can open a new localization branch and merge pull requests, anyone can open a pull request for a new localization branch. No special permissions are required.
For more information about working from forks or directly from the repository, see "fork and clone the repo".
Upstream contributions
SIG Docs welcomes upstream contributions and corrections to the English source.
7.5 - Participating in SIG Docs
SIG Docs is one of the special interest groups within the Kubernetes project, focused on writing, updating, and maintaining the documentation for Kubernetes as a whole. See SIG Docs from the community github repo for more information about the SIG.
SIG Docs welcomes content and reviews from all contributors. Anyone can open a pull request (PR), and anyone is welcome to file issues about content or comment on pull requests in progress.
You can also become a member, reviewer, or approver. These roles require greater access and entail certain responsibilities for approving and committing changes. See community-membership for more information on how membership works within the Kubernetes community.
The rest of this document outlines some unique ways these roles function within SIG Docs, which is responsible for maintaining one of the most public-facing aspects of Kubernetes -- the Kubernetes website and documentation.
SIG Docs chairperson
Each SIG, including SIG Docs, selects one or more SIG members to act as chairpersons. These are points of contact between SIG Docs and other parts of the Kubernetes organization. They require extensive knowledge of the structure of the Kubernetes project as a whole and how SIG Docs works within it. See Leadership for the current list of chairpersons.
SIG Docs teams and automation
Automation in SIG Docs relies on two different mechanisms: GitHub teams and OWNERS files.
GitHub teams
There are two categories of SIG Docs teams on GitHub:
- `@sig-docs-{language}-owners` are approvers and leads
- `@sig-docs-{language}-reviewers` are reviewers

Each can be referenced with their `@name` in GitHub comments to communicate with everyone in that group.
Sometimes Prow and GitHub teams overlap without matching exactly. For assignment of issues, pull requests, and to support PR approvals, the automation uses information from `OWNERS` files.
OWNERS files and front-matter
The Kubernetes project uses an automation tool called prow for automation related to GitHub issues and pull requests. The Kubernetes website repository uses two prow plugins:
- blunderbuss
- approve
These two plugins use the OWNERS and OWNERS_ALIASES files in the top level of the `kubernetes/website` GitHub repository to control how prow works within the repository.
An OWNERS file contains a list of people who are SIG Docs reviewers and approvers. OWNERS files can also exist in subdirectories, and can override who can act as a reviewer or approver of files in that subdirectory and its descendants. For more information about OWNERS files in general, see OWNERS.
In addition, an individual Markdown file can list reviewers and approvers in its front-matter, either by listing individual GitHub usernames or GitHub groups.
The combination of OWNERS files and front-matter in Markdown files determines the advice PR owners get from automated systems about who to ask for technical and editorial review of their PR.
How merging works
When a pull request is merged to the branch used to publish content, that content is published to https://kubernetes.io. To ensure that the quality of our published content is high, we limit merging pull requests to SIG Docs approvers. Here's how it works.
- When a pull request has both the `lgtm` and `approve` labels, has no `hold` labels, and all tests are passing, the pull request merges automatically.
- Kubernetes organization members and SIG Docs approvers can add comments to prevent automatic merging of a given pull request (by adding a `/hold` comment or withholding a `/lgtm` comment).
- Any Kubernetes member can add the `lgtm` label by adding a `/lgtm` comment.
- Only SIG Docs approvers can merge a pull request by adding an `/approve` comment. Some approvers also perform additional specific roles, such as PR Wrangler or SIG Docs chairperson.
What's next
For more information about contributing to the Kubernetes documentation, see:
7.5.1 - Roles and responsibilities
Anyone can contribute to Kubernetes. As your contributions to SIG Docs grow, you can apply for different levels of membership in the community. These roles allow you to take on more responsibility within the community. Each role requires more time and commitment. The roles are:
- Anyone: regular contributors to the Kubernetes documentation
- Members: can assign and triage issues and provide non-binding review on pull requests
- Reviewers: can lead reviews on documentation pull requests and can vouch for a change's quality
- Approvers: can lead reviews on documentation and merge changes
Anyone
Anyone with a GitHub account can contribute to Kubernetes. SIG Docs welcomes all new contributors!
Anyone can:
- Open an issue in any Kubernetes repository, including `kubernetes/website`
- Give non-binding feedback on a pull request
- Contribute to a localization
- Suggest improvements on Slack or the SIG docs mailing list.
After signing the CLA, anyone can also:
- Open a pull request to improve existing content, add new content, or write a blog post or case study
- Create diagrams, graphics assets, and embeddable screencasts and videos
For more information, see contributing new content.
Members
A member is someone who has submitted multiple pull requests to `kubernetes/website`. Members are a part of the Kubernetes GitHub organization.
Members can:
- Do everything listed under Anyone

- Use the `/lgtm` comment to add the LGTM (looks good to me) label to a pull request

  Note: Using `/lgtm` triggers automation. If you want to provide non-binding approval, commenting "LGTM" works too!

- Use the `/hold` comment to block merging for a pull request

- Use the `/assign` comment to assign a reviewer to a pull request

- Provide non-binding review on pull requests

- Use automation to triage and categorize issues

- Document new features
Becoming a member
After submitting at least 5 substantial pull requests and meeting the other requirements:
- Find two reviewers or approvers to sponsor your membership.

  Ask for sponsorship in the #sig-docs channel on Slack or on the SIG Docs mailing list.

  Note: Don't send a direct email or Slack direct message to an individual SIG Docs member. You must request sponsorship before submitting your application.

- Open a GitHub issue in the `kubernetes/org` repository. Use the Organization Membership Request issue template.

- Let your sponsors know about the GitHub issue. You can either:

  - Mention their GitHub username in an issue (`@<GitHub-username>`)
  - Send them the issue link using Slack or email.

  Sponsors will approve your request with a `+1` vote. Once your sponsors approve the request, a Kubernetes GitHub admin adds you as a member. Congratulations!

  If your membership request is not accepted you will receive feedback. After addressing the feedback, apply again.

- Accept the invitation to the Kubernetes GitHub organization in your email account.

  Note: GitHub sends the invitation to the default email address in your account.
Reviewers
Reviewers are responsible for reviewing open pull requests. Unlike member feedback, you must address reviewer feedback. Reviewers are members of the @kubernetes/sig-docs-{language}-reviews GitHub team.
Reviewers can:
- Review pull requests and provide binding feedback

  Note: To provide non-binding feedback, prefix your comments with a phrase like "Optionally: ".

- Edit user-facing strings in code

- Improve code comments
You can be a SIG Docs reviewer, or a reviewer for docs in a specific subject area.
Assigning reviewers to pull requests
Automation assigns reviewers to all pull requests. You can request a review from a specific person by commenting: `/assign [@_github_handle]`.
If the assigned reviewer has not commented on the PR, another reviewer can step in. You can also assign technical reviewers as needed.
Using `/lgtm`
LGTM stands for "Looks good to me" and indicates that a pull request is technically accurate and ready to merge. All PRs need a `/lgtm` comment from a reviewer and an `/approve` comment from an approver to merge.

A `/lgtm` comment from a reviewer is binding and triggers automation that adds the `lgtm` label.
Becoming a reviewer
When you meet the requirements, you can become a SIG Docs reviewer. Reviewers in other SIGs must apply separately for reviewer status in SIG Docs.
To apply:
- Open a pull request that adds your GitHub user name to a section of the OWNERS_ALIASES file in the `kubernetes/website` repository.

  Note: If you aren't sure where to add yourself, add yourself to `sig-docs-en-reviews`.

- Assign the PR to one or more SIG-Docs approvers (user names listed under `sig-docs-{language}-owners`).

If approved, a SIG Docs lead adds you to the appropriate GitHub team. Once added, K8s-ci-robot assigns and suggests you as a reviewer on new pull requests.
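A minimal sketch of what that OWNERS_ALIASES change looks like (the username is illustrative):

```diff
   sig-docs-en-reviews:
+    - your-github-username
```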
Approvers
Approvers review and approve pull requests for merging. Approvers are members of the @kubernetes/sig-docs-{language}-owners GitHub teams.
Approvers can do the following:
- Everything listed under Anyone, Members and Reviewers
- Publish contributor content by approving and merging pull requests using the `/approve` comment
comment - Propose improvements to the style guide
- Propose improvements to docs tests
- Propose improvements to the Kubernetes website or other tooling
If the PR already has a `/lgtm`, or if the approver also comments with `/lgtm`, the PR merges automatically. A SIG Docs approver should only leave a `/lgtm` on a change that doesn't need additional technical review.
Approving pull requests
Approvers and SIG Docs leads are the only ones who can merge pull requests into the website repository. This comes with certain responsibilities.
- Approvers can use the `/approve` command, which merges PRs into the repo.

  Warning: A careless merge can break the site, so be sure that when you merge something, you mean it.

- Make sure that proposed changes meet the contribution guidelines.

  If you ever have a question, or you're not sure about something, feel free to call for additional review.

- Verify that Netlify tests pass before you `/approve` a PR.

- Visit the Netlify page preview for a PR to make sure things look good before approving.

- Participate in the PR Wrangler rotation schedule for weekly rotations. SIG Docs expects all approvers to participate in this rotation. See PR wranglers for more details.
Becoming an approver
When you meet the requirements, you can become a SIG Docs approver. Approvers in other SIGs must apply separately for approver status in SIG Docs.
To apply:
- Open a pull request adding yourself to a section of the OWNERS_ALIASES file in the `kubernetes/website` repository.

  Note: If you aren't sure where to add yourself, add yourself to `sig-docs-en-owners`.

- Assign the PR to one or more current SIG Docs approvers.

If approved, a SIG Docs lead adds you to the appropriate GitHub team. Once added, @k8s-ci-robot assigns and suggests you as a reviewer on new pull requests.
What's next
- Read about PR wrangling, a role all approvers take on rotation.
7.5.2 - PR wranglers
SIG Docs approvers take week-long shifts managing pull requests for the repository.
This section covers the duties of a PR wrangler. For more information on giving good reviews, see Reviewing changes.
Duties
Each day in a week-long shift as PR Wrangler:
- Triage and tag incoming issues daily. See Triage and categorize issues for guidelines on how SIG Docs uses metadata.
- Review open pull requests for quality and adherence to the Style and Content guides.
  - Start with the smallest PRs (`size/XS`) first, and end with the largest (`size/XXL`). Review as many PRs as you can.
- Make sure PR contributors sign the CLA.
  - Use this script to remind contributors that haven't signed the CLA to do so.
- Provide feedback on changes and ask for technical reviews from members of other SIGs.
  - Provide inline suggestions on the PR for the proposed content changes.
  - If you need to verify content, comment on the PR and request more details.
  - Assign relevant `sig/` label(s).
  - If needed, assign reviewers from the `reviewers:` block in the file's front matter.
  - You can also tag a SIG for a review by commenting `@kubernetes/<sig>-pr-reviews` on the PR.
- Use the `/approve` comment to approve a PR for merging. Merge the PR when ready.
  - PRs should have a `/lgtm` comment from another member before merging.
  - Consider accepting technically accurate content that doesn't meet the style guidelines. As you approve the change, open a new issue to address the style concern. You can usually write these style fix issues as good first issues.
  - Using style fixups as good first issues is a good way to ensure a supply of easier tasks to help onboard new contributors.
Helpful GitHub queries for wranglers
The following queries are helpful when wrangling. After working through these queries, the remaining list of PRs to review is usually small. These queries exclude localization PRs. All queries are against the main branch except the last one.
- No CLA, not eligible to merge: Remind the contributor to sign the CLA. If both the bot and a human have reminded them, close the PR and remind them that they can open it after signing the CLA. Do not review PRs whose authors have not signed the CLA!
- Needs LGTM: Lists PRs that need an LGTM from a member. If the PR needs technical review, loop in one of the reviewers suggested by the bot. If the content needs work, add suggestions and feedback in-line.
- Has LGTM, needs docs approval: Lists PRs that need an `/approve` comment to merge.
- Not against the primary branch: If the PR is against a `dev-` branch, it's for an upcoming release. Assign the docs release manager using: `/assign @<manager's_github-username>`. If the PR is against an old branch, help the author figure out whether it's targeted against the best branch.
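If one of the saved query links is unavailable, you can approximate it in the GitHub search box. For example, a rough equivalent of the "Quick Wins" query (an approximation, not the canonical query):

```
is:pr is:open base:main label:size/XS -label:do-not-merge/hold
```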
Helpful Prow commands for wranglers
```
# add English label
/language en

# add squash label to PR if more than one commit
/label tide/merge-method-squash

# retitle a PR via Prow (such as a work-in-progress [WIP] or better detail of PR)
/retitle [WIP] <TITLE>
```
When to close Pull Requests
Reviews and approvals are one tool to keep our PR queue short and current. Another tool is closure.
Close PRs where:
- The author hasn't signed the CLA for two weeks.

  Authors can reopen the PR after signing the CLA. This is a low-risk way to make sure nothing gets merged without a signed CLA.

- The author has not responded to comments or feedback in 2 or more weeks.
Don't be afraid to close pull requests. Contributors can easily reopen and resume works in progress. Often a closure notice is what spurs an author to resume and finish their contribution.
To close a pull request, leave a `/close` comment on the PR.

The `fejta-bot` bot marks issues as stale after 90 days of inactivity. After 30 more days it marks issues as rotten and closes them. PR wranglers should close issues after 14-30 days of inactivity.
7.6 - Documentation style overview
The topics in this section provide guidance on writing style, content formatting and organization, and using Hugo customizations specific to Kubernetes documentation.
7.6.1 - Documentation Content Guide
This page contains guidelines for Kubernetes documentation.
If you have questions about what's allowed, join the #sig-docs channel in Kubernetes Slack and ask!
You can register for Kubernetes Slack at https://slack.k8s.io/.
For information on creating new content for the Kubernetes docs, follow the style guide.
Overview
Source for the Kubernetes website, including the docs, resides in the kubernetes/website repository.
Located in the `kubernetes/website/content/<language_code>/docs` folder, the majority of Kubernetes documentation is specific to the Kubernetes project.
What's allowed
Kubernetes docs allow content for third-party projects only when:
- Content documents software in the Kubernetes project
- Content documents software that's out of project but necessary for Kubernetes to function
- Content is canonical on kubernetes.io, or links to canonical content elsewhere
Third party content
Kubernetes documentation includes applied examples of projects in the Kubernetes project—projects that live in the kubernetes and kubernetes-sigs GitHub organizations.
Links to active content in the Kubernetes project are always allowed.
Kubernetes requires some third party content to function. Examples include container runtimes (containerd, CRI-O, Docker), networking policy (CNI plugins), Ingress controllers, and logging.
Docs can link to third-party open source software (OSS) outside the Kubernetes project only if it's necessary for Kubernetes to function.
Dual sourced content
Wherever possible, Kubernetes docs link to canonical sources instead of hosting dual-sourced content.
Dual-sourced content requires double the effort (or more!) to maintain and grows stale more quickly.
More information
If you have questions about allowed content, join the Kubernetes Slack #sig-docs channel and ask!
What's next
- Read the Style guide.
7.6.2 - Documentation Style Guide
This page gives writing style guidelines for the Kubernetes documentation. These are guidelines, not rules. Use your best judgment, and feel free to propose changes to this document in a pull request.
For additional information on creating new content for the Kubernetes documentation, read the Documentation Content Guide.
Changes to the style guide are made by SIG Docs as a group. To propose a change or addition, add it to the agenda for an upcoming SIG Docs meeting, and attend the meeting to participate in the discussion.
Language
Kubernetes documentation has been translated into multiple languages (see Localization READMEs).
The way of localizing the docs for a different language is described in Localizing Kubernetes Documentation.
The English-language documentation uses U.S. English spelling and grammar.
Documentation formatting standards
Use upper camel case for API objects
When you refer specifically to interacting with an API object, use UpperCamelCase, also known as Pascal case. You may see different capitalization, such as "configMap", in the API Reference. When writing general documentation, it's better to use upper camel case, calling it "ConfigMap" instead.
When you are generally discussing an API object, use sentence-style capitalization.
You may use the word "resource", "API", or "object" to clarify a Kubernetes resource type in a sentence.
Don't split an API object name into separate words. For example, use PodTemplateList, not Pod Template List.
The following examples focus on capitalization. For more information about formatting API object names, review the related guidance on Code Style.
Do | Don't |
---|---|
The HorizontalPodAutoscaler resource is responsible for ... | The Horizontal pod autoscaler is responsible for ... |
A PodList object is a list of pods. | A Pod List object is a list of pods. |
The Volume object contains a `hostPath` field. | The volume object contains a hostPath field. |
Every ConfigMap object is part of a namespace. | Every configMap object is part of a namespace. |
For managing confidential data, consider using the Secret API. | For managing confidential data, consider using the secret API. |
Use angle brackets for placeholders
Use angle brackets for placeholders. Tell the reader what a placeholder represents, for example:
Display information about a pod:
```shell
kubectl describe pod <pod-name> -n <namespace>
```

If the namespace of the pod is `default`, you can omit the `-n` parameter.
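For instance, with both placeholders filled in (the pod name is illustrative):

```shell
kubectl describe pod nginx -n default
# equivalent, because the namespace is default:
kubectl describe pod nginx
```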
Use bold for user interface elements
Do | Don't |
---|---|
Click **Fork**. | Click "Fork". |
Select **Other**. | Select "Other". |
Use italics to define or introduce new terms
Do | Don't |
---|---|
A _cluster_ is a set of nodes ... | A "cluster" is a set of nodes ... |
These components form the _control plane_. | These components form the **control plane**. |
Use code style for filenames, directories, and paths
Do | Don't |
---|---|
Open the `envars.yaml` file. | Open the envars.yaml file. |
Go to the `/docs/tutorials` directory. | Go to the /docs/tutorials directory. |
Open the `/_data/concepts.yaml` file. | Open the /_data/concepts.yaml file. |
Use the international standard for punctuation inside quotes
Do | Don't |
---|---|
events are recorded with an associated "stage". | events are recorded with an associated "stage." |
The copy is called a "fork". | The copy is called a "fork." |
Inline code formatting
Use code style for inline code, commands, and API objects
For inline code in an HTML document, use the `<code>` tag. In a Markdown document, use the backtick (`` ` ``).
Do | Don't |
---|---|
The `kubectl run` command creates a `Pod`. | The "kubectl run" command creates a pod. |
The kubelet on each node acquires a `Lease`… | The kubelet on each node acquires a lease… |
A `PersistentVolume` represents durable storage… | A Persistent Volume represents durable storage… |
For declarative management, use `kubectl apply`. | For declarative management, use "kubectl apply". |
Enclose code samples with triple backticks. (```) | Enclose code samples with any other syntax. |
Use single backticks to enclose inline code. For example, `var example = true`. | Use two asterisks (`**`) or an underscore (`_`) to enclose inline code. For example, **var example = true**. |
Use triple backticks before and after a multi-line block of code for fenced code blocks. | Use multi-line blocks of code to create diagrams, flowcharts, or other illustrations. |
Use meaningful variable names that have a context. | Use variable names such as 'foo', 'bar', and 'baz' that are not meaningful and lack context. |
Remove trailing spaces in the code. | Add trailing spaces in the code, where these are important, because the screen reader will read out the spaces as well. |
Use code style for object field names and namespaces
Do | Don't |
---|---|
Set the value of the `replicas` field in the configuration file. | Set the value of the "replicas" field in the configuration file. |
The value of the `exec` field is an ExecAction object. | The value of the "exec" field is an ExecAction object. |
Run the process as a DaemonSet in the `kube-system` namespace. | Run the process as a DaemonSet in the kube-system namespace. |
Use code style for Kubernetes command tool and component names
Do | Don't |
---|---|
The `kubelet` preserves node stability. | The kubelet preserves node stability. |
The `kubectl` handles locating and authenticating to the API server. | The kubectl handles locating and authenticating to the apiserver. |
Run the process with the certificate, `kube-apiserver --client-ca-file=FILENAME`. | Run the process with the certificate, kube-apiserver --client-ca-file=FILENAME. |
Starting a sentence with a component tool or component name
Do | Don't |
---|---|
The `kubeadm` tool bootstraps and provisions machines in a cluster. | `kubeadm` tool bootstraps and provisions machines in a cluster. |
The kube-scheduler is the default scheduler for Kubernetes. | kube-scheduler is the default scheduler for Kubernetes. |
Use a general descriptor over a component name
Do | Don't |
---|---|
The Kubernetes API server offers an OpenAPI spec. | The apiserver offers an OpenAPI spec. |
Aggregated APIs are subordinate API servers. | Aggregated APIs are subordinate APIServers. |
Use normal style for string and integer field values
For field values of type string or integer, use normal style without quotation marks.
Do | Don't |
---|---|
Set the value of `imagePullPolicy` to Always. | Set the value of `imagePullPolicy` to "Always". |
Set the value of `image` to nginx:1.16. | Set the value of `image` to `nginx:1.16`. |
Set the value of the `replicas` field to 2. | Set the value of the `replicas` field to `2`. |
Code snippet formatting
Don't include the command prompt
Do | Don't |
---|---|
`kubectl get pods` | `$ kubectl get pods` |
Separate commands from output
Verify that the pod is running on your chosen node:

```shell
kubectl get pods --output=wide
```

The output is similar to this:

```
NAME     READY     STATUS    RESTARTS   AGE    IP           NODE
nginx    1/1       Running   0          13s    10.200.0.4   worker0
```
Versioning Kubernetes examples
Code examples and configuration examples that include version information should be consistent with the accompanying text.
If the information is version specific, the Kubernetes version needs to be defined in the prerequisites
section of the Task template or the Tutorial template. Once the page is saved, the prerequisites
section is shown as Before you begin.
To specify the Kubernetes version for a task or tutorial page, include min-kubernetes-server-version
in the front matter of the page.
If the example YAML is in a standalone file, find and review the topics that include it as a reference. Verify that any topics using the standalone YAML have the appropriate version information defined. If a stand-alone YAML file is not referenced from any topics, consider deleting it instead of updating it.
For example, if you are writing a tutorial that is relevant to Kubernetes version 1.8, the front-matter of your markdown file should look something like:
---
title: <your tutorial title here>
min-kubernetes-server-version: v1.8
---
In code and configuration examples, do not include comments about alternative versions. Be careful to not include incorrect statements in your examples as comments, such as:
apiVersion: v1 # earlier versions use...
kind: Pod
...
Kubernetes.io word list
A list of Kubernetes-specific terms and words to be used consistently across the site.
Term | Usage |
---|---|
Kubernetes | Kubernetes should always be capitalized. |
Docker | Docker should always be capitalized. |
SIG Docs | SIG Docs rather than SIG-DOCS or other variations. |
On-premises | On-premises or On-prem rather than On-premise or other variations. |
Shortcodes
Hugo shortcodes help create different rhetorical appeal levels. Our documentation supports three different shortcodes in this category: Note {{< note >}}, Caution {{< caution >}}, and Warning {{< warning >}}.
- Surround the text with an opening and closing shortcode.
- Use the following syntax to apply a style:
{{< note >}} No need to include a prefix; the shortcode automatically provides one. (Note:, Caution:, etc.) {{< /note >}}
The output is:
Note: The prefix you choose is the same text for the tag.
Note
Use {{< note >}} to highlight a tip or a piece of information that may be helpful to know.
For example:
{{< note >}}
You can _still_ use Markdown inside these callouts.
{{< /note >}}
The output is:
You can use a {{< note >}} in a list:
1. Use the note shortcode in a list
1. A second item with an embedded note
{{< note >}}
Warning, Caution, and Note shortcodes, embedded in lists, need to be indented four spaces. See [Common Shortcode Issues](#common-shortcode-issues).
{{< /note >}}
1. A third item in a list
1. A fourth item in a list
The output is:

1. Use the note shortcode in a list
1. A second item with an embedded note

    Note: Warning, Caution, and Note shortcodes, embedded in lists, need to be indented four spaces. See Common Shortcode Issues.

1. A third item in a list
1. A fourth item in a list
Caution
Use {{< caution >}} to call attention to an important piece of information to avoid pitfalls.
For example:
{{< caution >}}
The callout style only applies to the line directly above the tag.
{{< /caution >}}
The output is:
Warning
Use {{< warning >}} to indicate danger or a piece of information that is crucial to follow.
For example:
{{< warning >}}
Beware.
{{< /warning >}}
The output is:
Katacoda Embedded Live Environment
This button lets users run Minikube in their browser using the Katacoda Terminal. It lowers the barrier to entry by allowing users to try Minikube with one click instead of going through the complete Minikube and kubectl installation process locally. The Embedded Live Environment is configured to run minikube start and lets users complete tutorials in the same window as the documentation.
For example:
{{< kat-button >}}
The output is:
Common Shortcode Issues
Ordered Lists
Shortcodes will interrupt numbered lists unless you indent four spaces before the notice and the tag.
For example:
1. Preheat oven to 350˚F
1. Prepare the batter, and pour into springform pan.
`{{< note >}}Grease the pan for best results.{{< /note >}}`
1. Bake for 20-25 minutes or until set.
The output is:
1. Preheat oven to 350˚F
1. Prepare the batter, and pour into springform pan.

    Note: Grease the pan for best results.

1. Bake for 20-25 minutes or until set.
Include Statements
Shortcodes inside include statements will break the build. You must insert them in the parent document, before and after you call the include. For example:
{{< note >}}
{{< include "task-tutorial-prereqs.md" >}}
{{< /note >}}
Markdown elements
Line breaks
Use a single newline to separate block-level content like headings, lists, images, code blocks, and others. The exception is second-level headings, where it should be two newlines. Second-level headings follow the first-level heading (or the title) without any preceding paragraphs or text. Two-line spacing makes the overall structure of the content easier to see in a code editor.
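As a hedged illustration of this convention (reading "newline" here as a blank separator line; the heading text is hypothetical):

```markdown
A paragraph that closes the previous topic.

### A third-level heading

One blank line separates most block-level content.


## A second-level heading

Two blank lines precede each second-level heading, which makes the
overall structure easier to scan in a code editor.
```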
Headings
People accessing this documentation may use a screen reader or other assistive technology (AT). Screen readers are linear output devices; they output items on a page one at a time. If there is a lot of content on a page, you can use headings to give the page an internal structure. A good page structure helps all readers to easily navigate the page or filter topics of interest.
Do | Don't |
---|---|
Update the title in the front matter of the page or blog post. | Use first level heading, as Hugo automatically converts the title in the front matter of the page into a first-level heading. |
Use ordered headings to provide a meaningful high-level outline of your content. | Use headings level 4 through 6, unless it is absolutely necessary. If your content is that detailed, it may need to be broken into separate articles. |
Use pound or hash signs (`#`) for non-blog post content. | Use underlines (`---` or `===`) to designate first-level headings. |
Use sentence case for headings. For example, Extend kubectl with plugins | Use title case for headings. For example, Extend Kubectl With Plugins |
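For instance, a minimal sketch of a heading hierarchy following these rules (the first heading is the sentence-case example above; the subheadings are hypothetical):

```markdown
## Extend kubectl with plugins

### Before you install a plugin

### Installing a plugin
```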
Paragraphs
Do | Don't |
---|---|
Try to keep paragraphs under 6 sentences. | Indent the first paragraph with space characters. For example, ⋅⋅⋅Three spaces before a paragraph will indent it. |
Use three hyphens (`---`) to create a horizontal rule. Use horizontal rules for breaks in paragraph content. For example, a change of scene in a story, or a shift of topic within a section. | Use horizontal rules for decoration. |
Links
Do | Don't |
---|---|
Write hyperlinks that give you context for the content they link to. For example: Certain ports are open on your machines. See Check required ports for more details. | Use ambiguous terms such as "click here". For example: Certain ports are open on your machines. See here for more details. |
Write Markdown-style links: `[link text](URL)`. For example: [Hugo shortcodes](/docs/contribute/style/hugo-shortcodes/#table-captions) and the output is Hugo shortcodes. | Write HTML-style links: `<a href="/media/examples/link-element-example.css" target="_blank">Visit our tutorial!</a>`, or create links that open in new tabs or windows. For example: [example website](https://example.com){target="_blank"} |
Lists
Group items in a list that are related to each other and need to appear in a specific order or to indicate a correlation between multiple items. When a screen reader comes across a list—whether it is an ordered or unordered list—it will be announced to the user that there is a group of list items. The user can then use the arrow keys to move up and down between the various items in the list. Website navigation links can also be marked up as list items; after all they are nothing but a group of related links.
- End each item in a list with a period if one or more items in the list are complete sentences. For the sake of consistency, normally either all items or none should be complete sentences.

    Note: Ordered lists that are part of an incomplete introductory sentence can be in lowercase and punctuated as if each item was a part of the introductory sentence.

- Use the number one (`1.`) for ordered lists.
- Use (`+`), (`*`), or (`-`) for unordered lists.
- Leave a blank line after each list.
- Indent nested lists with four spaces (for example, ⋅⋅⋅⋅).
- List items may consist of multiple paragraphs. Each subsequent paragraph in a list item must be indented by either four spaces or one tab.
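Taken together, a short sketch that follows these list rules might look like this (the item text is hypothetical):

```markdown
1. Prepare the configuration file.
1. Apply the configuration file.

    A subsequent paragraph in a list item is indented four spaces.

- An unordered item, using the `-` marker.
    - A nested item, indented four spaces.

A blank line follows each list before the next block of content.
```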
Tables
The semantic purpose of a data table is to present tabular data. Sighted users can quickly scan the table but a screen reader goes through line by line. A table caption is used to create a descriptive title for a data table. Assistive technologies (AT) use the HTML table caption element to identify the table contents to the user within the page structure.
- Add table captions using Hugo shortcodes for tables.
Content best practices
This section contains suggested best practices for clear, concise, and consistent content.
Use present tense
Do | Don't |
---|---|
This command starts a proxy. | This command will start a proxy. |
Exception: Use future or past tense if it is required to convey the correct meaning.
Use active voice
Do | Don't |
---|---|
You can explore the API using a browser. | The API can be explored using a browser. |
The YAML file specifies the replica count. | The replica count is specified in the YAML file. |
Exception: Use passive voice if active voice leads to an awkward construction.
Use simple and direct language
Use simple and direct language. Avoid using unnecessary phrases, such as saying "please."
Do | Don't |
---|---|
To create a ReplicaSet, ... | In order to create a ReplicaSet, ... |
See the configuration file. | Please see the configuration file. |
View the pods. | With this next command, we'll view the pods. |
Address the reader as "you"
Do | Don't |
---|---|
You can create a Deployment by ... | We'll create a Deployment by ... |
In the preceding output, you can see... | In the preceding output, we can see ... |
Avoid Latin phrases
Prefer English terms over Latin abbreviations.
Do | Don't |
---|---|
For example, ... | e.g., ... |
That is, ... | i.e., ... |
Exception: Use "etc." for et cetera.
Patterns to avoid
Avoid using "we"
Using "we" in a sentence can be confusing, because the reader might not know whether they're part of the "we" you're describing.
Do | Don't |
---|---|
Version 1.4 includes ... | In version 1.4, we have added ... |
Kubernetes provides a new feature for ... | We provide a new feature ... |
This page teaches you how to use pods. | In this page, we are going to learn about pods. |
Avoid jargon and idioms
Some readers speak English as a second language. Avoid jargon and idioms to help them understand better.
Do | Don't |
---|---|
Internally, ... | Under the hood, ... |
Create a new cluster. | Turn up a new cluster. |
Avoid statements about the future
Avoid making promises or giving hints about the future. If you need to talk about an alpha feature, put the text under a heading that identifies it as alpha information.
An exception to this rule is documentation about announced deprecations targeting removal in future versions. One example of documentation like this is the Deprecated API migration guide.
Avoid statements that will soon be out of date
Avoid words like "currently" and "new." A feature that is new today might not be considered new in a few months.
Do | Don't |
---|---|
In version 1.4, ... | In the current version, ... |
The Federation feature provides ... | The new Federation feature provides ... |
Avoid words that assume a specific level of understanding
Avoid words such as "just", "simply", "easy", "easily", or "simple". These words do not add value.
Do | Don't |
---|---|
Include one command in ... | Include just one command in ... |
Run the container ... | Simply run the container ... |
You can remove ... | You can easily remove ... |
These steps ... | These simple steps ... |
What's next
- Learn about writing a new topic.
- Learn about using page templates.
- Learn about creating a pull request.
7.6.3 - Diagram Guide
This guide shows you how to create, edit, and share diagrams using the Mermaid JavaScript library. Mermaid.js allows you to generate diagrams using a simple markdown-like syntax inside Markdown files. You can also use Mermaid to generate .svg or .png image files that you can add to your documentation.
The target audience for this guide is anybody wishing to learn about Mermaid and/or how to create and add diagrams to Kubernetes documentation.
Figure 1 outlines the topics covered in this section.
Figure 1. Topics covered in this section.
All you need to begin working with Mermaid is the following:
- A basic understanding of markdown.
- Using the Mermaid live editor.
- Using Hugo shortcodes.
- Using the Hugo {{< figure >}} shortcode.
- Performing Hugo local previews.
- Familiarity with the Contributing new content process.
Why you should use diagrams in documentation
Diagrams improve documentation clarity and comprehension. There are advantages for both the user and the contributor.
The user benefits include:
- Friendly landing spot. A detailed text-only greeting page could intimidate users, in particular, first-time Kubernetes users.
- Faster grasp of concepts. A diagram can help users understand the key points of a complex topic. Your diagram can serve as a visual learning guide to dive into the topic details.
- Better retention. For some, it is easier to recall pictures rather than text.
The contributor benefits include:
- Assist in developing the structure and content of your contribution. For example, you can start with a simple diagram covering the high-level points and then dive into details.
- Expand and grow the user community. Easily consumed documentation augmented with diagrams attracts new users who might previously have been reluctant to engage due to perceived complexities.
You should consider your target audience. In addition to experienced K8s users, you will have many who are new to Kubernetes. Even a simple diagram can assist new users in absorbing Kubernetes concepts. They become emboldened and more confident to further explore Kubernetes and the documentation.
Mermaid
Mermaid is an open source JavaScript library that allows you to create, edit and easily share diagrams using a simple, markdown-like syntax configured inline in Markdown files.
The following lists features of Mermaid:
- Simple code syntax.
- Includes a web-based tool allowing you to code and preview your diagrams.
- Supports multiple formats including flowchart, state and sequence.
- Easy collaboration with colleagues by sharing a per-diagram URL.
- Broad selection of shapes, lines, themes and styling.
The following lists advantages of using Mermaid:
- No need for separate, non-Mermaid diagram tools.
- Adheres to existing PR workflow. You can think of Mermaid code as just Markdown text included in your PR.
- Simple tool builds simple diagrams. You don't want to get bogged down (re)crafting an overly complex and detailed picture. Keep it simple!
Mermaid provides a simple, open and transparent method for the SIG communities to add, edit and collaborate on diagrams for new or existing documentation.
Live editor
The Mermaid live editor is a web-based tool that enables you to create, edit and review diagrams.
The following lists live editor functions:
- Displays Mermaid code and rendered diagram.
- Generates a URL for each saved diagram. The URL is displayed in the URL field of your browser. You can share the URL with colleagues who can access and modify the diagram.
- Option to download .svg or .png files.
Methods for creating diagrams
Figure 2 outlines the three methods to generate and add diagrams.
Figure 2. Methods to create diagrams.
Inline
Figure 3 outlines the steps to follow for adding a diagram using the Inline method.
Figure 3. Inline Method steps.
The following lists the steps you should follow for adding a diagram using the Inline method:

- Create your diagram using the live editor.
- Store the diagram URL somewhere for later access.
- Copy the mermaid code to the location in your .md file where you want the diagram to appear.
- Add a caption below the diagram using Markdown text.
A Hugo build runs the Mermaid code and turns it into a diagram.

Note: One benefit of keeping the diagram code in the .md file is that the Mermaid code is self-documenting. Contributors can copy the Mermaid code to and from the live editor for diagram edits.
Here is a sample code snippet contained in an .md file:
---
title: My PR
---
Figure 17 shows a simple A to B process.
some markdown text
...
{{< mermaid >}}
graph TB
A --> B
{{< /mermaid >}}
Figure 17. A to B
more text
You must include the {{< mermaid >}} and {{< /mermaid >}} shortcode tags at the start and end of the Mermaid code block, and you should add a diagram caption below the diagram. For more details on diagram captions, see How to use captions.
The following lists advantages of the Inline method:

- Live editor tool.
- Easy to copy Mermaid code to and from the live editor and your .md file.
- No need for separate .svg image file handling.
- Content text, diagram code, and diagram caption contained in the same .md file.
You should use the local and Netlify previews to verify the diagram is properly rendered.
Mermaid+SVG
Figure 4 outlines the steps to follow for adding a diagram using the Mermaid+SVG method.
Figure 4. Mermaid+SVG method steps.
The following lists the steps you should follow for adding a diagram using the Mermaid+SVG method:

- Create your diagram using the live editor.
- Store the diagram URL somewhere for later access.
- Generate an .svg image file for the diagram and download it to the appropriate images/ folder.
- Use the {{< figure >}} shortcode to reference the diagram in the .md file.
- Add a caption using the {{< figure >}} shortcode's caption parameter.
For example, use the live editor to create a diagram called boxnet. Store the diagram URL somewhere for later access. Generate and download a boxnet.svg file to the appropriate ../images/ folder. Use the {{< figure >}} shortcode in your PR's .md file to reference the .svg image file and add a caption.
{{< figure src="/static/images/boxnet.svg" alt="Boxnet figure" class="diagram-large" caption="Figure 14. Boxnet caption" >}}
For more details on diagram captions, see How to use captions.
The {{< figure >}} shortcode is the preferred method for adding .svg image files to your documentation. You can also use the standard markdown image syntax, like so: ![my boxnet diagram](static/images/boxnet.svg). If you do, you will need to add a caption below the diagram.
You should add the live editor URL as a comment block in the .svg image file using a text editor. For example, you would include the following at the beginning of the .svg image file:
<!-- To view or edit the mermaid code, use the following URL: -->
<!-- https://mermaid-js.github.io/mermaid-live-editor/edit/#eyJjb ... <remainder of the URL> -->
The following lists advantages of the Mermaid+SVG method:

- Live editor tool.
- Live editor tool supports the most current Mermaid feature set.
- Employ existing K8s/website methods for handling .svg image files.
- Environment doesn't require Mermaid support.
Be sure to check that your diagram renders properly using the local and Netlify previews.
External tool
Figure 5 outlines the steps to follow for adding a diagram using the External Tool method.
First, use your external tool to create the diagram and save it as an .svg or .png image file. After that, use the same steps as the Mermaid+SVG method for adding .svg image files.
Figure 5. External Tool method steps
The following lists the steps you should follow for adding a diagram using the External Tool method:

- Use your external tool to create a diagram.
- Save the diagram coordinates for contributor access. For example, your tool may offer a link to the diagram image, or you could place the source code file, such as an .xml file, in a public repository for later contributor access.
- Generate and save the diagram as an .svg or .png image file. Download this file to the appropriate ../images/ folder.
- Use the {{< figure >}} shortcode to reference the diagram in the .md file.
- Add a caption using the {{< figure >}} shortcode's caption parameter.
Here is the {{< figure >}} shortcode for the images/apple.svg diagram:
{{< figure src="/static/images/apple.svg" alt="red-apple-figure" class="diagram-large" caption="Figure 9. A Big Red Apple" >}}
If your external drawing tool permits:

- You can incorporate multiple .svg or .png logos, icons, and images into your diagram. However, make sure you observe copyright and follow the Kubernetes documentation guidelines on the use of third party content.
- You should save the diagram source coordinates for later contributor access. For example, your tool may offer a link to the diagram image, or you could place the source code file, such as an .xml file, somewhere for contributor access.
For more information on K8s and CNCF logos and images, check out CNCF Artwork.
The following lists advantages of the External Tool method:

- Contributor familiarity with the external tool.
- Suitable for diagrams that need more detail than Mermaid can offer.
Don't forget to check that your diagram renders correctly using the local and Netlify previews.
Examples
This section shows several examples of Mermaid diagrams.
The code blocks shown here omit the {{< mermaid >}} and {{< /mermaid >}} shortcode tags. This allows you to copy the code block into the live editor to experiment on your own. Note that the live editor doesn't recognize Hugo shortcodes.
Example 1 - Pod topology spread constraints
Figure 6 shows the diagram appearing in the Pod topology spread constraints page.
Figure 6. Pod Topology Spread Constraints.
Code block:
graph TB
subgraph "zoneB"
n3(Node3)
n4(Node4)
end
subgraph "zoneA"
n1(Node1)
n2(Node2)
end
classDef plain fill:#ddd,stroke:#fff,stroke-width:4px,color:#000;
classDef k8s fill:#326ce5,stroke:#fff,stroke-width:4px,color:#fff;
classDef cluster fill:#fff,stroke:#bbb,stroke-width:2px,color:#326ce5;
class n1,n2,n3,n4 k8s;
class zoneA,zoneB cluster;
Example 2 - Ingress
Figure 7 shows the diagram appearing in the What is Ingress page.
Figure 7. Ingress
Code block:
graph LR;
client([client])-. Ingress-managed <br> load balancer .->ingress[Ingress];
ingress-->|routing rule|service[Service];
subgraph cluster
ingress;
service-->pod1[Pod];
service-->pod2[Pod];
end
classDef plain fill:#ddd,stroke:#fff,stroke-width:4px,color:#000;
classDef k8s fill:#326ce5,stroke:#fff,stroke-width:4px,color:#fff;
classDef cluster fill:#fff,stroke:#bbb,stroke-width:2px,color:#326ce5;
class ingress,service,pod1,pod2 k8s;
class client plain;
class cluster cluster;
Example 3 - K8s system flow
Figure 8 depicts a Mermaid sequence diagram showing the system flow between K8s components to start a container.
Code block:
%%{init:{"theme":"neutral"}}%%
sequenceDiagram
actor me
participant apiSrv as control plane<br><br>api-server
participant etcd as control plane<br><br>etcd datastore
participant cntrlMgr as control plane<br><br>controller<br>manager
participant sched as control plane<br><br>scheduler
participant kubelet as node<br><br>kubelet
participant container as node<br><br>container<br>runtime
me->>apiSrv: 1. kubectl create -f pod.yaml
apiSrv-->>etcd: 2. save new state
cntrlMgr->>apiSrv: 3. check for changes
sched->>apiSrv: 4. watch for unassigned pod(s)
apiSrv->>sched: 5. notify about pod w nodename=" "
sched->>apiSrv: 6. assign pod to node
apiSrv-->>etcd: 7. save new state
kubelet->>apiSrv: 8. look for newly assigned pod(s)
apiSrv->>kubelet: 9. bind pod to node
kubelet->>container: 10. start container
kubelet->>apiSrv: 11. update pod status
apiSrv-->>etcd: 12. save new state
How to style diagrams
You can style one or more diagram elements using well-known CSS nomenclature. You accomplish this using two types of statements in the Mermaid code:

- classDef defines a class of style attributes.
- class defines one or more elements to apply the class to.
In the code for figure 7, you can see examples of both.
classDef k8s fill:#326ce5,stroke:#fff,stroke-width:4px,color:#fff; // defines style for the k8s class
class ingress,service,pod1,pod2 k8s; // k8s class is applied to elements ingress, service, pod1 and pod2.
You can include one or multiple classDef and class statements in your diagram. You can also use the official K8s #326ce5 hex color code for K8s components in your diagram.
For more information on styling and classes, see Mermaid Styling and classes docs.
How to use captions
A caption is a brief description of a diagram. A title or a short description of the diagram are examples of captions. Captions aren't meant to replace explanatory text you have in your documentation. Rather, they serve as a "context link" between that text and your diagram.
The combination of some text and a diagram tied together with a caption helps provide a concise representation of the information you wish to convey to the user. Without captions, you are asking the user to scan the text above or below the diagram to figure out its meaning. This can be frustrating for the user.
Figure 9 lays out the three components for proper captioning: diagram, diagram caption and the diagram referral.
Figure 9. Caption Components.
Diagram
The Mermaid+SVG and External Tool methods generate .svg image files. Here is the {{< figure >}} shortcode for a diagram defined in an .svg image file saved to /images/docs/components-of-kubernetes.svg:
{{< figure src="/images/docs/components-of-kubernetes.svg" alt="Kubernetes pod running inside a cluster" class="diagram-large" caption="Figure 4. Kubernetes Architecture Components >}}
You should pass the src, alt, class, and caption values into the {{< figure >}} shortcode. You can adjust the size of the diagram using the diagram-large, diagram-medium, and diagram-small classes.
Diagrams created using the Inline method don't use the {{< figure >}} shortcode. The Mermaid code defines how the diagram will render on your page.
See Methods for creating diagrams for more information on the different methods for creating diagrams.
Diagram Caption
Next, add a diagram caption.
If you define your diagram in an .svg image file, then you should use the {{< figure >}} shortcode's caption parameter.
{{< figure src="/images/docs/components-of-kubernetes.svg" alt="Kubernetes pod running inside a cluster" class="diagram-large" caption="Figure 4. Kubernetes Architecture Components" >}}
If you define your diagram using inline Mermaid code, then you should use Markdown text.
Figure 4. Kubernetes Architecture Components
The following lists several items to consider when adding diagram captions:

- Use the {{< figure >}} shortcode to add a diagram caption for Mermaid+SVG and External Tool diagrams.
- Use simple Markdown text to add a diagram caption for the Inline method.
- Prepend your diagram caption with Figure NUMBER. You must use Figure, and the number must be unique for each diagram in your documentation page. Add a period after the number.
- Add your diagram caption text after the Figure NUMBER. on the same line. You must punctuate the caption with a period. Keep the caption text short.
- Position your diagram caption BELOW your diagram.
Diagram Referral
Finally, you can add a diagram referral. This is used inside your text and should precede the diagram itself. It allows a user to connect your text with the associated diagram. The Figure NUMBER in your referral and caption must match.

You should avoid using spatial references such as "the image below" or "the following figure". Here is an example of a diagram referral:
Figure 10 depicts the components of the Kubernetes architecture.
The control plane ...
Diagram referrals are optional and there are cases where they might not be suitable. If you are not sure, add a diagram referral to your text to see if it looks and sounds okay. When in doubt, use a diagram referral.
Complete picture
Figure 10 shows the Kubernetes Architecture diagram that includes the diagram, diagram caption, and diagram referral. The {{< figure >}} shortcode renders the diagram, adds the caption, and includes the optional link parameter so you can hyperlink the diagram. The diagram referral is contained in this paragraph. Here is the {{< figure >}} shortcode for this diagram:
{{< figure src="/images/docs/components-of-kubernetes.svg" alt="Kubernetes pod running inside a cluster" class="diagram-large" caption="Figure 10. Kubernetes Architecture." link="https://kubernetes.io/docs/concepts/overview/components/" >}}
Tips
- Always use the live editor to create and edit your diagram.
- Always use Hugo local and Netlify previews to check how the diagram appears in the documentation.
- Include diagram source pointers such as a URL, a source code location, or an indication that the code is self-documenting.
- Always use diagram captions.
- It is very helpful to include the diagram .svg or .png image and/or Mermaid source code in issues and PRs.
- With the Mermaid+SVG and External Tool methods, use .svg image files because they stay sharp when you zoom in on the diagram.
- Best practice for .svg files is to load them into an SVG editing tool and use the "Convert text to paths" function. This ensures that the diagram renders the same on all systems, regardless of font availability and font rendering support.
- Mermaid does not support additional icons or artwork.
- Hugo Mermaid shortcodes don't work in the live editor.
- Any time you modify a diagram in the live editor, you must save it to generate a new URL for the diagram.
- Click on the diagrams in this section to view the code and diagram rendering in the live editor.
- Look over the source code of this page, diagram-guide.md, for more examples.
- Check out the Mermaid docs for explanations and examples.

Most important, keep diagrams simple. This saves time for you and fellow contributors, and allows for easier reading by new and experienced users.
7.6.4 - Writing a new topic
This page shows how to create a new topic for the Kubernetes docs.
Before you begin
Create a fork of the Kubernetes documentation repository as described in Open a PR.
Choosing a page type
As you prepare to write a new topic, think about the page type that would fit your content the best:
Type | Description |
---|---|
Concept | A concept page explains some aspect of Kubernetes. For example, a concept page might describe the Kubernetes Deployment object and explain the role it plays as an application while it is deployed, scaled, and updated. Typically, concept pages don't include sequences of steps, but instead provide links to tasks or tutorials. For an example of a concept topic, see Nodes. |
Task | A task page shows how to do a single thing. The idea is to give readers a sequence of steps that they can actually do as they read the page. A task page can be short or long, provided it stays focused on one area. In a task page, it is OK to blend brief explanations with the steps to be performed, but if you need to provide a lengthy explanation, you should do that in a concept topic. Related task and concept topics should link to each other. For an example of a short task page, see Configure a Pod to Use a Volume for Storage. For an example of a longer task page, see Configure Liveness and Readiness Probes |
Tutorial | A tutorial page shows how to accomplish a goal that ties together several Kubernetes features. A tutorial might provide several sequences of steps that readers can actually do as they read the page. Or it might provide explanations of related pieces of code. For example, a tutorial could provide a walkthrough of a code sample. A tutorial can include brief explanations of the Kubernetes features that are being tied together, but should link to related concept topics for deep explanations of individual features. |
Creating a new page
Use a content type for each new page that you write. The docs site provides templates or Hugo archetypes to create new content pages. To create a new type of page, run hugo new with the path to the file you want to create. For example:

hugo new docs/concepts/my-first-concept.md
Choosing a title and filename
Choose a title that has the keywords you want search engines to find. Create a filename that uses the words in your title separated by hyphens. For example, the topic with title Using an HTTP Proxy to Access the Kubernetes API has filename http-proxy-access-api.md. You don't need to put "kubernetes" in the filename, because "kubernetes" is already in the URL for the topic, for example:

/docs/tasks/extend-kubernetes/http-proxy-access-api/
Adding the topic title to the front matter
In your topic, put a title field in the front matter. The front matter is the YAML block that is between the triple-dashed lines at the top of the page. Here's an example:
---
title: Using an HTTP Proxy to Access the Kubernetes API
---
Choosing a directory
Depending on your page type, put your new file in a subdirectory of one of these:
- /content/en/docs/tasks/
- /content/en/docs/tutorials/
- /content/en/docs/concepts/
You can put your file in an existing subdirectory, or you can create a new subdirectory.
Placing your topic in the table of contents
The table of contents is built dynamically using the directory structure of the documentation source. The top-level directories under /content/en/docs/ create top-level navigation, and subdirectories each have entries in the table of contents.

Each subdirectory has a file _index.md, which represents the "home" page for a given subdirectory's content. The _index.md does not need a template. It can contain overview content about the topics in the subdirectory.

Other files in a directory are sorted alphabetically by default. This is almost never the best order. To control the relative sorting of topics in a subdirectory, set the weight: front-matter key to an integer. Typically, we use multiples of 10, to account for adding topics later. For instance, a topic with weight 10 will come before one with weight 20.
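A minimal sketch of such front matter (the title and weights are hypothetical):

```yaml
---
title: My topic
weight: 20   # sorts after topics with weight 10, before weight 30
---
```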
Embedding code in your topic
If you want to include some code in your topic, you can embed the code in your file directly using the markdown code block syntax. This is recommended for the following cases (not an exhaustive list):

- The code shows the output from a command such as `kubectl get deploy mydeployment -o json | jq '.status'`.
- The code is not generic enough for users to try out. As an example, you can embed the YAML file for creating a Pod which depends on a specific FlexVolume implementation.
- The code is an incomplete example because its purpose is to highlight a portion of a larger file. For example, when describing ways to customize the PodSecurityPolicy for some reasons, you can provide a short snippet directly in your topic file.
- The code is not meant for users to try out due to other reasons. For example, when describing how a new attribute should be added to a resource using the kubectl edit command, you can provide a short example that includes only the attribute to add.
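For instance, a hedged sketch of that last case: an embedded fragment that shows only the attribute under discussion, not a complete manifest (the field and value here are hypothetical):

```yaml
# Not a complete resource; only the attribute being discussed.
metadata:
  annotations:
    example.com/my-annotation: "enabled"
```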
Including code from another file
Another way to include code in your topic is to create a new, complete sample file (or group of sample files) and then reference the sample from your topic. Use this method to include sample YAML files when the sample is generic and reusable, and you want the reader to try it out themselves.
When adding a new standalone sample file, such as a YAML file, place the code in one of the <LANG>/examples/ subdirectories, where <LANG> is the language for the topic. In your topic file, use the codenew shortcode:

{{< codenew file="<RELPATH>/my-example-yaml" >}}

where <RELPATH> is the path to the file to include, relative to the examples directory. The following Hugo shortcode references a YAML file located at /content/en/examples/pods/storage/gce-volume.yaml.

{{< codenew file="pods/storage/gce-volume.yaml" >}}
To show raw Hugo shortcodes as in the above example and prevent Hugo from interpreting them, use C-style comments directly after the < and before the > characters. View the code for this page for an example.
Showing how to create an API object from a configuration file
If you need to demonstrate how to create an API object based on a configuration file, place the configuration file in one of the subdirectories under <LANG>/examples.
In your topic, show this command:
kubectl create -f https://k8s.io/examples/pods/storage/gce-volume.yaml
When you add a new YAML file to the <LANG>/examples directory, make sure the file is also included in the <LANG>/examples_test.go file. The Travis CI for the website automatically runs this test case when PRs are submitted to ensure all examples pass the tests.
For an example of a topic that uses this technique, see Running a Single-Instance Stateful Application.
Adding images to a topic
Put image files in the /images directory. The preferred image format is SVG.
What's next
- Learn about using page content types.
- Learn about creating a pull request.
7.6.5 - Page content types
The Kubernetes documentation follows several types of page content:
- Concept
- Task
- Tutorial
- Reference
Content sections
Each page content type contains a number of sections defined by Markdown comments and HTML headings. You can add content headings to your page with the heading shortcode. The comments and headings help maintain the structure of the page content types.
Examples of Markdown comments defining page content sections:
<!-- overview -->
<!-- body -->
To create common headings in your content pages, use the heading shortcode with a heading string.
Examples of heading strings:
- whatsnext
- prerequisites
- objectives
- cleanup
- synopsis
- seealso
- options
For example, to create a whatsnext heading, add the heading shortcode with the "whatsnext" string:

## {{% heading "whatsnext" %}}

You can declare a prerequisites heading as follows:

## {{% heading "prerequisites" %}}
The heading shortcode expects one string parameter. The heading string parameter matches the prefix of a variable in the i18n/<lang>.toml files. For example:

i18n/en.toml:
[whatsnext_heading]
other = "What's next"
i18n/ko.toml:
[whatsnext_heading]
other = "다음 내용"
Content types
Each content type informally defines its expected page structure. Create page content with the suggested page sections.
Concept
A concept page explains some aspect of Kubernetes. For example, a concept page might describe the Kubernetes Deployment object and explain the role it plays as an application once it is deployed, scaled, and updated. Typically, concept pages don't include sequences of steps, but instead provide links to tasks or tutorials.
To write a new concept page, create a Markdown file in a subdirectory of the /content/en/docs/concepts directory, with the following characteristics:
Concept pages are divided into three sections:
Page section |
---|
overview |
body |
whatsnext |
The overview and body sections appear as comments in the concept page. You can add the whatsnext section to your page with the heading shortcode.
Fill each section with content. Follow these guidelines:

- Organize content with H2 and H3 headings.
- For overview, set the topic's context with a single paragraph.
- For body, explain the concept.
- For whatsnext, provide a bulleted list of topics (5 maximum) to learn more about the concept.
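Putting these pieces together, a hedged skeleton of a concept page might look like this (the title and body text are hypothetical):

```markdown
---
title: My concept
---

<!-- overview -->
A single paragraph that sets the topic's context.

<!-- body -->
## Explaining the concept

Organize the explanation with H2 and H3 headings.

## {{% heading "whatsnext" %}}

- A bulleted list of up to five related topics.
```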
Annotations is a published example of a concept page.
Task
A task page shows how to do a single thing, typically by giving a short sequence of steps. Task pages have minimal explanation, but often provide links to conceptual topics that provide related background and knowledge.
To write a new task page, create a Markdown file in a subdirectory of the /content/en/docs/tasks directory, with the following characteristics:
Page section |
---|
overview |
prerequisites |
steps |
discussion |
whatsnext |
The overview, steps, and discussion sections appear as comments in the task page. You can add the prerequisites and whatsnext sections to your page with the heading shortcode.
Within each section, write your content. Use the following guidelines:

- Use a minimum of H2 headings (with two leading # characters). The sections themselves are titled automatically by the template.
- For overview, use a paragraph to set context for the entire topic.
- For prerequisites, use bullet lists when possible. Start adding additional prerequisites below the include. The default prerequisites include a running Kubernetes cluster.
- For steps, use numbered lists.
- For discussion, use normal content to expand upon the information covered in steps.
- For whatsnext, give a bullet list of up to 5 topics the reader might be interested in reading next.
An example of a published task topic is Using an HTTP proxy to access the Kubernetes API.
Tutorial
A tutorial page shows how to accomplish a goal that is larger than a single task. Typically a tutorial page has several sections, each of which has a sequence of steps. For example, a tutorial might provide a walkthrough of a code sample that illustrates a certain feature of Kubernetes. Tutorials can include surface-level explanations, but should link to related concept topics for deep explanations.
To write a new tutorial page, create a Markdown file in a subdirectory of the /content/en/docs/tutorials directory, with the following characteristics:
Page section |
---|
overview |
prerequisites |
objectives |
lessoncontent |
cleanup |
whatsnext |
The overview, objectives, and lessoncontent sections appear as comments in the tutorial page. You can add the prerequisites, cleanup, and whatsnext sections to your page with the heading shortcode.
Within each section, write your content. Use the following guidelines:

- Use a minimum of H2 headings (with two leading # characters). The sections themselves are titled automatically by the template.
- For overview, use a paragraph to set context for the entire topic.
- For prerequisites, use bullet lists when possible. Add additional prerequisites below the ones included by default.
- For objectives, use bullet lists.
- For lessoncontent, use a mix of numbered lists and narrative content as appropriate.
- For cleanup, use numbered lists to describe the steps to clean up the state of the cluster after finishing the task.
- For whatsnext, give a bullet list of up to 5 topics the reader might be interested in reading next.
An example of a published tutorial topic is Running a Stateless Application Using a Deployment.
Reference
A component tool reference page shows the description and flag options output for a Kubernetes component tool. Each page is generated by scripts using the component tool commands.
A tool reference page has several possible sections:
Page section |
---|
synopsis |
options |
options from parent commands |
examples |
seealso |
Examples of published tool reference pages are:
What's next
- Learn about the Style guide
- Learn about the Content guide
- Learn about content organization
7.6.6 - Content organization
This site uses Hugo. In Hugo, content organization is a core concept.

Tip: Start Hugo with hugo server --navigateToChanged for content edit-sessions.
Page Lists
Page Order
The documentation side menu, the documentation page browser etc. are listed using Hugo's default sort order, which sorts by weight (from 1), date (newest first), and finally by the link title.
Given that, if you want to move a page or a section up, set a weight in the page's front matter:
title: My Page
weight: 10
Documentation Main Menu
The Documentation main menu is built from the sections below docs/ with the main_menu flag set in the front matter of the _index.md section content file:

main_menu: true
Note that the link title is fetched from the page's linkTitle, so if you want it to be something different than the title, change it in the content file:
main_menu: true
title: Page Title
linkTitle: Title used in links
Note: These settings go in the _index.md content file in the section folder.
Documentation Side Menu
The documentation side-bar menu is built from the current section tree starting below docs/. It shows all sections and their pages.
If you don't want to list a section or page, set the toc_hide flag to true in the front matter:
toc_hide: true
When you navigate to a section that has content, the specific section or page (for example, _index.md) is shown. Otherwise, the first page inside that section is shown.
Documentation Browser
The page browser on the documentation home page is built using all the sections and pages that are directly below the docs section.
If you don't want to list a section or page, set the toc_hide flag to true in the front matter:
toc_hide: true
The Main Menu
The site links in the top-right menu -- and also in the footer -- are built by page-lookups. This is to make sure that the page actually exists. So, if the case-studies section does not exist in a site (language), it will not be linked to.
Page Bundles
In addition to standalone content pages (Markdown files), Hugo supports Page Bundles.
One example is Custom Hugo Shortcodes. It is considered a leaf bundle. Everything below the directory, including the index.md, will be part of the bundle. This also includes page-relative links, images that can be processed, etc.:
en/docs/home/contribute/includes
├── example1.md
├── example2.md
├── index.md
└── podtemplate.json
Another widely used example is the includes bundle. It sets headless: true in the front matter, which means that it does not get its own URL. It is only used in other pages.
en/includes
├── default-storage-class-prereqs.md
├── index.md
├── partner-script.js
├── partner-style.css
├── task-tutorial-prereqs.md
├── user-guide-content-moved.md
└── user-guide-migration-notice.md
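A minimal sketch of the front matter that such a headless bundle's index.md sets (assuming no other keys are required):

```yaml
---
headless: true
---
```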
Some important notes about the files in the bundles:

- For translated bundles, any missing non-content files will be inherited from languages above. This avoids duplication.
- All the files in a bundle are what Hugo calls Resources, and you can provide metadata per language, such as parameters and title, even if the file format does not support front matter (YAML files etc.). See Page Resources Metadata.
- The value you get from .RelPermalink of a Resource is page-relative. See Permalinks.
Styles
The SASS source of the stylesheets for this site is stored in assets/sass and is automatically built by Hugo.
What's next
- Learn about custom Hugo shortcodes
- Learn about the Style guide
- Learn about the Content guide
7.6.7 - Custom Hugo Shortcodes
This page explains the custom Hugo shortcodes that can be used in Kubernetes Markdown documentation.
Read more about shortcodes in the Hugo documentation.
Feature state
In a Markdown page (.md file) on this site, you can add a shortcode to display the version and state of the documented feature.
Feature state demo
Below is a demo of the feature state snippet, which displays the feature as stable in the latest Kubernetes version.
{{< feature-state state="stable" >}}
Renders to:
Kubernetes v1.23 [stable]
The valid values for state are:
- alpha
- beta
- deprecated
- stable
Feature state code
The displayed Kubernetes version defaults to that of the page or the site. You can change the feature state version by passing the for_k8s_version shortcode parameter. For example:
{{< feature-state for_k8s_version="v1.10" state="beta" >}}
Renders to:
Kubernetes v1.10 [beta]
Glossary
There are two glossary shortcodes: glossary_tooltip and glossary_definition.
You can reference glossary terms with an inclusion that automatically updates and replaces content with the relevant links from our glossary. When the glossary term is moused-over, the glossary entry displays a tooltip. The glossary term also displays as a link.
As well as inclusions with tooltips, you can reuse the definitions from the glossary in page content.
The raw data for glossary terms is stored at the glossary directory, with a content file for each glossary term.
Glossary demo
For example, the following include within the Markdown renders to cluster with a tooltip:
{{< glossary_tooltip text="cluster" term_id="cluster" >}}
Here's a short glossary definition:
{{< glossary_definition prepend="A cluster is" term_id="cluster" length="short" >}}
which renders as:
A cluster is a set of worker machines, called nodes, that run containerized applications. Every cluster has at least one worker node.
You can also include a full definition:
{{< glossary_definition term_id="cluster" length="all" >}}
which renders as:
A set of worker machines, called nodes, that run containerized applications. Every cluster has at least one worker node.
The worker node(s) host the Pods that are the components of the application workload. The control plane manages the worker nodes and the Pods in the cluster. In production environments, the control plane usually runs across multiple computers and a cluster usually runs multiple nodes, providing fault-tolerance and high availability.
Links to API Reference
You can link to a page of the Kubernetes API reference using the
api-reference
shortcode, for example to the
Pod reference:
{{< api-reference page="workload-resources/pod-v1" >}}
The content of the page parameter is the suffix of the URL of the API reference page. You can link to a specific place in a page by specifying an anchor parameter, for example to the PodSpec reference or the environment-variables section of the page:
{{< api-reference page="workload-resources/pod-v1" anchor="PodSpec" >}}
{{< api-reference page="workload-resources/pod-v1" anchor="environment-variables" >}}
You can change the text of the link by specifying a text parameter, for example by linking to the Environment Variables section of the page:
{{< api-reference page="workload-resources/pod-v1" anchor="environment-variables" text="Environment Variable" >}}
Table captions
You can make tables more accessible to screen readers by adding a table caption. To add a caption to a table, enclose the table with a table shortcode and specify the caption with the caption parameter.
Here's an example:
{{< table caption="Configuration parameters" >}}
Parameter | Description | Default
:---------|:------------|:-------
`timeout` | The timeout for requests | `30s`
`logLevel` | The log level for log output | `INFO`
{{< /table >}}
The rendered table looks like this:
Parameter | Description | Default |
---|---|---|
`timeout` | The timeout for requests | `30s` |
`logLevel` | The log level for log output | `INFO` |
If you inspect the HTML for the table, you should see this element immediately after the opening <table> element:
<caption style="display: none;">Configuration parameters</caption>
Tabs
In a markdown page (.md file) on this site, you can add a tab set to display multiple flavors of a given solution.
The tabs shortcode takes these parameters:

- name: The name as shown on the tab.
- codelang: If you provide inner content to the tab shortcode, you can tell Hugo what code language to use for highlighting.
- include: The file to include in the tab. If the tab lives in a Hugo leaf bundle, the file -- which can be any MIME type supported by Hugo -- is looked up in the bundle itself. If not, the content page that needs to be included is looked up relative to the current page. Note that with the include, you do not have any shortcode inner content and must use the self-closing syntax. For example, {{< tab name="Content File #1" include="example1" />}}. The language needs to be specified under codelang or the language is taken based on the file name. Non-content files are code-highlighted by default.
- If your inner content is markdown, you must use the %-delimiter to surround the tab. For example, {{% tab name="Tab 1" %}}This is **markdown**{{% /tab %}}.
- You can combine the variations mentioned above inside a tab set.
Below is a demo of the tabs shortcode.
Note: The name used in a tabs definition must be unique within a content page.
Tabs demo: Code highlighting
{{< tabs name="tab_with_code" >}}
{{{< tab name="Tab 1" codelang="bash" >}}
echo "This is tab 1."
{{< /tab >}}
{{< tab name="Tab 2" codelang="go" >}}
println "This is tab 2."
{{< /tab >}}}
{{< /tabs >}}
Renders to:
echo "This is tab 1."
println "This is tab 2."
Tabs demo: Inline Markdown and HTML
{{< tabs name="tab_with_md" >}}
{{% tab name="Markdown" %}}
This is **some markdown.**
{{< note >}}
It can even contain shortcodes.
{{< /note >}}
{{% /tab %}}
{{< tab name="HTML" >}}
<div>
<h3>Plain HTML</h3>
<p>This is some <i>plain</i> HTML.</p>
</div>
{{< /tab >}}
{{< /tabs >}}
Renders to:
This is some markdown.
Plain HTML
This is some plain HTML.
Tabs demo: File include
{{< tabs name="tab_with_file_include" >}}
{{< tab name="Content File #1" include="example1" />}}
{{< tab name="Content File #2" include="example2" />}}
{{< tab name="JSON File" include="podtemplate" />}}
{{< /tabs >}}
Renders to:
This is an example content file inside the includes leaf bundle.
This is another example content file inside the includes leaf bundle.
{
"apiVersion": "v1",
"kind": "PodTemplate",
"metadata": {
"name": "nginx"
},
"template": {
"metadata": {
"labels": {
"name": "nginx"
},
"generateName": "nginx-"
},
"spec": {
"containers": [{
"name": "nginx",
"image": "dockerfile/nginx",
"ports": [{"containerPort": 80}]
}]
}
}
}
Third party content marker
Running Kubernetes requires third-party software. For example: you usually need to add a DNS server to your cluster so that name resolution works.
When we link to third-party software, or otherwise mention it, we follow the content guide and we also mark those third party items.
Using these shortcodes adds a disclaimer to any documentation page that uses them.
Lists
For a list of several third-party items, add:
{{% thirdparty-content %}}
just below the heading for the section that includes all items.
Items
If you have a list where most of the items refer to in-project software (for example: Kubernetes itself, and the separate Descheduler component), then there is a different form to use.
Add the shortcode:
{{% thirdparty-content single="true" %}}
before the item, or just below the heading for the specific item.
Version strings
To generate a version string for inclusion in the documentation, you can choose from several version shortcodes. Each version shortcode displays a version string derived from the value of a version parameter found in the site configuration file, config.toml. The two most commonly used version parameters are latest and version.
{{< param "version" >}}
The {{< param "version" >}} shortcode generates the value of the current version of the Kubernetes documentation from the version site parameter. The param shortcode accepts the name of one site parameter, in this case: version.
Note: The latest and version parameter values are not equivalent. After a new version is released, latest is incremented and the value of version for the documentation set remains unchanged. For example, a previously released version of the documentation displays version as v1.19 and latest as v1.20.

Renders to:

v1.23

{{< latest-version >}}
The {{< latest-version >}} shortcode returns the value of the latest site parameter.
The latest
site parameter is updated when a new version of the documentation is released.
This parameter does not always match the value of version in a documentation set.

Renders to:

v1.23

{{< latest-semver >}}
The {{< latest-semver >}} shortcode generates the value of latest without the "v" prefix.

Renders to:

1.23

{{< version-check >}}
The {{< version-check >}} shortcode checks if the min-kubernetes-server-version page parameter is present and then uses this value to compare to version.

Renders to:

To check the version, enter kubectl version.
{{< latest-release-notes >}}
The {{< latest-release-notes >}} shortcode generates a version string from latest and removes the "v" prefix. The shortcode prints a new URL for the release note CHANGELOG page with the modified version string.
Renders to:
https://git.k8s.io/kubernetes/CHANGELOG/CHANGELOG-1.23.md

What's next
- Learn about Hugo.
- Learn about writing a new topic.
- Learn about page content types.
- Learn about opening a pull request.
- Learn about advanced contributing.
7.7 - Reference Docs Overview
The topics in this section document how to generate the Kubernetes reference guides.
To build the reference documentation, see the following guide:
7.7.1 - Contributing to the Upstream Kubernetes Code
This page shows how to contribute to the upstream kubernetes/kubernetes project. You can fix bugs found in the Kubernetes API documentation or the content of the Kubernetes components such as kubeadm, kube-apiserver, and kube-controller-manager.
If you instead want to regenerate the reference documentation for the Kubernetes API or the kube-* components from the upstream code, see the following instructions:
- Generating Reference Documentation for the Kubernetes API
- Generating Reference Documentation for the Kubernetes Components and Tools
Before you begin
- You need to have these tools installed:
- Your GOPATH environment variable must be set, and the location of etcd must be in your PATH environment variable.
- You need to know how to create a pull request to a GitHub repository. Typically, this involves creating a fork of the repository. For more information, see Creating a Pull Request and GitHub Standard Fork & Pull Request Workflow.
The big picture
The reference documentation for the Kubernetes API and the `kube-*` components, such as `kube-apiserver` and `kube-controller-manager`, is automatically generated from the source code in the upstream kubernetes/kubernetes project.
When you see bugs in the generated documentation, consider creating a patch to fix them in the upstream project.
Cloning the Kubernetes repository
If you don't already have the kubernetes/kubernetes repository, get it now:
mkdir $GOPATH/src
cd $GOPATH/src
go get github.com/kubernetes/kubernetes
Determine the base directory of your clone of the kubernetes/kubernetes repository. For example, if you followed the preceding step to get the repository, your base directory is `$GOPATH/src/github.com/kubernetes/kubernetes`. The remaining steps refer to your base directory as `<k8s-base>`.

Determine the base directory of your clone of the kubernetes-sigs/reference-docs repository. For example, if you followed the preceding step to get the repository, your base directory is `$GOPATH/src/github.com/kubernetes-sigs/reference-docs`. The remaining steps refer to your base directory as `<rdocs-base>`.
Editing the Kubernetes source code
The Kubernetes API reference documentation is automatically generated from an OpenAPI spec, which is generated from the Kubernetes source code. If you want to change the API reference documentation, the first step is to change one or more comments in the Kubernetes source code.
The documentation for the `kube-*` components is also generated from the upstream source code. You must change the code related to the component you want to fix in order to fix the generated documentation.
Making changes to the upstream source code
Here's an example of editing a comment in the Kubernetes source code.
In your local kubernetes/kubernetes repository, check out the default branch, and make sure it is up to date:
cd <k8s-base>
git checkout master
git pull https://github.com/kubernetes/kubernetes master
Suppose this source file in that default branch has the typo "atmost":
kubernetes/kubernetes/staging/src/k8s.io/api/apps/v1/types.go
In your local environment, open `types.go`, and change "atmost" to "at most".
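As a sketch, you could also make the one-word edit from the command line instead of an editor (GNU `sed` shown; on macOS, use `sed -i ''`):

cd <k8s-base>
sed -i 's/atmost/at most/' staging/src/k8s.io/api/apps/v1/types.go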
Verify that you have changed the file:
git status
The output shows that you are on the master branch, and that the `types.go` source file has been modified:
On branch master
...
modified: staging/src/k8s.io/api/apps/v1/types.go
Committing your edited file
Run `git add` and `git commit` to commit the changes you have made so far. In the next step, you will make a second commit. It is important to keep your changes separated into two commits.
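For example, the first commit might look like this (the commit message is illustrative):

git add staging/src/k8s.io/api/apps/v1/types.go
git commit -m "Fix typo in apps/v1 types.go: atmost -> at most"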
Generating the OpenAPI spec and related files
Go to `<k8s-base>` and run these scripts:
hack/update-generated-swagger-docs.sh
hack/update-openapi-spec.sh
hack/update-generated-protobuf.sh
Run `git status` to see what was generated.
On branch master
...
modified: api/openapi-spec/swagger.json
modified: staging/src/k8s.io/api/apps/v1/generated.proto
modified: staging/src/k8s.io/api/apps/v1/types.go
modified: staging/src/k8s.io/api/apps/v1/types_swagger_doc_generated.go
View the contents of `api/openapi-spec/swagger.json` to make sure the typo is fixed. For example, you could run `git diff -a api/openapi-spec/swagger.json`. This is important, because `swagger.json` is the input to the second stage of the doc generation process.
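As a quick extra check, you could search the spec for the old spelling; no output means the typo is gone:

grep "atmost" api/openapi-spec/swagger.json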
Run `git add` and `git commit` to commit your changes. Now you have two commits: one that contains the edited `types.go` file, and one that contains the generated OpenAPI spec and related files. Keep these two commits separate. That is, do not squash your commits.
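A sketch of the second commit (again, the message is illustrative):

git add api/openapi-spec/swagger.json staging/
git commit -m "Regenerate OpenAPI spec and related files"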
Submit your changes as a pull request to the master branch of the kubernetes/kubernetes repository. Monitor your pull request, and respond to reviewer comments as needed. Continue to monitor your pull request until it is merged.
PR 57758 is an example of a pull request that fixes a typo in the Kubernetes source code.
In this example, the edited file was under the `staging` directory in the kubernetes/kubernetes repository. But in your situation, the `staging` directory might not be the place to find the authoritative source. For guidance, check the README files in the kubernetes/kubernetes repository and in related repositories, such as kubernetes/apiserver.
Cherry picking your commit into a release branch
In the preceding section, you edited a file in the master branch and then ran scripts to generate an OpenAPI spec and related files. Then you submitted your changes in a pull request to the master branch of the kubernetes/kubernetes repository. Now suppose you want to backport your change into a release branch. For example, suppose the master branch is being used to develop Kubernetes version 1.23, and you want to backport your change into the release-1.22 branch.
Recall that your pull request has two commits: one for editing `types.go` and one for the files generated by scripts. The next step is to propose a cherry pick of your first commit into the release-1.22 branch. The idea is to cherry pick the commit that edited `types.go`, but not the commit that has the results of running the scripts. For instructions, see Propose a Cherry Pick.
When you have a pull request in place for cherry picking your one commit into the release-1.22 branch, the next step is to run these scripts in the release-1.22 branch of your local environment.
hack/update-generated-swagger-docs.sh
hack/update-openapi-spec.sh
hack/update-generated-protobuf.sh
hack/update-api-reference-docs.sh
Now add a commit to your cherry-pick pull request that has the recently generated OpenAPI spec and related files. Monitor your pull request until it gets merged into the release-1.22 branch.
At this point, both the master branch and the release-1.22 branch have your updated `types.go` file and a set of generated files that reflect the change you made to `types.go`. Note that the generated OpenAPI spec and other generated files in the release-1.22 branch are not necessarily the same as the generated files in the master branch. The generated files in the release-1.22 branch contain API elements only from Kubernetes 1.22. The generated files in the master branch might contain API elements that are not in 1.22, but are under development for 1.23.
Generating the published reference docs
The preceding section showed how to edit a source file and then generate several files, including `api/openapi-spec/swagger.json`, in the kubernetes/kubernetes repository. The `swagger.json` file is the OpenAPI definition file to use for generating the API reference documentation.
You are now ready to follow the Generating Reference Documentation for the Kubernetes API guide to generate the published Kubernetes API reference documentation.
What's next
7.7.2 - Quickstart
This page shows how to use the `update-imported-docs.py` script to generate the Kubernetes reference documentation. The script automates the build setup and generates the reference documentation for a release.
Before you begin
Requirements:
- You need a machine that is running Linux or macOS.
- You need to have these tools installed:
- Your `PATH` environment variable must include the required build tools, such as the `Go` binary and `python`.
- You need to know how to create a pull request to a GitHub repository. This involves creating your own fork of the repository. For more information, see Work from a local clone.
Getting the docs repository
Make sure your `website` fork is up-to-date with the `kubernetes/website` remote on GitHub (`main` branch), and clone your `website` fork.
mkdir github.com
cd github.com
git clone git@github.com:<your_github_username>/website.git
Determine the base directory of your clone. For example, if you followed the preceding step to get the repository, your base directory is `github.com/website`. The remaining steps refer to your base directory as `<web-base>`.
Overview of update-imported-docs
The `update-imported-docs.py` script is located in the `<web-base>/update-imported-docs/` directory.
The script builds the following references:
- Component and tool reference pages
- The `kubectl` command reference
- The Kubernetes API reference
The `update-imported-docs.py` script generates the Kubernetes reference documentation from the Kubernetes source code. The script creates a temporary directory under `/tmp` on your machine, clones the required repositories (`kubernetes/kubernetes` and `kubernetes-sigs/reference-docs`) into this directory, and sets your `GOPATH` to this temporary directory.
Three additional environment variables are set:
- `K8S_RELEASE`
- `K8S_ROOT`
- `K8S_WEBROOT`
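For illustration only, after setup these variables might hold values along the following lines (the temporary directory is generated at run time, so the exact paths differ):

K8S_RELEASE=1.17
K8S_ROOT=/tmp/update-imported-docs/src/k8s.io/kubernetes
K8S_WEBROOT=<web-base>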
The script requires two arguments to run successfully:
- A YAML configuration file (`reference.yml`)
- A release version, for example: `1.17`
The configuration file contains a `generate-command` field. The `generate-command` field defines a series of build instructions from `kubernetes-sigs/reference-docs/Makefile`. The `K8S_RELEASE` variable determines the version of the release.
The `update-imported-docs.py` script performs the following steps:

- Clones the related repositories specified in a configuration file. For the purpose of generating reference docs, the repository that is cloned by default is `kubernetes-sigs/reference-docs`.
- Runs commands under the cloned repositories to prepare the docs generator and then generates the HTML and Markdown files.
- Copies the generated HTML and Markdown files to a local clone of the `<web-base>` repository under locations specified in the configuration file.
- Updates `kubectl` command links from `kubectl.md` to refer to the sections in the `kubectl` command reference.
When the generated files are in your local clone of the `<web-base>` repository, you can submit them in a pull request to the `kubernetes/website` repository.
Configuration file format
Each configuration file may contain multiple repos that will be imported together. When necessary, you can customize the configuration file by manually editing it. You may create new config files for importing other groups of documents. The following is an example of the YAML configuration file:
repos:
- name: community
remote: https://github.com/kubernetes/community.git
branch: master
files:
- src: contributors/devel/README.md
dst: docs/imported/community/devel.md
- src: contributors/guide/README.md
dst: docs/imported/community/guide.md
Single page Markdown documents, imported by the tool, must adhere to the Documentation Style Guide.
Customizing reference.yml
Open `<web-base>/update-imported-docs/reference.yml` for editing. Do not change the content of the `generate-command` field unless you understand how the command is used to build the references.
You should not need to update `reference.yml`. At times, changes in the upstream source code may require changes to the configuration file (for example: golang version dependencies and third-party library changes). If you encounter build issues, contact the SIG Docs team on the #sig-docs Kubernetes Slack channel.
`generate-command` is an optional entry, which can be used to run a given command or a short script to generate the docs from within a repository.
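For illustration, a repo entry that uses `generate-command` might be shaped like the following sketch; the command body is hypothetical and only shows the form of the field:

repos:
- name: reference-docs
  remote: https://github.com/kubernetes-sigs/reference-docs.git
  branch: master
  generate-command: |
    cd $GOPATH/src/github.com/kubernetes-sigs/reference-docs
    make comp K8S_RELEASE=$K8S_RELEASE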
In `reference.yml`, `files` contains a list of `src` and `dst` fields. The `src` field contains the location of a generated Markdown file in the cloned `kubernetes-sigs/reference-docs` build directory, and the `dst` field specifies where to copy this file in the cloned `kubernetes/website` repository.
For example:
repos:
- name: reference-docs
remote: https://github.com/kubernetes-sigs/reference-docs.git
files:
- src: gen-compdocs/build/kube-apiserver.md
dst: content/en/docs/reference/command-line-tools-reference/kube-apiserver.md
...
Note that when there are many files to be copied from the same source directory to the same destination directory, you can use wildcards in the value given to `src`. You must provide the directory name as the value for `dst`.
For example:
files:
- src: gen-compdocs/build/kubeadm*.md
dst: content/en/docs/reference/setup-tools/kubeadm/generated/
Running the update-imported-docs tool
You can run the `update-imported-docs.py` tool as follows:
cd <web-base>/update-imported-docs
./update-imported-docs.py <configuration-file.yml> <release-version>
For example:
./update-imported-docs.py reference.yml 1.17
Fixing Links
The `release.yml` configuration file contains instructions to fix relative links. To fix relative links within your imported files, set the `gen-absolute-links` property to `true`. You can find an example of this in `release.yml`.
Adding and committing changes in kubernetes/website
List the files that were generated and copied to `<web-base>`:
cd <web-base>
git status
The output shows the new and modified files. The generated output varies depending upon changes made to the upstream source code.
Generated component tool files
content/en/docs/reference/command-line-tools-reference/cloud-controller-manager.md
content/en/docs/reference/command-line-tools-reference/kube-apiserver.md
content/en/docs/reference/command-line-tools-reference/kube-controller-manager.md
content/en/docs/reference/command-line-tools-reference/kube-proxy.md
content/en/docs/reference/command-line-tools-reference/kube-scheduler.md
content/en/docs/reference/setup-tools/kubeadm/generated/kubeadm.md
content/en/docs/reference/kubectl/kubectl.md
Generated kubectl command reference files
static/docs/reference/generated/kubectl/kubectl-commands.html
static/docs/reference/generated/kubectl/navData.js
static/docs/reference/generated/kubectl/scroll.js
static/docs/reference/generated/kubectl/stylesheet.css
static/docs/reference/generated/kubectl/tabvisibility.js
static/docs/reference/generated/kubectl/node_modules/bootstrap/dist/css/bootstrap.min.css
static/docs/reference/generated/kubectl/node_modules/highlight.js/styles/default.css
static/docs/reference/generated/kubectl/node_modules/jquery.scrollto/jquery.scrollTo.min.js
static/docs/reference/generated/kubectl/node_modules/jquery/dist/jquery.min.js
static/docs/reference/generated/kubectl/css/font-awesome.min.css
Generated Kubernetes API reference directories and files
static/docs/reference/generated/kubernetes-api/v1.23/index.html
static/docs/reference/generated/kubernetes-api/v1.23/js/navData.js
static/docs/reference/generated/kubernetes-api/v1.23/js/scroll.js
static/docs/reference/generated/kubernetes-api/v1.23/js/jquery.scrollTo.min.js
static/docs/reference/generated/kubernetes-api/v1.23/css/font-awesome.min.css
static/docs/reference/generated/kubernetes-api/v1.23/css/bootstrap.min.css
static/docs/reference/generated/kubernetes-api/v1.23/css/stylesheet.css
static/docs/reference/generated/kubernetes-api/v1.23/fonts/FontAwesome.otf
static/docs/reference/generated/kubernetes-api/v1.23/fonts/fontawesome-webfont.eot
static/docs/reference/generated/kubernetes-api/v1.23/fonts/fontawesome-webfont.svg
static/docs/reference/generated/kubernetes-api/v1.23/fonts/fontawesome-webfont.ttf
static/docs/reference/generated/kubernetes-api/v1.23/fonts/fontawesome-webfont.woff
static/docs/reference/generated/kubernetes-api/v1.23/fonts/fontawesome-webfont.woff2
Run `git add` and `git commit` to commit the files.
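For example (the paths and commit message are illustrative):

cd <web-base>
git add content/en/docs/reference static/docs/reference/generated
git commit -m "Update generated reference docs"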
Creating a pull request
Create a pull request to the `kubernetes/website` repository. Monitor your pull request, and respond to review comments as needed. Continue to monitor your pull request until it is merged.
A few minutes after your pull request is merged, your updated reference topics will be visible in the published documentation.
What's next
To generate the individual reference documentation by manually setting up the required build repositories and running the build targets, see the following guides:
7.7.3 - Generating Reference Documentation for the Kubernetes API
This page shows how to update the Kubernetes API reference documentation.
The Kubernetes API reference documentation is built from the Kubernetes OpenAPI spec using the kubernetes-sigs/reference-docs generation code.
If you find bugs in the generated documentation, you need to fix them upstream.
If you need only to regenerate the reference documentation from the OpenAPI spec, continue reading this page.
Before you begin
Requirements:
- You need a machine that is running Linux or macOS.
- You need to have these tools installed:
- Your `PATH` environment variable must include the required build tools, such as the `Go` binary and `python`.
- You need to know how to create a pull request to a GitHub repository. This involves creating your own fork of the repository. For more information, see Work from a local clone.
Setting up the local repositories
Create a local workspace and set your `GOPATH`.
mkdir -p $HOME/<workspace>
export GOPATH=$HOME/<workspace>
Get a local clone of the following repositories:
go get -u github.com/kubernetes-sigs/reference-docs
go get -u github.com/go-openapi/loads
go get -u github.com/go-openapi/spec
If you don't already have the kubernetes/website repository, get it now:
git clone https://github.com/<your-username>/website $GOPATH/src/github.com/<your-username>/website
Get a clone of the kubernetes/kubernetes repository as k8s.io/kubernetes:
git clone https://github.com/kubernetes/kubernetes $GOPATH/src/k8s.io/kubernetes
- The base directory of your clone of the kubernetes/kubernetes repository is `$GOPATH/src/k8s.io/kubernetes`. The remaining steps refer to your base directory as `<k8s-base>`.
- The base directory of your clone of the kubernetes/website repository is `$GOPATH/src/github.com/<your username>/website`. The remaining steps refer to your base directory as `<web-base>`.
- The base directory of your clone of the kubernetes-sigs/reference-docs repository is `$GOPATH/src/github.com/kubernetes-sigs/reference-docs`. The remaining steps refer to your base directory as `<rdocs-base>`.
Generating the API reference docs
This section shows how to generate the published Kubernetes API reference documentation.
Setting build variables
- Set `K8S_ROOT` to `<k8s-base>`.
- Set `K8S_WEBROOT` to `<web-base>`.
- Set `K8S_RELEASE` to the version of the docs you want to build. For example, if you want to build docs for Kubernetes 1.17.0, set `K8S_RELEASE` to 1.17.0.
For example:
export K8S_WEBROOT=${GOPATH}/src/github.com/<your-username>/website
export K8S_ROOT=${GOPATH}/src/k8s.io/kubernetes
export K8S_RELEASE=1.17.0
Creating the versioned directory and fetching the OpenAPI spec
The `updateapispec` build target creates the versioned build directory. After the directory is created, the OpenAPI spec is fetched from the `<k8s-base>` repository. These steps ensure that the version of the configuration files and the Kubernetes OpenAPI spec match the release version. The versioned directory name follows the pattern of `v<major>_<minor>`.
In the `<rdocs-base>` directory, run the following build target:
cd <rdocs-base>
make updateapispec
Building the API reference docs
The `copyapi` target builds the API reference and copies the generated files to directories in `<web-base>`. Run the following command in `<rdocs-base>`:
cd <rdocs-base>
make copyapi
Verify that these two files have been generated:
[ -e "<rdocs-base>/gen-apidocs/build/index.html" ] && echo "index.html built" || echo "no index.html"
[ -e "<rdocs-base>/gen-apidocs/build/navData.js" ] && echo "navData.js built" || echo "no navData.js"
Go to the base of your local `<web-base>`, and view which files have been modified:
cd <web-base>
git status
The output is similar to:
static/docs/reference/generated/kubernetes-api/v1.23/css/bootstrap.min.css
static/docs/reference/generated/kubernetes-api/v1.23/css/font-awesome.min.css
static/docs/reference/generated/kubernetes-api/v1.23/css/stylesheet.css
static/docs/reference/generated/kubernetes-api/v1.23/fonts/FontAwesome.otf
static/docs/reference/generated/kubernetes-api/v1.23/fonts/fontawesome-webfont.eot
static/docs/reference/generated/kubernetes-api/v1.23/fonts/fontawesome-webfont.svg
static/docs/reference/generated/kubernetes-api/v1.23/fonts/fontawesome-webfont.ttf
static/docs/reference/generated/kubernetes-api/v1.23/fonts/fontawesome-webfont.woff
static/docs/reference/generated/kubernetes-api/v1.23/fonts/fontawesome-webfont.woff2
static/docs/reference/generated/kubernetes-api/v1.23/index.html
static/docs/reference/generated/kubernetes-api/v1.23/js/jquery.scrollTo.min.js
static/docs/reference/generated/kubernetes-api/v1.23/js/navData.js
static/docs/reference/generated/kubernetes-api/v1.23/js/scroll.js
Updating the API reference index pages
When generating reference documentation for a new release, update the file,
<web-base>/content/en/docs/reference/kubernetes-api/api-index.md
with the new
version number.
- Open `<web-base>/content/en/docs/reference/kubernetes-api/api-index.md` for editing, and update the API reference version number. For example:

  ---
  title: v1.17
  ---

  [Kubernetes API v1.17](/docs/reference/generated/kubernetes-api/v1.17/)

- Open `<web-base>/content/en/docs/reference/_index.md` for editing, and add a new link for the latest API reference. Remove the oldest API reference version, so that there are five links to the most recent API references (see the sketch after this list).
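For example, the link added for a hypothetical v1.17 build might look like:

- [Kubernetes API Reference v1.17](/docs/reference/generated/kubernetes-api/v1.17/)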
Locally test the API reference
Publish a local version of the API reference. Verify the local preview.
cd <web-base>
git submodule update --init --recursive --depth 1 # if not already done
make container-serve
Commit the changes
In `<web-base>`, run `git add` and `git commit` to commit the change.
Submit your changes as a pull request to the kubernetes/website repository. Monitor your pull request, and respond to reviewer comments as needed. Continue to monitor your pull request until it has been merged.
What's next
7.7.4 - Generating Reference Documentation for kubectl Commands
This page shows how to generate the `kubectl` command reference.
Before you begin
Requirements:
- You need a machine that is running Linux or macOS.
- You need to have these tools installed:
- Your `PATH` environment variable must include the required build tools, such as the `Go` binary and `python`.
- You need to know how to create a pull request to a GitHub repository. This involves creating your own fork of the repository. For more information, see Work from a local clone.
Setting up the local repositories
Create a local workspace and set your `GOPATH`.
mkdir -p $HOME/<workspace>
export GOPATH=$HOME/<workspace>
Get a local clone of the following repositories:
go get -u github.com/spf13/pflag
go get -u github.com/spf13/cobra
go get -u gopkg.in/yaml.v2
go get -u github.com/kubernetes-sigs/reference-docs
If you don't already have the kubernetes/website repository, get it now:
git clone https://github.com/<your-username>/website $GOPATH/src/github.com/<your-username>/website
Get a clone of the kubernetes/kubernetes repository as k8s.io/kubernetes:
git clone https://github.com/kubernetes/kubernetes $GOPATH/src/k8s.io/kubernetes
Remove the spf13 package from `$GOPATH/src/k8s.io/kubernetes/vendor/github.com`.
rm -rf $GOPATH/src/k8s.io/kubernetes/vendor/github.com/spf13
The kubernetes/kubernetes repository provides the `kubectl` and `kustomize` source code.
- Determine the base directory of your clone of the kubernetes/kubernetes repository. For example, if you followed the preceding step to get the repository, your base directory is `$GOPATH/src/k8s.io/kubernetes`. The remaining steps refer to your base directory as `<k8s-base>`.
- Determine the base directory of your clone of the kubernetes/website repository. For example, if you followed the preceding step to get the repository, your base directory is `$GOPATH/src/github.com/<your-username>/website`. The remaining steps refer to your base directory as `<web-base>`.
- Determine the base directory of your clone of the kubernetes-sigs/reference-docs repository. For example, if you followed the preceding step to get the repository, your base directory is `$GOPATH/src/github.com/kubernetes-sigs/reference-docs`. The remaining steps refer to your base directory as `<rdocs-base>`.
In your local k8s.io/kubernetes repository, check out the branch of interest, and make sure it is up to date. For example, if you want to generate docs for Kubernetes 1.22.0, you could use these commands:
cd <k8s-base>
git checkout v1.22.0
git pull https://github.com/kubernetes/kubernetes v1.22.0
If you do not need to edit the `kubectl` source code, follow the instructions for Setting build variables.
Editing the kubectl source code
The kubectl command reference documentation is automatically generated from the kubectl source code. If you want to change the reference documentation, the first step is to change one or more comments in the kubectl source code. Make the change in your local kubernetes/kubernetes repository, and then submit a pull request to the master branch of github.com/kubernetes/kubernetes.
PR 56673 is an example of a pull request that fixes a typo in the kubectl source code.
Monitor your pull request, and respond to reviewer comments. Continue to monitor your pull request until it is merged into the target branch of the kubernetes/kubernetes repository.
Cherry picking your change into a release branch
Your change is now in the master branch, which is used for development of the next Kubernetes release. If you want your change to appear in the docs for a Kubernetes version that has already been released, you need to propose that your change be cherry picked into the release branch.
For example, suppose the master branch is being used to develop Kubernetes 1.23 and you want to backport your change to the release-1.22 branch. For instructions on how to do this, see Propose a Cherry Pick.
Monitor your cherry-pick pull request until it is merged into the release branch.
Setting build variables
Go to `<rdocs-base>`. On your command line, set the following environment variables.
- Set `K8S_ROOT` to `<k8s-base>`.
- Set `K8S_WEBROOT` to `<web-base>`.
- Set `K8S_RELEASE` to the version of the docs you want to build. For example, if you want to build docs for Kubernetes 1.22, set `K8S_RELEASE` to 1.22.
For example:
export K8S_WEBROOT=$GOPATH/src/github.com/<your-username>/website
export K8S_ROOT=$GOPATH/src/k8s.io/kubernetes
export K8S_RELEASE=1.22
Creating a versioned directory
The `createversiondirs` build target creates a versioned directory and copies the kubectl reference configuration files to the versioned directory. The versioned directory name follows the pattern of `v<major>_<minor>`.
In the `<rdocs-base>` directory, run the following build target:
cd <rdocs-base>
make createversiondirs
Checking out a release tag in k8s.io/kubernetes
In your local `<k8s-base>` repository, check out the branch that has the version of Kubernetes that you want to document. For example, if you want to generate docs for Kubernetes 1.22.0, check out the `v1.22.0` tag. Make sure your local branch is up to date.
cd <k8s-base>
git checkout v1.22.0
git pull https://github.com/kubernetes/kubernetes v1.22.0
Running the doc generation code
In your local `<rdocs-base>`, run the `copycli` build target. The command runs as `root`:
cd <rdocs-base>
make copycli
The `copycli` command cleans the temporary build directory, generates the kubectl command files, and copies the collated kubectl command reference HTML page and assets to `<web-base>`.
Locate the generated files
Verify that these two files have been generated:
[ -e "<rdocs-base>/gen-kubectldocs/generators/build/index.html" ] && echo "index.html built" || echo "no index.html"
[ -e "<rdocs-base>/gen-kubectldocs/generators/build/navData.js" ] && echo "navData.js built" || echo "no navData.js"
Locate the copied files
Verify that all generated files have been copied to your `<web-base>`:
cd <web-base>
git status
The output should include the modified files:
static/docs/reference/generated/kubectl/kubectl-commands.html
static/docs/reference/generated/kubectl/navData.js
The output may also include:
static/docs/reference/generated/kubectl/scroll.js
static/docs/reference/generated/kubectl/stylesheet.css
static/docs/reference/generated/kubectl/tabvisibility.js
static/docs/reference/generated/kubectl/node_modules/bootstrap/dist/css/bootstrap.min.css
static/docs/reference/generated/kubectl/node_modules/highlight.js/styles/default.css
static/docs/reference/generated/kubectl/node_modules/jquery.scrollto/jquery.scrollTo.min.js
static/docs/reference/generated/kubectl/node_modules/jquery/dist/jquery.min.js
static/docs/reference/generated/kubectl/node_modules/font-awesome/css/font-awesome.min.css
Locally test the documentation
Build the Kubernetes documentation in your local `<web-base>`.
cd <web-base>
git submodule update --init --recursive --depth 1 # if not already done
make container-serve
View the local preview.
Adding and committing changes in kubernetes/website
Run `git add` and `git commit` to commit the files.
Creating a pull request
Create a pull request to the `kubernetes/website` repository. Monitor your pull request, and respond to review comments as needed. Continue to monitor your pull request until it is merged.
A few minutes after your pull request is merged, your updated reference topics will be visible in the published documentation.
What's next
7.7.5 - Generating Reference Pages for Kubernetes Components and Tools
This page shows how to build the Kubernetes component and tool reference pages.
Before you begin
Start with the Prerequisites section in the Reference Documentation Quickstart guide.
Follow the Reference Documentation Quickstart to generate the Kubernetes component and tool reference pages.
What's next
7.8 - Advanced contributing
This page assumes that you understand how to contribute to new content and review others' work, and are ready to learn about more ways to contribute. You need to use the Git command line client and other tools for some of these tasks.
Propose improvements
SIG Docs members can propose improvements.
After you've been contributing to the Kubernetes documentation for a while, you may have ideas for improving the Style Guide, the Content Guide, the toolchain used to build the documentation, the website style, the processes for reviewing and merging pull requests, or other aspects of the documentation. For maximum transparency, these types of proposals need to be discussed in a SIG Docs meeting or on the kubernetes-sig-docs mailing list.
In addition, it can help to have some context about the way things currently work and why past decisions have been made before proposing sweeping changes. The quickest way to get answers to questions about how the documentation currently works is to ask in the `#sig-docs` Slack channel on kubernetes.slack.com.
After the discussion has taken place and the SIG is in agreement about the desired outcome, you can work on the proposed changes in the way that is the most appropriate. For instance, an update to the style guide or the website's functionality might involve opening a pull request, while a change related to documentation testing might involve working with sig-testing.
Coordinate docs for a Kubernetes release
SIG Docs approvers can coordinate docs for a Kubernetes release.
Each Kubernetes release is coordinated by a team of people participating in the sig-release Special Interest Group (SIG). Others on the release team for a given release include an overall release lead, as well as representatives from sig-testing and others. To find out more about Kubernetes release processes, refer to https://github.com/kubernetes/sig-release.
The SIG Docs representative for a given release coordinates the following tasks:
- Monitor the feature-tracking spreadsheet for new or changed features with an impact on documentation. If the documentation for a given feature won't be ready for the release, the feature may not be allowed to go into the release.
- Attend sig-release meetings regularly and give updates on the status of the docs for the release.
- Review and copyedit feature documentation drafted by the SIG responsible for implementing the feature.
- Merge release-related pull requests and maintain the Git feature branch for the release.
- Mentor other SIG Docs contributors who want to learn how to do this role in the future. This is known as "shadowing".
- Publish the documentation changes related to the release when the release artifacts are published.
Coordinating a release is typically a 3-4 month commitment, and the duty is rotated among SIG Docs approvers.
Serve as a New Contributor Ambassador
SIG Docs approvers can serve as New Contributor Ambassadors.
New Contributor Ambassadors welcome new contributors to SIG-Docs, suggest PRs to new contributors, and mentor new contributors through their first few PR submissions.
Responsibilities for New Contributor Ambassadors include:
- Monitoring the #sig-docs Slack channel for questions from new contributors.
- Working with PR wranglers to identify good first issues for new contributors.
- Mentoring new contributors through their first few PRs to the docs repo.
- Helping new contributors create the more complex PRs they need to become Kubernetes members.
- Sponsoring contributors on their path to becoming Kubernetes members.
- Hosting a monthly meeting to help and mentor new contributors.
Current New Contributor Ambassadors are announced at each SIG-Docs meeting and in the Kubernetes #sig-docs channel.
Sponsor a new contributor
SIG Docs reviewers can sponsor new contributors.
After a new contributor has successfully submitted 5 substantive pull requests to one or more Kubernetes repositories, they are eligible to apply for membership in the Kubernetes organization. The contributor's membership needs to be backed by two sponsors who are already reviewers.
New docs contributors can request sponsors by asking in the #sig-docs channel on the Kubernetes Slack instance or on the SIG Docs mailing list. If you feel confident about the applicant's work, you volunteer to sponsor them. When they submit their membership application, reply to the application with a "+1" and include details about why you think the applicant is a good fit for membership in the Kubernetes organization.
Serve as a SIG Co-chair
SIG Docs members can serve a term as a co-chair of SIG Docs.
Prerequisites
A Kubernetes member must meet the following requirements to be a co-chair:
- Understand SIG Docs workflows and tooling: git, Hugo, localization, blog subproject
- Understand how other Kubernetes SIGs and repositories affect the SIG Docs workflow, including: teams in k/org, the process in k/community, plugins in k/test-infra, and the role of SIG Architecture. In addition, understand how the Kubernetes docs release process works.
- Approved by the SIG Docs community either directly or via lazy consensus.
- Commit at least 5 hours per week (and often more) to the role for a minimum of 6 months
Responsibilities
The role of co-chair is one of service: co-chairs build contributor capacity, handle process and policy, schedule and run meetings, schedule PR wranglers, advocate for docs in the Kubernetes community, make sure that docs succeed in Kubernetes release cycles, and keep SIG Docs focused on effective priorities.
Responsibilities include:
- Keep SIG Docs focused on maximizing developer happiness through excellent documentation
- Exemplify the community code of conduct and hold SIG members accountable to it
- Learn and set best practices for the SIG by updating contribution guidelines
- Schedule and run SIG meetings: weekly status updates, quarterly retro/planning sessions, and others as needed
- Schedule and run doc sprints at KubeCon events and other conferences
- Recruit for and advocate on behalf of SIG Docs with the CNCF and its platinum partners, including Google, Oracle, Azure, IBM, and Huawei
- Keep the SIG running smoothly
Running effective meetings
To schedule and run effective meetings, these guidelines show what to do, how to do it, and why.
Uphold the community code of conduct:
- Hold respectful, inclusive discussions with respectful, inclusive language.
Set a clear agenda:
- Set a clear agenda of topics
- Publish the agenda in advance
- For weekly meetings, copy and paste the previous week's notes into the "Past meetings" section of the notes
Collaborate on accurate notes:
- Record the meeting's discussion
- Consider delegating the role of note-taker
Assign action items clearly and accurately:
- Record the action item, who is assigned to it, and the expected completion date
Moderate as needed:
- If discussion strays from the agenda, refocus participants on the current topic
- Make room for different discussion styles while keeping the discussion focused and honoring folks' time
Honor folks' time:
- Begin and end meetings on time.
Use Zoom effectively:
- Familiarize yourself with Zoom guidelines for Kubernetes
- Claim the host role when you log in by entering the host key
Recording meetings on Zoom
When you're ready to start the recording, click Record to Cloud.
When you're ready to stop recording, click Stop.
The video uploads automatically to YouTube.
7.9 - Viewing Site Analytics
This page contains information about the kubernetes.io analytics dashboard.
This dashboard is built using Google Data Studio and shows information collected on kubernetes.io using Google Analytics.
Using the dashboard
By default, the dashboard shows all collected analytics for the past 30 days. Use the date selector to see data from a different date range. Other filtering options allow you to view data based on user location, the device used to access the site, the translation of the docs used, and more.
If you notice an issue with this dashboard, or would like to request any improvements, please open an issue.
8 - Docs smoke test page
This page serves two purposes:
- Demonstrate how the Kubernetes documentation uses Markdown
- Provide a "smoke test" document we can use to test HTML, CSS, and template changes that affect the overall documentation.
Heading levels
The above heading is an H2. The page title renders as an H1. The following sections show H3-H6.
H3
This is in an H3 section.
H4
This is in an H4 section.
H5
This is in an H5 section.
H6
This is in an H6 section.
Inline elements
Inline elements show up within the text of a paragraph, list item, admonition, or other block-level element.
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
Inline text styles
- bold
- italic
- bold italic
- strikethrough
- underline
- underline italic
- underline bold
- underline bold italic
- monospace text
- monospace bold
Lists
Markdown doesn't have strict rules about how to process lists. When we moved from Jekyll to Hugo, we broke some lists. To fix them, keep the following in mind:
- Make sure you indent sub-list items 2 spaces.
- To end a list and start another, you need an HTML comment block on a new line between the lists, flush with the left-hand border. The first list won't end otherwise, no matter how many blank lines you put between it and the second.
Bullet lists
- This is a list item
- This is another list item in the same list
- You can mix `-` and `*`
  - To make a sub-item, indent two spaces.
    - This is a sub-sub-item. Indent two more spaces.
  - Another sub-item.
- This is a new list. With Hugo, you need to use an HTML comment to separate two consecutive lists. The HTML comment needs to be at the left margin.
- Bullet lists can have paragraphs or block elements within them. Indent the content to be the same as the first line of the bullet point. This paragraph and the code block line up with the first `B` in `Bullet` above.

  ls -l

  - And a sub-list after some block-level content
- A bullet list item can contain a numbered list.
  1. Numbered sub-list item 1
  2. Numbered sub-list item 2
Numbered lists
1. This is a list item
2. This is another list item in the same list. The number you use in Markdown does not necessarily correlate to the number in the final output. By convention, we keep them in sync.

Note: For single-digit numbered lists, using two spaces after the period makes interior block-level content line up better along tab-stops.

1. This is a new list. With Hugo, you need to use an HTML comment to separate two consecutive lists. The HTML comment needs to be at the left margin.
2. Numbered lists can have paragraphs or block elements within them. Indent the content to be the same as the first line of the bullet point. This paragraph and the code block line up with the `N` in `Numbered` above.

   ls -l

   - And a sub-list after some block-level content. This is at the same "level" as the paragraph and code block above, despite being indented more.
Tab lists
Tab lists can be used to conditionally display content, e.g., when multiple options must be documented that require distinct instructions or context.
Please select an option.
Tabs may also nest formatting styles.
- Ordered
- (Or unordered)
- Lists
echo 'Tab lists may contain code blocks!'
Header within a tab list
Nested header tags may also be included.
Checklists
Checklists are technically bullet lists, but the bullets are suppressed by CSS.
- This is a checklist item
- This is a selected checklist item
Code blocks
You can create code blocks in two different ways: by indenting the block, or by surrounding it with three back-tick characters on the lines before and after the code block. Only use back-ticks (code fences) for code blocks. Code fences allow you to specify the language of the enclosed code, which enables syntax highlighting. They are also more predictable than indentation.
this is a code block created by back-ticks
The back-tick method has some advantages.
- It works nearly every time
- It is more compact when viewing the source code.
- It allows you to specify what language the code block is in, for syntax highlighting.
- It has a definite ending. Sometimes, the indentation method breaks with languages where spacing is significant, like Python or YAML.
To specify the language for the code block, put it directly after the first grouping of back-ticks:
ls -l
Common languages used in Kubernetes documentation code blocks include:
- `bash` / `shell` (both work the same)
- `go`
- `json`
- `yaml`
- `xml`
- `none` (disables syntax highlighting for the block)
Code blocks containing Hugo shortcodes
To show raw Hugo shortcodes as in the above example and prevent Hugo from interpreting them, use C-style comments directly after the `<` and before the `>` characters. The following example illustrates this (view the Markdown source for this page).
{{< codenew file="pods/storage/gce-volume.yaml" >}}
Links
To format a link, put the link text inside square brackets, followed by the link target in parentheses. Link to Kubernetes.io or Relative link to Kubernetes.io
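For reference, the Markdown source for such a link looks like this:

[Link to Kubernetes.io](https://kubernetes.io/)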
You can also use HTML, but it is not preferred. Link to Kubernetes.io
Images
To format an image, use similar syntax to links, but add a leading `!` character. The square brackets contain the image's alt text. Try to always use alt text so that people using screen readers can get some benefit from the image.
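For example, the Markdown source for an image with alt text looks like this (the file path is hypothetical):

![pencil icon](/images/pencil.png)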
To specify extended attributes, such as width, title, and caption, use the figure shortcode, which is preferred to using an HTML `<img>` tag. Also, if you need the image to also be a hyperlink, use the `link` attribute, rather than wrapping the whole figure in Markdown link syntax as shown below.
Even if you choose not to use the figure shortcode, an image can also be a link. This time the pencil icon links to the Kubernetes website. Outer square brackets enclose the entire image tag, and the link target is in the parentheses at the end.
You can also use HTML for images, but it is not preferred.
Tables
Simple tables have one row per line, and columns are separated by `|` characters. The header is separated from the body by cells containing nothing but at least three `-` characters. For ease of maintenance, try to keep all the cell separators even, even if you need to use extra space.

| Heading cell 1 | Heading cell 2 |
| -------------- | -------------- |
| Body cell 1    | Body cell 2    |
The header is optional. Any text separated by `|` will render as a table.

Markdown tables have a hard time with block-level elements within cells, such as list items, code blocks, or multiple paragraphs. For complex or very wide tables, use HTML instead.

| Heading cell 1 | Heading cell 2 |
| -------------- | -------------- |
| Body cell 1    | Body cell 2    |
Visualizations with Mermaid
You can use Mermaid JS visualizations. The Mermaid JS version is specified in `/layouts/partials/head.html`.
{{< mermaid >}}
graph TD;
A-->B;
A-->C;
B-->D;
C-->D;
{{</ mermaid >}}
Produces:
{{< mermaid >}}
sequenceDiagram
Alice ->> Bob: Hello Bob, how are you?
Bob-->>John: How about you John?
Bob--x Alice: I am good thanks!
Bob-x John: I am good thanks!
Note right of John: Bob thinks a long<br/>long time, so long<br/>that the text does<br/>not fit on a row.
Bob-->Alice: Checking with John...
Alice->John: Yes... John, how are you?
{{</ mermaid >}}
Produces:
More examples from the official docs.
Sidebars and Admonitions
Sidebars and admonitions provide ways to add visual importance to text. Use them sparingly.
Sidebars
A sidebar offsets text visually, but without the visual prominence of admonitions.
This is a sidebar.
You can have paragraphs and block-level elements within a sidebar.
You can even have code blocks.
sudo dmesg
Admonitions
Admonitions (notes, warnings, etc) use Hugo shortcodes.
Notes catch the reader's attention without a sense of urgency.
You can have multiple paragraphs and block-level elements inside an admonition.
| Or | a | table |
Includes
You can add shortcodes to includes. The following example content is rendered from an include:
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use one of these Kubernetes playgrounds: