Kubernetes 1.22: A New Design for Volume Populators

Authors: Ben Swartzlander (NetApp)

Kubernetes v1.22, released earlier this month, introduced a redesigned approach for volume populators. Originally implemented in v1.18, the API suffered from backwards compatibility issues. Kubernetes v1.22 includes a new API field called dataSourceRef that fixes these problems.

Data sources

Earlier Kubernetes releases already added a dataSource field into the PersistentVolumeClaim API, used for cloning volumes and creating volumes from snapshots. You could use the dataSource field when creating a new PVC, referencing either an existing PVC or a VolumeSnapshot in the same namespace. That also modified the normal provisioning process so that instead of yielding an empty volume, the new PVC contained the same data as either the cloned PVC or the cloned VolumeSnapshot.

Volume populators embrace the same design idea, but extend it to any type of object, as long as there exists a custom resource to define the data source, and a populator controller to implement the logic. Initially, the dataSource field was directly extended to allow arbitrary objects, if the AnyVolumeDataSource feature gate was enabled on a cluster. That change unfortunately caused backwards compatibility problems, and so the new dataSourceRef field was born.

In v1.22 if the AnyVolumeDataSource feature gate is enabled, the dataSourceRef field is added, which behaves similarly to the dataSource field except that it allows arbitrary objects to be specified. The API server ensures that the two fields always have the same contents, and neither of them are mutable. The differences is that at creation time dataSource allows only PVCs or VolumeSnapshots, and ignores all other values, while dataSourceRef allows most types of objects, and in the few cases it doesn't allow an object (core objects other than PVCs) a validation error occurs.

When this API change graduates to stable, we would deprecate using dataSource and recommend using dataSourceRef field for all use cases. In the v1.22 release, dataSourceRef is available (as an alpha feature) specifically for cases where you want to use for custom volume populators.

Using populators

Every volume populator must have one or more CRDs that it supports. Administrators may install the CRD and the populator controller and then PVCs with a dataSourceRef specifies a CR of the type that the populator supports will be handled by the populator controller instead of the CSI driver directly.

Underneath the covers, the CSI driver is still invoked to create an empty volume, which the populator controller fills with the appropriate data. The PVC doesn't bind to the PV until it's fully populated, so it's safe to define a whole application manifest including pod and PVC specs and the pods won't begin running until everything is ready, just as if the PVC was a clone of another PVC or VolumeSnapshot.

How it works

PVCs with data sources are still noticed by the external-provisioner sidecar for the related storage class (assuming a CSI provisioner is used), but because the sidecar doesn't understand the data source kind, it doesn't do anything. The populator controller is also watching for PVCs with data sources of a kind that it understands and when it sees one, it creates a temporary PVC of the same size, volume mode, storage class, and even on the same topology (if topology is used) as the original PVC. The populator controller creates a worker pod that attaches to the volume and writes the necessary data to it, then detaches from the volume and the populator controller rebinds the PV from the temporary PVC to the orignal PVC.

Trying it out

The following things are required to use volume populators:

  • Enable the AnyVolumeDataSource feature gate
  • Install a CRD for the specific data source / populator
  • Install the populator controller itself

Populator controllers may use the lib-volume-populator library to do most of the Kubernetes API level work. Individual populators only need to provide logic for actually writing data into the volume based on a particular CR instance. This library provides a sample populator implementation.

These optional components improve user experience:

  • Install the VolumePopulator CRD
  • Create a VolumePopulator custom respource for each specific data source
  • Install the volume data source validator controller (alpha)

The purpose of these components is to generate warning events on PVCs with data sources for which there is no populator.

Putting it all together

To see how this works, you can install the sample "hello" populator and try it out.

First install the volume-data-source-validator controller.

kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/volume-data-source-validator/master/client/config/crd/populator.storage.k8s.io_volumepopulators.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/volume-data-source-validator/master/deploy/kubernetes/rbac-data-source-validator.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/volume-data-source-validator/master/deploy/kubernetes/setup-data-source-validator.yaml

Next install the example populator.

kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/lib-volume-populator/master/example/hello-populator/crd.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/lib-volume-populator/master/example/hello-populator/deploy.yaml

Create an instance of the Hello CR, with some text.

apiVersion: hello.k8s.io/v1alpha1
kind: Hello
metadata:
  name: example-hello
spec:
  fileName: example.txt
  fileContents: Hello, world!

Create a PVC that refers to that CR as its data source.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Mi
  dataSourceRef:
    apiGroup: hello.k8s.io
    kind: Hello
    name: example-hello
  volumeMode: Filesystem

Next, run a job that reads the file in the PVC.

apiVersion: batch/v1
kind: Job
metadata:
  name: example-job
spec:
  template:
    spec:
      containers:
        - name: example-container
          image: busybox:latest
          command:
            - cat
            - /mnt/example.txt
          volumeMounts:
            - name: vol
              mountPath: /mnt
      restartPolicy: Never
      volumes:
        - name: vol
          persistentVolumeClaim:
            claimName: example-pvc

Wait for the job to complete (including all of its dependencies).

kubectl wait --for=condition=Complete job/example-job

And last examine the log from the job.

kubectl logs job/example-job
Hello, world!

Note that the volume already contained a text file with the string contents from the CR. This is only the simplest example. Actual populators can set up the volume to contain arbitrary contents.

How to write your own volume populator

Developers interested in writing new poplators are encouraged to use the lib-volume-populator library and to only supply a small controller wrapper around the library, and a pod image capable of attaching to volumes and writing the appropriate data to the volume.

Individual populators can be extremely generic such that they work with every type of PVC, or they can do vendor specific things to rapidly fill a volume with data if the volume was provisioned by a specific CSI driver from the same vendor, for example, by communicating directly with the storage for that volume.

The future

As this feature is still in alpha, we expect to update the out of tree controllers with more tests and documentation. The community plans to eventually re-implement the populator library as a sidecar, for ease of operations.

We hope to see some official community-supported populators for some widely-shared use cases. Also, we expect that volume populators will be used by backup vendors as a way to "restore" backups to volumes, and possibly a standardized API to do this will evolve.

How can I learn more?

The enhancement proposal, Volume Populators, includes lots of detail about the history and technical implementation of this feature.

Volume populators and data sources, within the documentation topic about persistent volumes, explains how to use this feature in your cluster.

Please get involved by joining the Kubernetes storage SIG to help us enhance this feature. There are a lot of good ideas already and we'd be thrilled to have more!