diff --git a/confidential-containers/attestation.rst b/confidential-containers/attestation.rst
index 56528dbc5..70999ee37 100644
--- a/confidential-containers/attestation.rst
+++ b/confidential-containers/attestation.rst
@@ -69,8 +69,7 @@ Provision Trustee
 Trustee is an open-source framework used in Confidential Containers to verify attestation evidence and conditionally release secrets.
 For a full overview of attestation with Trustee, refer to the upstream `Trustee documentation `_.
 
-To provision a Trustee instance, follow the upstream `Install Trustee in Docker `_ guide.
-This is the recommended install method.
+To provision a Trustee instance, follow the recommended upstream `Install Trustee in Docker `_ guide.
 
 .. note::
 
@@ -86,20 +85,21 @@ After you complete installation, Trustee is configured to use the NVIDIA Remote
 Configure Workloads for Attestation
 ====================================
 
-To enable attestation for your workloads, point them to the Trustee network endpoint, sometimes referred to as the Key Broker Service (KBS) endpoint, by adding the following annotation to your workload pod spec:
+To enable attestation for your workloads, point them to the Trustee network endpoint, also called the Key Broker Service (KBS) endpoint, by adding the following annotation to your workload pod spec:
 
 .. code-block:: yaml
 
    io.katacontainers.config.hypervisor.kernel_params: "agent.aa_kbc_params=cc_kbc::http://:"
 
-Replace ```` with the IP address or hostname at which your Trustee instance is reachable from the worker nodes, and ```` with the port (default: ``8080``).
+Replace ```` with the IP address or hostname at which your Trustee instance is reachable from the worker nodes.
+Replace ```` with the port that Trustee listens on (default: ``8080``).
 
 Refer to the upstream `Setup Confidential Containers `_ documentation for more information on configuring workloads for attestation.
 
 .. _customize-attestation:
 
-Customize Attestation Workflows
-===============================
+Optional: Customize Attestation Workflows
+=========================================
 
 After Trustee is provisioned and workloads are configured, you can customize attestation workflows to enforce your desired security policies.
 This can include configuring the following:
@@ -108,7 +108,7 @@ This can include configuring the following:
   Refer to the upstream documentation on `using the KBS Client Tool `_.
 * Configure resources: Create resources, or secrets, that your workloads need.
   Refer to the upstream `Confidential Containers resources `_ documentation for more information on the resources.
-* Configure policies: Confidential Containers uses different policy types to secure workload at different layers.
+* Configure policies: Confidential Containers uses different policies to secure workloads at different layers.
   Refer to the upstream `Confidential Containers policy `_ documentation for more information on the policy types and configuring policies.
 
 Refer to the upstream `Confidential Containers Features `_ documentation for a full list of attestation features and how to configure them.
@@ -122,6 +122,5 @@ Use the Trustee log to diagnose the attestation process.
 Next Steps
 ==========
 
-* Refer to the :doc:`deployment guide ` for Confidential Containers setup instructions.
-* Refer to the upstream `Confidential Containers Features `_ documentation for a complete list of attestation-dependent features.
-* Refer to the `NVIDIA Confidential Computing documentation `_ for additional information.
+* Refer to the upstream `Confidential Containers Features `_ for complete documentation on attestation features.
+* If you haven't already, refer to the :doc:`Confidential Containers deployment guide ` to configure your environment for confidential workloads.
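For context on the hunk above: the KBS annotation belongs under the pod's ``metadata.annotations``. The following is a minimal sketch of a full pod spec; the pod name, container image, and the ``10.0.0.5:8080`` endpoint are illustrative placeholders, not values from this guide:

```yaml
# Sketch of a pod spec carrying the KBS annotation. The name, image, and
# the 10.0.0.5:8080 Trustee endpoint are placeholders -- substitute your own.
apiVersion: v1
kind: Pod
metadata:
  name: attested-workload               # placeholder name
  annotations:
    io.katacontainers.config.hypervisor.kernel_params: "agent.aa_kbc_params=cc_kbc::http://10.0.0.5:8080"
spec:
  runtimeClassName: kata-qemu-nvidia-gpu-snp   # or kata-qemu-nvidia-gpu-tdx
  containers:
  - name: app
    image: registry.example.com/app:latest     # placeholder image
```

The annotation is read by the Kata runtime at pod creation, so it must be set per workload rather than cluster-wide.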
diff --git a/confidential-containers/confidential-containers-deploy.rst b/confidential-containers/confidential-containers-deploy.rst
index d48fe8797..51df262b4 100644
--- a/confidential-containers/confidential-containers-deploy.rst
+++ b/confidential-containers/confidential-containers-deploy.rst
@@ -162,7 +162,7 @@ Kubernetes Cluster
   * ``RuntimeClassInImageCriApi``: Alpha since Kubernetes v1.29 and is not enabled by default.
     This feature gate is required to support pod deployments that use multiple snapshotters side-by-side.
 
-  Add both feature gates to your Kubelet configuration (typically ``/var/lib/kubelet/config.yaml``):
+  Add both feature gates to your Kubelet configuration (typically ``/var/lib/kubelet/config.yaml``, edited with root privileges):
 
   .. code-block:: yaml
 
@@ -180,6 +180,33 @@ Kubernetes Cluster
 
      $ sudo systemctl restart kubelet
 
+.. _configure-image-pull-timeouts:
+
+* Increase the kubelet image pull timeout to 20 minutes to avoid timeouts on large image pulls.
+  The kubelet can deallocate your pod if the image pull exceeds the configured timeout before the container transitions to the running state.
+  This is more likely to happen when using large images.
+
+  Increase ``runtimeRequestTimeout`` in your `kubelet configuration `_ to ``20m`` to match the default values for the Kata shim configurations in Kata Containers.
+
+  Add or update the ``runtimeRequestTimeout`` field in your kubelet configuration (typically ``/var/lib/kubelet/config.yaml``):
+
+  .. code-block:: yaml
+     :emphasize-lines: 3
+
+     apiVersion: kubelet.config.k8s.io/v1beta1
+     kind: KubeletConfiguration
+     runtimeRequestTimeout: 20m
+
+  Restart the kubelet service to apply the change:
+
+  .. code-block:: console
+
+     $ sudo systemctl restart kubelet
+
+  If you need a timeout of more than 1200 seconds (20 minutes), you must also adjust the Kata Agent's ``image_pull_timeout``, which defaults to 1200 seconds.
+  This setting also sets the confidential data hub's image pull API timeout in seconds.
+  To do this, add the ``agent.image_pull_timeout`` kernel parameter to your shim configuration, or pass an explicit value in the ``io.katacontainers.config.hypervisor.kernel_params: "..."`` pod annotation.
+
 .. _installation-and-configuration:
 
 Installation
@@ -461,7 +488,7 @@ For further configuration settings, refer to the following sections:
 Run a Sample Workload
 =====================
 
-A pod manifest for a confidential container GPU workload requires that you specify the ``kata-qemu-nvidia-gpu-snp`` runtime class for SEV-SNP or ``kata-qemu-nvidia-gpu-tdx`` for TDX.
+A pod manifest for a confidential container GPU workload requires that you specify the ``kata-qemu-nvidia-gpu-snp`` runtime class for AMD-based systems or ``kata-qemu-nvidia-gpu-tdx`` for Intel-based systems.
 
 1. Create a file, such as the following ``cuda-vectoradd-kata.yaml`` sample, specifying the kata-qemu-nvidia-gpu-snp runtime class:
 
@@ -474,35 +501,37 @@ A pod manifest for a confidential container GPU workload requires that you speci
        name: cuda-vectoradd-kata
        namespace: default
      spec:
-       runtimeClassName: kata-qemu-nvidia-gpu-snp
+       runtimeClassName: kata-qemu-nvidia-gpu-snp # or kata-qemu-nvidia-gpu-tdx
        restartPolicy: Never
        containers:
        - name: cuda-vectoradd
         image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0-ubuntu22.04"
         resources:
           limits:
-            nvidia.com/pgpu: "1"
+            nvidia.com/pgpu: "1" # for single GPU passthrough
             memory: 16Gi
 
   The following are Confidential Containers configurations in the sample manifest:
 
-  * Set the runtime class to ``kata-qemu-nvidia-gpu-snp`` for SEV-SNP or ``kata-qemu-nvidia-gpu-tdx`` for TDX, depending on the node type where the workloads should run.
+  * Set the runtime class to ``kata-qemu-nvidia-gpu-snp`` for AMD-based systems or ``kata-qemu-nvidia-gpu-tdx`` for Intel-based systems, depending on the node type where the workloads should run.
   * In the sample above, ``nvidia.com/pgpu`` is the default resource type for GPUs.
     If you are deploying on a heterogeneous cluster, you might want to update the default behavior by specifying the ``P_GPU_ALIAS`` environment variable for the Kata device plugin.
     Refer to the :ref:`Configuring GPU or NVSwitch Resource Types Name ` section on this page for more details.
-  * If you have machines that support multi-GPU passthrough, use a pod deployment manifest that specifies 8 PGPU and 4 NVSwitch resources.
+  * If you have machines that support multi-GPU passthrough, use a pod deployment manifest that specifies 8 PGPU resources.
+    If you are using NVIDIA Hopper GPUs with PPCIE mode, also specify 4 NVSwitch resources.
 
     .. code-block:: yaml
 
        resources:
         limits:
           nvidia.com/pgpu: "8"
-          nvidia.com/nvswitch: "4"
+          nvidia.com/nvswitch: "4" # Only for NVIDIA Hopper GPUs with PPCIE mode
 
   .. note::
 
-     If you are using NVIDIA Hopper GPUs for multi-GPU passthrough, also refer to :ref:`Managing the Confidential Computing Mode ` for details on how to set the ``ppcie`` mode.
+     If you are using NVIDIA Hopper GPUs for multi-GPU passthrough, you must also set the Confidential Computing mode to ``ppcie``.
+     Refer to :ref:`Managing the Confidential Computing Mode ` for details.
 
 2. Create the pod:
 
@@ -555,6 +584,7 @@ A pod manifest for a confidential container GPU workload requires that you speci
 
      $ kubectl delete -f cuda-vectoradd-kata.yaml
 
+
 .. _coco-configuration-settings:
 
 Common GPU Operator Configuration Settings
@@ -664,7 +694,6 @@ When you change the mode, the manager performs the following actions:
 
   However, the manager does not drain user workloads.
   You must make sure that no user workloads are running on the node before you change the mode.
 
-* Unbinds the GPU from the VFIO PCI device driver.
 * Changes the mode and resets the GPU.
 * Reschedules the other GPU Operator operands.
@@ -807,44 +836,10 @@ Refer to the :ref:`Managing the Confidential Computing Mode
-Increase ``runtimeRequestTimeout`` in your `kubelet configuration `_ to ``20m`` to match the default values for the NVIDIA shim configurations in Kata Containers.
-
-Add or update the ``runtimeRequestTimeout`` field in your kubelet configuration (typically ``/var/lib/kubelet/config.yaml``):
-
-.. code-block:: yaml
-   :emphasize-lines: 3
-
-   apiVersion: kubelet.config.k8s.io/v1beta1
-   kind: KubeletConfiguration
-   runtimeRequestTimeout: 20m
-
-Restart the kubelet service to apply the change:
-
-.. code-block:: console
-
-   $ sudo systemctl restart kubelet
-
-Additional timeouts to consider updating are the NVIDIA Shim and Kata Agent Policy timeouts.
-The NVIDIA shim configurations in Kata Containers use a default ``create_container_timeout`` of 1200 seconds (20 minutes).
-This controls the time the shim allows for a container to remain in container creating state.
-
-If you need a timeout of more than 1200 seconds, you will also need to adjust Kata Agent Policy's ``image_pull_timeout`` value which controls the agent-side timeout for guest-image pull.
-To do this, add the ``agent.image_pull_timeout`` kernel parameter to your shim configuration, or pass an explicit value in a pod annotation in the ``io.katacontainers.config.hypervisor.kernel_params: "..."`` annotation.
-
 
 Next Steps
 ==========
 
 * Refer to the :doc:`Attestation ` page for more information on configuring attestation.
 * To help manage the lifecycle of Kata Containers, install the `Kata Lifecycle Manager `_.
   This Argo Workflows-based tool manages Kata Containers upgrades and day-two operations.
-* Refer to the `NVIDIA Confidential Computing documentation `_ for additional information.
 * Licensing information is available on the :doc:`Licensing ` page.
\ No newline at end of file
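The timeout section moved by this diff raises the Kata agent's image pull timeout through the ``io.katacontainers.config.hypervisor.kernel_params`` pod annotation. As a quick sanity check outside a cluster, the sketch below renders such an annotation fragment and extracts the configured value with ``grep``; the 1800-second value and the ``/tmp`` path are illustrative assumptions, not values from this guide:

```shell
# Sketch: write a pod-metadata fragment carrying the Kata agent timeout
# annotation, then pull the kernel parameter back out (e.g. to verify a
# rendered manifest). The 1800 s value and /tmp path are assumptions.
cat > /tmp/pod-annotation-sample.yaml <<'EOF'
metadata:
  annotations:
    io.katacontainers.config.hypervisor.kernel_params: "agent.image_pull_timeout=1800"
EOF

grep -o 'agent\.image_pull_timeout=[0-9]*' /tmp/pod-annotation-sample.yaml
```

In a real manifest the annotation sits under the pod's metadata, and per the moved section, values above the 1200-second default should be paired with a matching kubelet ``runtimeRequestTimeout``.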