Merged
41 commits
2cfd3c3
Update release notes and version number
chenopis Oct 2, 2025
cff5c3c
Change to minor version convention vX.Y
chenopis Oct 8, 2025
fb5ca1a
Add note for MIG-backed vGPU with KubeVirt / OpenShift Virtualization
chenopis Oct 9, 2025
bba56fc
Update CDI docs
a-mccarthy Oct 13, 2025
ddbe8c3
Update microk8s re: nvbugs/5541760
chenopis Oct 13, 2025
da41d6c
25.3 Add patch version columns (#278)
chenopis Oct 10, 2025
1a8f5da
Update component matrix
chenopis Oct 13, 2025
9ef4302
Update w/ new microk8s config re: nvbugs/5541760
chenopis Oct 13, 2025
c800968
Fix title overline in cdi.rst
chenopis Oct 13, 2025
b1ff763
Move OCP note to steps-overview.rst
chenopis Oct 13, 2025
1a1f4f1
Update microk8s env vars and descriptions
chenopis Oct 14, 2025
983c1eb
Remove NVIDIA_CONTAINER_RUNTIME_MODES_CDI_ANNOTATION_PREFIXES from ge…
chenopis Oct 14, 2025
dff5253
Update openshift version
chenopis Oct 14, 2025
cd3b657
Update new component version
a-mccarthy Oct 15, 2025
e7e8e68
Update Support Status for Releases
chenopis Oct 17, 2025
8179dd9
update support matrix and release notes
a-mccarthy Oct 20, 2025
6dde110
Add version admonition to Install page and Notes column to Platform S…
chenopis Oct 20, 2025
7cb0448
Use `recommended` variable in kubevirt
chenopis Oct 20, 2025
ffd5a1c
Update wording of version on Install page
chenopis Oct 20, 2025
f516926
update mig profiles on release notes page
a-mccarthy Oct 21, 2025
9f478a3
Update CDI docs for GPU Operator 25.10.0
cdesiniotis Oct 22, 2025
7d611dc
Update the description of the cdi.enabled field
cdesiniotis Oct 22, 2025
ccdd12c
Update CDI bullet in release notes
cdesiniotis Oct 22, 2025
ed3597b
update release notes and support matrix
a-mccarthy Oct 22, 2025
3aaa29b
Update component versions
chenopis Oct 22, 2025
76b5ba6
Revert VGPU_HOST_DRIVER_VERSION example to 580.82.07
chenopis Oct 22, 2025
7736a55
Fix typo -> NVIDIA HGX B300
chenopis Oct 22, 2025
dac477e
Update microk8s install command and custom containerd options
cdesiniotis Oct 22, 2025
ff77a38
add fixed issues and improvements section
a-mccarthy Oct 23, 2025
4434c88
update k8s and opc support
a-mccarthy Oct 23, 2025
412a576
update release notes
a-mccarthy Oct 24, 2025
4dbef79
fix build issue
a-mccarthy Oct 24, 2025
a757db3
add k0s, bump openshift version
a-mccarthy Oct 24, 2025
bc7de3e
update documentation for NLS licensing token use for secret (#285)
shivakunv Oct 24, 2025
3f59ab5
Apply suggestions from code review
a-mccarthy Oct 27, 2025
88bccec
Minor updates for platform support, cdi, and kubevirt
a-mccarthy Oct 27, 2025
5e608fd
add dgx spark to platform support page (#291)
a-mccarthy Oct 27, 2025
0fb4f08
Temporarily remove Confidential Containers documentation (#258)
chenopis Oct 27, 2025
cb221ec
Merge branch 'main' of https://github.com/NVIDIA/cloud-native-docs in…
chenopis Oct 27, 2025
a83d636
Apply suggestions from code review
a-mccarthy Oct 27, 2025
7add3ca
fix incorrect directory
a-mccarthy Oct 27, 2025
2 changes: 1 addition & 1 deletion .github/workflows/docs-build-pr.yaml
@@ -2,7 +2,7 @@ name: docs-build-pr

on:
  pull_request:
    branches: [ main ]
    branches: [ main, release-* ]
    types: [ opened, synchronize ]

env:
4 changes: 2 additions & 2 deletions .github/workflows/docs-build.yaml
@@ -2,15 +2,15 @@ name: docs-build

on:
  push:
    branches: [ main ]
    branches: [ main, release-* ]
    tags:
      - v*
  workflow_dispatch:

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}
  TAG: 0.5.0
  TAG: 0.5.1
  GH_TOKEN: ${{ github.token }}

concurrency:
4 changes: 2 additions & 2 deletions .gitlab-ci.yml
@@ -1,7 +1,7 @@
variables:
  CONTAINER_TEST_IMAGE: "${CI_REGISTRY_IMAGE}:${CI_COMMIT_REF_SLUG}"
  CONTAINER_RELEASE_IMAGE: "${CI_REGISTRY_IMAGE}:0.5.0"
  BUILDER_IMAGE: ghcr.io/nvidia/cloud-native-docs:0.5.0
  CONTAINER_RELEASE_IMAGE: "${CI_REGISTRY_IMAGE}:0.5.1"
  BUILDER_IMAGE: ghcr.io/nvidia/cloud-native-docs:0.5.1
  PUBLISHER_IMAGE: "${CI_REGISTRY_PUBLISHER}/publisher:3.1.0"

stages:
32 changes: 30 additions & 2 deletions README.md
@@ -37,6 +37,8 @@ Refer to <https://github.com/NVIDIA/cloud-native-docs/tags> to find the most rec

1. Build the docs:

Use the alias `build-docs` or the full command:

```bash
./repo docs
```
@@ -52,6 +54,8 @@ Refer to <https://github.com/NVIDIA/cloud-native-docs/tags> to find the most rec

The resulting HTML pages are located in the `_build/docs/.../latest/` directory of your repository clone.

If you are using WSL on Windows, the URL looks like <file://wsl.localhost/Ubuntu/home/username/path/to/repo/cloud-native-docs/_build/docs/gpu-operator/latest/index.html>.
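
One way to preview the built pages locally is to serve the output directory over HTTP; this is a hedged sketch, and the docset path below is illustrative:

```bash
# Serve the built HTML for the gpu-operator docset on http://localhost:8000
python3 -m http.server 8000 --directory _build/docs/gpu-operator/latest
```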

More information about the `repo docs` command is available from
<http://omniverse-docs.s3-website-us-east-1.amazonaws.com/repo_docs/0.20.3/index.html>.

@@ -139,6 +143,20 @@ Always update the openshift docset when there is a new gpu-operator docset versi
The documentation for the older releases is not removed; readers are just
less likely to browse the older releases.

GPU Operator has changed to minor-only version branches.
Consequently, patch releases are documented within the same branch for that minor version.
In the `<component-name>/versions1.json` file, you can use just the first two fields of the semantic version.
For example:

```json
{
  "url": "../25.10",
  "version": "25.10"
},
```

The three most recent minor versions are supported.

### Tagging for Publication

Changes to the default branch are not published on docs.nvidia.com.
@@ -150,11 +168,21 @@ Only tags are published to docs.nvidia.com.
*Example*

```text
gpu-operator-v23.3.1
container-toolkit-v1.17.8
```

The first three fields of the semantic version are used.
For a "do over," push a tag like `gpu-operator-v23.3.1-1`.
For a "do over," push a tag like `container-toolkit-v1.17.8-1`.

For GPU Operator, use only the first two fields of the semantic version.

*Example*

```text
gpu-operator-v25.10
```

For a "do over," push a tag like `gpu-operator-v25.10-2`.
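
As a sketch of the tag-and-push flow (the tag name below is illustrative; substitute the docset and version you are publishing):

```bash
# Create the publication tag on the release commit (illustrative tag name)
git tag gpu-operator-v25.10

# Push the tag so the publication pipeline can pick it up
git push origin gpu-operator-v25.10
```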

Always tag the openshift docset for each new gpu-operator docset release.

21 changes: 20 additions & 1 deletion css/custom.css
@@ -4,4 +4,23 @@
*/
html[data-theme=light] .highlight .go {
  font-style:unset
}
}

.bd-page-width {
  max-width: 176rem;
}

.bd-main {
  flex: 1 1 auto;
}

.bd-main .bd-content .bd-article-container {
  max-width: 100%;
}

.bd-sidebar-secondary {
  /* flex: 0 0 auto; */
  flex-basis: 15%;
  min-width: var(--pst-sidebar-secondary);
}

5 changes: 4 additions & 1 deletion docker/Dockerfile
@@ -21,7 +21,10 @@ RUN --mount=type=bind,source=.,destination=/x,rw /x/tools/packman/python.sh -m p
    -t /tmp/extension \
    sphinx-copybutton \
    nvidia-sphinx-theme \
    pydata-sphinx-theme
    pydata-sphinx-theme \
    linuxdoc

RUN (cd /tmp/extension; tar cf - . ) | (cd /var/tmp/packman/chk/sphinx/4.5.0.2-py3.7-linux-x86_64/; tar xf -)
RUN rm -rf /tmp/extension

RUN --mount=type=bind,target=/work echo 'alias build-docs="./repo docs"' >> ~/.bashrc
189 changes: 18 additions & 171 deletions gpu-operator/cdi.rst
@@ -16,86 +16,46 @@

.. headings # #, * *, =, -, ^, "

######################################################
Container Device Interface Support in the GPU Operator
######################################################
############################################################
Container Device Interface (CDI) Support in the GPU Operator
############################################################

************************************
About the Container Device Interface
************************************

The Container Device Interface (CDI) is a specification for container runtimes
such as cri-o, containerd, and podman that standardizes access to complex
devices like NVIDIA GPUs by the container runtimes.
CDI support is provided by the NVIDIA Container Toolkit and the Operator extends
that support for Kubernetes clusters.
The `Container Device Interface (CDI) <https://github.com/cncf-tags/container-device-interface/blob/main/SPEC.md>`_
is an open specification for container runtimes that abstracts what access to a device, such as an NVIDIA GPU, means
and standardizes that access across runtimes. Container runtimes that support CDI can read and process the specification to
ensure that a device is available in a container. CDI simplifies adding support for devices such as NVIDIA GPUs because
a single specification applies to all of those runtimes.

Starting with GPU Operator v25.10.0, CDI is used by default to enable GPU support in containers running on Kubernetes.
Specifically, CDI support in container runtimes such as containerd and CRI-O is used to inject GPUs into workload
containers. This differs from prior GPU Operator releases, where CDI was used through a CDI-enabled ``nvidia`` runtime class.
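
As a quick check, the generated CDI specifications are typically written to one of the well-known CDI
directories on a GPU node. The following is a sketch only; the exact paths can vary by configuration:

.. code-block:: console

   $ ls /etc/cdi /var/run/cdi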

Use of CDI is transparent to cluster administrators and application developers.
The main benefit of CDI is that it reduces the need to develop and support
runtime-specific plugins.

When CDI is enabled, two runtime classes, nvidia-cdi and nvidia-legacy, become available.
These two runtime classes are in addition to the default runtime class, nvidia.

If you do not set CDI as the default runtime, the runtime resolves to the
legacy runtime mode that the NVIDIA Container Toolkit provides on x86_64
machines or any architecture that has NVML libraries installed.

Optionally, you can specify the runtime class for a workload.
See :ref:`Optional: Specifying the Runtime Class for a Pod` for an example.


Support for Multi-Instance GPU
==============================

Configuring CDI is supported with Multi-Instance GPU (MIG).
Both the ``single`` and ``mixed`` strategies are supported.


Limitations and Restrictions
============================

* CDI is not supported on Red Hat OpenShift Container Platform.
CDI is supported on all other platforms listed in :ref:`Supported Operating Systems and Kubernetes Platforms`.

* Enabling CDI is not supported with Rancher Kubernetes Engine 2 (RKE2).


********************************
Enabling CDI During Installation
********************************

CDI is enabled by default during installation in GPU Operator v25.10.0 and later.
Follow the instructions for installing the Operator with Helm on the :doc:`getting-started` page.

When you install the Operator with Helm, specify the ``--set cdi.enabled=true`` argument.
Optionally, also specify the ``--set cdi.default=true`` argument to use the CDI runtime class by default for all pods.

CDI is also enabled by default during a Helm upgrade to GPU Operator v25.10.0 and later.
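
If you prefer to set the value explicitly rather than rely on the default, a minimal Helm sketch follows
(the release name and namespace are illustrative):

.. code-block:: console

   $ helm install gpu-operator nvidia/gpu-operator \
       --namespace gpu-operator --create-namespace \
       --set cdi.enabled=true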

*******************************
Enabling CDI After Installation
*******************************

.. rubric:: Prerequisites

* You installed version 22.3.0 or newer.
* (Optional) Confirm that the only runtime class is ``nvidia`` by running the following command:

.. code-block:: console

$ kubectl get runtimeclasses

**Example Output**

.. code-block:: output

NAME HANDLER AGE
nvidia nvidia 47h

CDI is enabled by default in GPU Operator v25.10.0 and later.
Use the following procedure to enable CDI if you disabled CDI during installation.

.. rubric:: Procedure

To enable CDI support, perform the following steps:

#. Enable CDI by modifying the cluster policy:

.. code-block:: console
@@ -109,19 +69,6 @@ To enable CDI support, perform the following steps:

clusterpolicy.nvidia.com/cluster-policy patched

#. (Optional) Set the default container runtime mode to CDI by modifying the cluster policy:

.. code-block:: console

$ kubectl patch clusterpolicies.nvidia.com/cluster-policy --type='json' \
-p='[{"op": "replace", "path": "/spec/cdi/default", "value":true}]'

*Example Output*

.. code-block:: output

clusterpolicy.nvidia.com/cluster-policy patched

#. (Optional) Confirm that the container toolkit and device plugin pods restart:

.. code-block:: console
@@ -134,23 +81,13 @@
:language: output
:emphasize-lines: 6,9

#. Verify that the runtime classes include nvidia-cdi and nvidia-legacy:

.. code-block:: console

$ kubectl get runtimeclasses

*Example Output*

.. literalinclude:: ./manifests/output/cdi-verify-get-runtime-classes.txt
:language: output


*************
Disabling CDI
*************

To disable CDI support, perform the following steps:
While CDI is the default and recommended mechanism for injecting GPU support into containers, you can
disable CDI and fall back to the legacy NVIDIA Container Toolkit stack by using the following procedure:

#. If your nodes use the CRI-O container runtime, then temporarily disable the
GPU Operator validator:
@@ -188,93 +125,3 @@ To disable CDI support, perform the following steps:
nvidia.com/gpu.deploy.operator-validator=true \
nvidia.com/gpu.present=true \
--overwrite

#. (Optional) Verify that the ``nvidia-cdi`` and ``nvidia-legacy`` runtime classes
are no longer available:

.. code-block:: console

$ kubectl get runtimeclass

*Example Output*

.. code-block:: output

NAME HANDLER AGE
nvidia nvidia 11d


************************************************
Optional: Specifying the Runtime Class for a Pod
************************************************

If you enabled CDI mode for the default container runtime, then no action is required to use CDI.
However, you can use the following procedure to specify the legacy mode for a workload if you experience trouble.

If you did not enable CDI mode for the default container runtime, then you can
use the following procedure to verify that CDI is enabled and as a
routine practice to use the CDI mode of the container runtime.

#. Create a file, such as ``cuda-vectoradd-cdi.yaml``, with contents like the following example:

.. literalinclude:: ./manifests/input/cuda-vectoradd-cdi.yaml
:language: yaml
:emphasize-lines: 7

As an alternative, specify ``nvidia-legacy`` to use the legacy mode of the container runtime.

#. (Optional) Create a temporary namespace:

.. code-block:: console

$ kubectl create ns demo

*Example Output*

.. code-block:: output

namespace/demo created

#. Start the pod:

.. code-block:: console

$ kubectl apply -n demo -f cuda-vectoradd-cdi.yaml

*Example Output*

.. code-block:: output

pod/cuda-vectoradd created

#. View the logs from the pod:

.. code-block:: console

$ kubectl logs -n demo cuda-vectoradd

*Example Output*

.. literalinclude:: ./manifests/output/common-cuda-vectoradd-logs.txt
:language: output

#. Delete the temporary namespace:

.. code-block:: console

$ kubectl delete ns demo

*Example Output*

.. code-block:: output

namespace "demo" deleted


*******************
Related Information
*******************

* For more information about CDI, see the container device interface
`repository <https://github.com/container-orchestrated-devices/container-device-interface>`_
on GitHub.