6 changes: 3 additions & 3 deletions container-toolkit/arch-overview.md
@@ -78,7 +78,7 @@ This component is included in the `nvidia-container-toolkit` package.

This component includes an executable that implements the interface required by a `runC` `prestart` hook. This script is invoked by `runC`
after a container has been created, but before it has been started, and is given access to the `config.json` associated with the container
(e.g. this [config.json](https://github.com/opencontainers/runtime-spec/blob/master/config.md#configuration-schema-example=) ). It then takes
(such as this [config.json](https://github.com/opencontainers/runtime-spec/blob/master/config.md#configuration-schema-example=) ). It then takes
information contained in the `config.json` and uses it to invoke the `nvidia-container-cli` CLI with an appropriate set of flags. One of the
most important flags specifies which GPU devices should be injected into the container.
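
For illustration, the resulting invocation might look something like the following sketch, where the device index, PID, and rootfs path are placeholders and the exact flags depend on the container's `config.json`:

```console
$ nvidia-container-cli --load-kmods configure --device=0 --compute --utility --pid=12345 /path/to/container/rootfs
```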

@@ -111,7 +111,7 @@ To use Kubernetes with Docker, you need to configure the Docker `daemon.json` to
a reference to the NVIDIA Container Runtime and set this runtime as the default. The NVIDIA Container Toolkit contains a utility to update this file
as highlighted in the `docker`-specific installation instructions.
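
A sketch of how that utility is typically invoked (assuming the `nvidia-ctk` CLI is already installed; Docker must be restarted for the change to take effect):

```console
$ sudo nvidia-ctk runtime configure --runtime=docker --set-as-default
$ sudo systemctl restart docker
```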

See the {doc}`install-guide` for more information on installing the NVIDIA Container Toolkit on various Linux distributions.
Refer to the {doc}`install-guide` for more information on installing the NVIDIA Container Toolkit on various Linux distributions.

### Package Repository

@@ -130,7 +130,7 @@ For the different components:

:::{note}
As of the release of version `1.6.0` of the NVIDIA Container Toolkit the packages for all components are
published to the `libnvidia-container` `repository <https://nvidia.github.io/libnvidia-container/>` listed above. For older package versions please see the documentation archives.
published to the `libnvidia-container` `repository <https://nvidia.github.io/libnvidia-container/>` listed above. For older package versions refer to the documentation archives.
:::

Releases of the software are also hosted on the `experimental` branch of the repository and are graduated to `stable` after testing and validation. To get access to the latest
151 changes: 116 additions & 35 deletions container-toolkit/cdi-support.md
@@ -1,6 +1,7 @@
% Date: November 11 2022

% Author: elezar
% Author: elezar ([email protected])
% Author: ArangoGutierrez ([email protected])

% headings (h1/h2/h3/h4/h5) are # * = -

@@ -29,54 +30,134 @@ CDI also improves the compatibility of the NVIDIA container stack with certain f

- You installed an NVIDIA GPU Driver.

### Procedure
### Automatic CDI Specification Generation

Two common locations for CDI specifications are `/etc/cdi/` and `/var/run/cdi/`.
The contents of the `/var/run/cdi/` directory are cleared on boot.
As of NVIDIA Container Toolkit `v1.18.0`, the CDI specification is automatically generated and updated by a systemd service called `nvidia-cdi-refresh`. This service:

However, the path to create and use can depend on the container engine that you use.
- Automatically generates the CDI specification at `/var/run/cdi/nvidia.yaml` when:
- The NVIDIA Container Toolkit is installed or upgraded
- The NVIDIA GPU drivers are installed or upgraded
- The system is rebooted

1. Generate the CDI specification file:
This ensures that the CDI specifications are up to date for the current driver
and device configuration and that the CDI Devices defined in these specifications are
available when using native CDI support in container engines such as Docker or Podman.
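
To confirm that the specification has been generated, you can check for the file at the location mentioned above:

```console
$ sudo ls -l /var/run/cdi/nvidia.yaml
```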

```console
$ sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
```

The sample command uses `sudo` to ensure that the file at `/etc/cdi/nvidia.yaml` is created.
You can omit the `--output` argument to print the generated specification to `STDOUT`.
Running the following command lists the available CDI Devices:
```console
$ nvidia-ctk cdi list
```

*Example Output*
#### Known Limitations

The `nvidia-cdi-refresh` service does not currently handle the following situations:

```output
INFO[0000] Auto-detected mode as "nvml"
INFO[0000] Selecting /dev/nvidia0 as /dev/nvidia0
INFO[0000] Selecting /dev/dri/card1 as /dev/dri/card1
INFO[0000] Selecting /dev/dri/renderD128 as /dev/dri/renderD128
INFO[0000] Using driver version xxx.xxx.xx
...
```
- The removal of NVIDIA GPU drivers
- The reconfiguration of MIG devices

1. (Optional) Check the names of the generated devices:
For these scenarios, the regeneration of CDI specifications must be [manually triggered](#manual-cdi-specification-generation).

```console
$ nvidia-ctk cdi list
```
#### Customizing the Automatic CDI Refresh Service

You can customize the behavior of the `nvidia-cdi-refresh` service by adding
environment variables to `/etc/nvidia-container-toolkit/cdi-refresh.env`; these
variables affect the `nvidia-ctk cdi generate` command that the service runs.

The following example output is for a machine with a single GPU that does not support MIG.
For example, to enable debug logging, update the configuration file as follows:
```bash
# /etc/nvidia-container-toolkit/cdi-refresh.env
NVIDIA_CTK_DEBUG=1
```

```output
INFO[0000] Found 9 CDI devices
nvidia.com/gpu=all
nvidia.com/gpu=0
```
For a complete list of available environment variables, refer to the output of `nvidia-ctk cdi generate --help`.

```{important}
You must generate a new CDI specification after any of the following changes:
Modifications to the environment file require a systemd daemon reload and a restart of the
service to take effect.
```

```console
$ sudo systemctl daemon-reload
$ sudo systemctl restart nvidia-cdi-refresh.service
```

#### Managing the CDI Refresh Service

The `nvidia-cdi-refresh` service consists of two systemd units:

- `nvidia-cdi-refresh.path`: Monitors for changes to the system and triggers the service.
- `nvidia-cdi-refresh.service`: Generates the CDI specifications for all available devices based on
the default configuration and any overrides in the environment file.

These units can be managed using standard systemd commands.

When working as expected, the `nvidia-cdi-refresh.path` unit will be enabled and active, and the
`nvidia-cdi-refresh.service` unit will be enabled and will have run at least once. For example:

```console
$ sudo systemctl status nvidia-cdi-refresh.path
● nvidia-cdi-refresh.path - Trigger CDI refresh on NVIDIA driver install / uninstall events
Loaded: loaded (/etc/systemd/system/nvidia-cdi-refresh.path; enabled; preset: enabled)
Active: active (waiting) since Fri 2025-06-27 06:04:54 EDT; 1h 47min ago
Triggers: ● nvidia-cdi-refresh.service
```

```console
$ sudo systemctl status nvidia-cdi-refresh.service
○ nvidia-cdi-refresh.service - Refresh NVIDIA CDI specification file
Loaded: loaded (/etc/systemd/system/nvidia-cdi-refresh.service; enabled; preset: enabled)
Active: inactive (dead) since Fri 2025-06-27 07:17:26 EDT; 34min ago
TriggeredBy: ● nvidia-cdi-refresh.path
Process: 1317511 ExecStart=/usr/bin/nvidia-ctk cdi generate --output=/var/run/cdi/nvidia.yaml (code=exited, status=0/SUCCESS)
Main PID: 1317511 (code=exited, status=0/SUCCESS)
CPU: 562ms
...
```

If these units are not enabled as expected, enable them by running:

```console
$ sudo systemctl enable --now nvidia-cdi-refresh.path
$ sudo systemctl enable --now nvidia-cdi-refresh.service
```

#### Troubleshooting CDI Specification Generation and Resolution

If CDI specifications for the available devices are not generated or updated as expected, check
the logs of the `nvidia-cdi-refresh.service`:

```console
$ sudo journalctl -u nvidia-cdi-refresh.service
```

In most cases, restarting the service should be sufficient to trigger the (re)generation
of CDI specifications:

```console
$ sudo systemctl restart nvidia-cdi-refresh.service
```

- You change the device or CUDA driver configuration.
- You use a location such as `/var/run/cdi` that is cleared on boot.
Running:

A configuration change can occur when MIG devices are created or removed, or when the driver is upgraded.
```console
$ nvidia-ctk --debug cdi list
```
will show a list of available CDI Devices as well as any errors that may have
occurred when loading CDI Specifications from `/etc/cdi` or `/var/run/cdi`.
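
For reference, on a machine with a single GPU that does not support MIG, the device list looks something like the following (names and counts vary per system, and `--debug` adds further log lines):

```output
INFO[0000] Found 9 CDI devices
nvidia.com/gpu=all
nvidia.com/gpu=0
...
```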

### Manual CDI Specification Generation

As of NVIDIA Container Toolkit `v1.18.0`, the recommended mechanism to regenerate CDI specifications is to restart the `nvidia-cdi-refresh.service`:

```console
$ sudo systemctl restart nvidia-cdi-refresh.service
```

If this does not work, or more flexibility is required, the `nvidia-ctk cdi generate` command
can be used directly:

```console
$ sudo nvidia-ctk cdi generate --output=/var/run/cdi/nvidia.yaml
```
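
If you prefer a specification that persists across reboots, you can instead write it to `/etc/cdi/`, which, unlike `/var/run/cdi/`, is not cleared on boot:

```console
$ sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
```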

## Running a Workload with CDI
2 changes: 1 addition & 1 deletion container-toolkit/docker-specialized.md
@@ -206,7 +206,7 @@ The supported constraints are provided below:
- constraint on the compute architectures of the selected GPUs.

* - ``brand``
- constraint on the brand of the selected GPUs (e.g. GeForce, Tesla, GRID).
- constraint on the brand of the selected GPUs (such as GeForce, Tesla, GRID).
```

Multiple constraints can be expressed in a single environment variable: space-separated constraints are ORed,
2 changes: 1 addition & 1 deletion container-toolkit/index.md
@@ -35,5 +35,5 @@ The NVIDIA Container Toolkit is a collection of libraries and utilities enabling
## License

The NVIDIA Container Toolkit (and all included components) is licensed under [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) and
contributions are accepted with a Developer Certificate of Origin (DCO). See the [contributing](https://github.com/NVIDIA/nvidia-container-toolkit/blob/master/CONTRIBUTING.md) document for
contributions are accepted with a Developer Certificate of Origin (DCO). Refer to the [contributing](https://github.com/NVIDIA/nvidia-container-toolkit/blob/master/CONTRIBUTING.md) document for
more information.
27 changes: 20 additions & 7 deletions container-toolkit/install-guide.md
@@ -21,7 +21,7 @@ Alternatively, you can install the driver by [downloading](https://www.nvidia.co
```{note}
There is a [known issue](troubleshooting.md#containers-losing-access-to-gpus-with-error-failed-to-initialize-nvml-unknown-error) on systems
where `systemd` cgroup drivers are used that cause containers to lose access to requested GPUs when
`systemctl daemon reload` is run. Please see the troubleshooting documentation for more information.
`systemctl daemon reload` is run. Refer to the troubleshooting documentation for more information.
```

(installing-with-apt)=
@@ -31,6 +31,12 @@
```{note}
These instructions [should work](./supported-platforms.md) for any Debian-derived distribution.
```
1. Install the prerequisites for the instructions below:
```console
$ sudo apt-get update && sudo apt-get install -y --no-install-recommends \
curl \
gnupg2
```

1. Configure the production repository:

@@ -78,6 +84,12 @@ where `systemd` cgroup drivers are used that cause containers to lose access to
These instructions [should work](./supported-platforms.md) for many RPM-based distributions.
```

1. Install the prerequisites for the instructions below:
```console
$ sudo dnf install -y \
curl
```

1. Configure the production repository:

```console
@@ -186,8 +198,10 @@ follow these steps:
$ sudo nvidia-ctk runtime configure --runtime=containerd
```

The `nvidia-ctk` command modifies the `/etc/containerd/config.toml` file on the host.
The file is updated so that containerd can use the NVIDIA Container Runtime.
By default, the `nvidia-ctk` command creates a `/etc/containerd/conf.d/99-nvidia.toml`
drop-in config file and modifies (or creates) the `/etc/containerd/config.toml` file
to ensure that the `imports` config option is updated accordingly. The drop-in file
ensures that containerd can use the NVIDIA Container Runtime.

1. Restart containerd:

@@ -201,7 +215,7 @@ No additional configuration is needed.
You can just run `nerdctl run --gpus=all`, with root or without root.
You do not need to run the `nvidia-ctk` command mentioned above for Kubernetes.

See also the [nerdctl documentation](https://github.com/containerd/nerdctl/blob/main/docs/gpu.md).
Refer to the [nerdctl documentation](https://github.com/containerd/nerdctl/blob/main/docs/gpu.md) for more information.
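
As a quick smoke test (the CUDA image tag is illustrative; any CUDA-capable image works):

```console
$ sudo nerdctl run -it --rm --gpus all nvidia/cuda:12.3.1-base-ubuntu20.04 nvidia-smi
```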

### Configuring CRI-O

@@ -211,8 +225,8 @@ See also the [nerdctl documentation](https://github.com/containerd/nerdctl/blob/
$ sudo nvidia-ctk runtime configure --runtime=crio
```

The `nvidia-ctk` command modifies the `/etc/crio/crio.conf` file on the host.
The file is updated so that CRI-O can use the NVIDIA Container Runtime.
By default, the `nvidia-ctk` command creates a `/etc/crio/conf.d/99-nvidia.toml`
drop-in config file. The drop-in file ensures that CRI-O can use the NVIDIA Container Runtime.

1. Restart the CRI-O daemon:

@@ -229,7 +243,6 @@ See also the [nerdctl documentation](https://github.com/containerd/nerdctl/blob/

For Podman, NVIDIA recommends using [CDI](./cdi-support.md) for accessing NVIDIA devices in containers.
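
For example, once a CDI specification has been generated as described on that page, all GPUs can be requested by their CDI device name (the image is illustrative; on SELinux-enabled systems you may also need `--security-opt=label=disable`):

```console
$ podman run --rm --device nvidia.com/gpu=all ubuntu nvidia-smi
```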


## Next Steps

- [](./sample-workload.md)