Commit 042b79a

chenopis, elezar, ArangoGutierrez, and a-mccarthy committed
Container Toolkit 1.18.0 release notes (#242)
* initial draft of release notes and version updates
  Signed-off-by: Andrew Chen <[email protected]>
* update release notes w/ feedback from elezar@
  Signed-off-by: Andrew Chen <[email protected]>
* Release note updates
  Signed-off-by: Evan Lezar <[email protected]>
* Add documentation for the systemd nvidia-container-toolkit.service (#203)
  Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
* Update CDI service instructions
  Signed-off-by: Evan Lezar <[email protected]>
* fix typos
  Signed-off-by: Andrew Chen <[email protected]>
* Changes for style guide compliance
  Signed-off-by: Andrew Chen <[email protected]>
* Update container-toolkit/release-notes.md
  Co-authored-by: Abigail McCarthy <[email protected]>
  Signed-off-by: Evan Lezar <[email protected]>
* Add curl to install instructions
  Signed-off-by: Evan Lezar <[email protected]>
* Update containerd and crio instructions for drop-in files
  Signed-off-by: Evan Lezar <[email protected]>

---------

Signed-off-by: Andrew Chen <[email protected]>
Signed-off-by: Evan Lezar <[email protected]>
Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
Co-authored-by: Evan Lezar <[email protected]>
Co-authored-by: Carlos Eduardo Arango Gutierrez <[email protected]>
Co-authored-by: Abigail McCarthy <[email protected]>
Signed-off-by: chenopis <[email protected]>
1 parent f0df4fa commit 042b79a

12 files changed (+223 / -64 lines changed)


container-toolkit/arch-overview.md

Lines changed: 3 additions & 3 deletions
@@ -78,7 +78,7 @@ This component is included in the `nvidia-container-toolkit` package.

  This component includes an executable that implements the interface required by a `runC` `prestart` hook. This script is invoked by `runC`
  after a container has been created, but before it has been started, and is given access to the `config.json` associated with the container
- (e.g. this [config.json](https://github.com/opencontainers/runtime-spec/blob/master/config.md#configuration-schema-example=) ). It then takes
+ (such as this [config.json](https://github.com/opencontainers/runtime-spec/blob/master/config.md#configuration-schema-example=) ). It then takes
  information contained in the `config.json` and uses it to invoke the `nvidia-container-cli` CLI with an appropriate set of flags. One of the
  most important flags being which specific GPU devices should be injected into the container.
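To make the flow described in this hunk concrete, the kind of invocation the hook constructs looks roughly like the sketch below. The device index, PID, and rootfs path are illustrative placeholders, and the exact flag set varies with the container's configuration:

```console
$ nvidia-container-cli --load-kmods configure \
    --device=0 \
    --compute --utility \
    --pid=12345 \
    /path/to/container/rootfs
```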

@@ -111,7 +111,7 @@ To use Kubernetes with Docker, you need to configure the Docker `daemon.json` to
  a reference to the NVIDIA Container Runtime and set this runtime as the default. The NVIDIA Container Toolkit contains a utility to update this file
  as highlighted in the `docker`-specific installation instructions.

- See the {doc}`install-guide` for more information on installing the NVIDIA Container Toolkit on various Linux distributions.
+ Refer to the {doc}`install-guide` for more information on installing the NVIDIA Container Toolkit on various Linux distributions.

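The utility mentioned above is `nvidia-ctk`; a typical invocation that updates `daemon.json` and makes the NVIDIA runtime the default looks like the following sketch. The `--set-as-default` flag and the Docker restart step are assumptions drawn from the Toolkit's CLI, not part of this diff:

```console
$ sudo nvidia-ctk runtime configure --runtime=docker --set-as-default
$ sudo systemctl restart docker
```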
  ### Package Repository

@@ -130,7 +130,7 @@ For the different components:

  :::{note}
  As of the release of version `1.6.0` of the NVIDIA Container Toolkit the packages for all components are
- published to the `libnvidia-container` `repository <https://nvidia.github.io/libnvidia-container/>` listed above. For older package versions please see the documentation archives.
+ published to the `libnvidia-container` `repository <https://nvidia.github.io/libnvidia-container/>` listed above. For older package versions refer to the documentation archives.
  :::

  Releases of the software are also hosted on `experimental` branch of the repository and are graduated to `stable` after test/validation. To get access to the latest

container-toolkit/cdi-support.md

Lines changed: 116 additions & 35 deletions
@@ -1,6 +1,7 @@
  % Date: November 11 2022

- % Author: elezar
+ % Author: elezar ([email protected])
+ % Author: ArangoGutierrez ([email protected])

  % headings (h1/h2/h3/h4/h5) are # * = -

@@ -29,54 +30,134 @@ CDI also improves the compatibility of the NVIDIA container stack with certain f

  - You installed an NVIDIA GPU Driver.

- ### Procedure
+ ### Automatic CDI Specification Generation

- Two common locations for CDI specifications are `/etc/cdi/` and `/var/run/cdi/`.
- The contents of the `/var/run/cdi/` directory are cleared on boot.
+ As of NVIDIA Container Toolkit `v1.18.0`, the CDI specification is automatically generated and updated by a systemd service called `nvidia-cdi-refresh`. This service:

- However, the path to create and use can depend on the container engine that you use.
+ - Automatically generates the CDI specification at `/var/run/cdi/nvidia.yaml` when:
+   - The NVIDIA Container Toolkit is installed or upgraded
+   - The NVIDIA GPU drivers are installed or upgraded
+   - The system is rebooted

- 1. Generate the CDI specification file:
+ This ensures that the CDI specifications are up to date for the current driver
+ and device configuration and that CDI Devices defined in these specifications are
+ available when using native CDI support in container engines such as Docker or Podman.

-    ```console
-    $ sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
-    ```
-
-    The sample command uses `sudo` to ensure that the file at `/etc/cdi/nvidia.yaml` is created.
-    You can omit the `--output` argument to print the generated specification to `STDOUT`.
+ Running the following command will give a list of available CDI Devices:
+ ```console
+ nvidia-ctk cdi list
+ ```

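For reference, the example output from the previous revision of this page (removed further down in this diff) shows the shape of that list on a machine with a single GPU that does not support MIG:

```console
$ nvidia-ctk cdi list
INFO[0000] Found 9 CDI devices
nvidia.com/gpu=all
nvidia.com/gpu=0
```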

- *Example Output*
+ #### Known limitations
+ The `nvidia-cdi-refresh` service does not currently handle the following situations:

-    ```output
-    INFO[0000] Auto-detected mode as "nvml"
-    INFO[0000] Selecting /dev/nvidia0 as /dev/nvidia0
-    INFO[0000] Selecting /dev/dri/card1 as /dev/dri/card1
-    INFO[0000] Selecting /dev/dri/renderD128 as /dev/dri/renderD128
-    INFO[0000] Using driver version xxx.xxx.xx
-    ...
-    ```
+ - The removal of NVIDIA GPU drivers
+ - The reconfiguration of MIG devices

- 1. (Optional) Check the names of the generated devices:
+ For these scenarios, the regeneration of CDI specifications must be [manually triggered](#manual-cdi-specification-generation).

-    ```console
-    $ nvidia-ctk cdi list
-    ```
+ #### Customizing the Automatic CDI Refresh Service
+ The behavior of the `nvidia-cdi-refresh` service can be customized by adding
+ environment variables to `/etc/nvidia-container-toolkit/cdi-refresh.env` to
+ affect the behavior of the `nvidia-ctk cdi generate` command.

-    The following example output is for a machine with a single GPU that does not support MIG.
+ As an example, to enable debug logging, the configuration file should be updated
+ as follows:
+ ```bash
+ # /etc/nvidia-container-toolkit/cdi-refresh.env
+ NVIDIA_CTK_DEBUG=1
+ ```

-    ```output
-    INFO[0000] Found 9 CDI devices
-    nvidia.com/gpu=all
-    nvidia.com/gpu=0
-    ```
+ For a complete list of available environment variables, run `nvidia-ctk cdi generate --help` to see the command's documentation.

  ```{important}
- You must generate a new CDI specification after any of the following changes:
+ Modifications to the environment file require a systemd reload and a restart of the
+ service to take effect:
+ ```
+
+ ```console
+ $ sudo systemctl daemon-reload
+ $ sudo systemctl restart nvidia-cdi-refresh.service
+ ```
+
+ #### Managing the CDI Refresh Service
+
+ The `nvidia-cdi-refresh` service consists of two systemd units:
+
+ - `nvidia-cdi-refresh.path`: Monitors for changes to the system and triggers the service.
+ - `nvidia-cdi-refresh.service`: Generates the CDI specifications for all available devices based on
+   the default configuration and any overrides in the environment file.
+
+ These services can be managed using standard systemd commands.
+
+ When working as expected, the `nvidia-cdi-refresh.path` service will be enabled and active, and the
+ `nvidia-cdi-refresh.service` will be enabled and have run at least once. For example:
+
+ ```console
+ $ sudo systemctl status nvidia-cdi-refresh.path
+ ● nvidia-cdi-refresh.path - Trigger CDI refresh on NVIDIA driver install / uninstall events
+      Loaded: loaded (/etc/systemd/system/nvidia-cdi-refresh.path; enabled; preset: enabled)
+      Active: active (waiting) since Fri 2025-06-27 06:04:54 EDT; 1h 47min ago
+    Triggers: ● nvidia-cdi-refresh.service
+ ```
+
+ ```console
+ $ sudo systemctl status nvidia-cdi-refresh.service
+ ○ nvidia-cdi-refresh.service - Refresh NVIDIA CDI specification file
+      Loaded: loaded (/etc/systemd/system/nvidia-cdi-refresh.service; enabled; preset: enabled)
+      Active: inactive (dead) since Fri 2025-06-27 07:17:26 EDT; 34min ago
+ TriggeredBy: ● nvidia-cdi-refresh.path
+     Process: 1317511 ExecStart=/usr/bin/nvidia-ctk cdi generate --output=/var/run/cdi/nvidia.yaml (code=exited, status=0/SUCCESS)
+    Main PID: 1317511 (code=exited, status=0/SUCCESS)
+         CPU: 562ms
+ ...
+ ```
+
+ If these are not enabled as expected, they can be enabled by running:
+
+ ```console
+ $ sudo systemctl enable --now nvidia-cdi-refresh.path
+ $ sudo systemctl enable --now nvidia-cdi-refresh.service
+ ```
+
+ #### Troubleshooting CDI Specification Generation and Resolution
+
+ If CDI specifications for available devices are not generated or updated as expected, it is
+ recommended that the logs for the `nvidia-cdi-refresh.service` be checked. This can be
+ done by running:
+
+ ```console
+ $ sudo journalctl -u nvidia-cdi-refresh.service
+ ```
+
+ In most cases, restarting the service should be sufficient to trigger the (re)generation
+ of CDI specifications:
+
+ ```console
+ $ sudo systemctl restart nvidia-cdi-refresh.service
+ ```

- - You change the device or CUDA driver configuration.
- - You use a location such as `/var/run/cdi` that is cleared on boot.
+ Running:

- A configuration change can occur when MIG devices are created or removed, or when the driver is upgraded.
+ ```console
+ $ nvidia-ctk --debug cdi list
+ ```
+ will show a list of available CDI Devices as well as any errors that may have
+ occurred when loading CDI Specifications from `/etc/cdi` or `/var/run/cdi`.
+
+ ### Manual CDI Specification Generation
+
+ As of the NVIDIA Container Toolkit `v1.18.0`, the recommended mechanism to regenerate CDI specifications is to restart the `nvidia-cdi-refresh.service`:
+
+ ```console
+ $ sudo systemctl restart nvidia-cdi-refresh.service
+ ```
+
+ If this does not work, or more flexibility is required, the `nvidia-ctk cdi generate` command
+ can be used directly:
+
+ ```console
+ $ sudo nvidia-ctk cdi generate --output=/var/run/cdi/nvidia.yaml
  ```

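If a specification that survives reboots is preferred, the command documented in the previous revision of this page writes to `/etc/cdi/` instead (the contents of `/var/run/cdi/` are cleared on boot):

```console
$ sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
```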

  ## Running a Workload with CDI

container-toolkit/docker-specialized.md

Lines changed: 1 addition & 1 deletion
@@ -206,7 +206,7 @@ The supported constraints are provided below:
    - constraint on the compute architectures of the selected GPUs.

  * - ``brand``
-   - constraint on the brand of the selected GPUs (e.g. GeForce, Tesla, GRID).
+   - constraint on the brand of the selected GPUs (such as GeForce, Tesla, GRID).
  ```

  Multiple constraints can be expressed in a single environment variable: space-separated constraints are ORed,

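As an illustration of a `brand` constraint combined with others, consider the sketch below; the image tag, the version values, and setting the variable by hand are all placeholders, since `NVIDIA_REQUIRE_*` variables are normally baked into CUDA images:

```console
$ docker run --rm --gpus all \
    -e NVIDIA_REQUIRE_CUDA="cuda>=11.0 brand=tesla,driver>=450" \
    ubuntu:22.04 nvidia-smi
```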
container-toolkit/index.md

Lines changed: 1 addition & 1 deletion
@@ -35,5 +35,5 @@ The NVIDIA Container Toolkit is a collection of libraries and utilities enabling
  ## License

  The NVIDIA Container Toolkit (and all included components) is licensed under [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) and
- contributions are accepted with a Developer Certificate of Origin (DCO). See the [contributing](https://github.com/NVIDIA/nvidia-container-toolkit/blob/master/CONTRIBUTING.md) document for
+ contributions are accepted with a Developer Certificate of Origin (DCO). Refer to the [contributing](https://github.com/NVIDIA/nvidia-container-toolkit/blob/master/CONTRIBUTING.md) document for
  more information.

container-toolkit/install-guide.md

Lines changed: 20 additions & 7 deletions
@@ -21,7 +21,7 @@ Alternatively, you can install the driver by [downloading](https://www.nvidia.co
  ```{note}
  There is a [known issue](troubleshooting.md#containers-losing-access-to-gpus-with-error-failed-to-initialize-nvml-unknown-error) on systems
  where `systemd` cgroup drivers are used that cause containers to lose access to requested GPUs when
- `systemctl daemon reload` is run. Please see the troubleshooting documentation for more information.
+ `systemctl daemon-reload` is run. Refer to the troubleshooting documentation for more information.
  ```

  (installing-with-apt)=
@@ -31,6 +31,12 @@ where `systemd` cgroup drivers are used that cause containers to lose access to
  ```{note}
  These instructions [should work](./supported-platforms.md) for any Debian-derived distribution.
  ```
+ 1. Install the prerequisites for the instructions below:
+    ```console
+    $ sudo apt-get update && sudo apt-get install -y --no-install-recommends \
+        curl \
+        gnupg2
+    ```

  1. Configure the production repository:

@@ -78,6 +84,12 @@ where `systemd` cgroup drivers are used that cause containers to lose access to
  These instructions [should work](./supported-platforms.md) for many RPM-based distributions.
  ```

+ 1. Install the prerequisites for the instructions below:
+    ```console
+    $ sudo dnf install -y \
+        curl
+    ```
+
  1. Configure the production repository:

  ```console
@@ -186,8 +198,10 @@ follow these steps:
     $ sudo nvidia-ctk runtime configure --runtime=containerd
     ```

-    The `nvidia-ctk` command modifies the `/etc/containerd/config.toml` file on the host.
-    The file is updated so that containerd can use the NVIDIA Container Runtime.
+    By default, the `nvidia-ctk` command creates a `/etc/containerd/conf.d/99-nvidia.toml`
+    drop-in config file and modifies (or creates) the `/etc/containerd/config.toml` file
+    to ensure that the `imports` config option is updated accordingly. The drop-in file
+    ensures that containerd can use the NVIDIA Container Runtime.

  1. Restart containerd:

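To confirm the result of this change on a host, the generated drop-in and the `imports` entry it relies on can be inspected directly (a sketch; file paths as stated in the hunk above, output varies by system):

```console
$ cat /etc/containerd/conf.d/99-nvidia.toml
$ grep imports /etc/containerd/config.toml
```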
@@ -201,7 +215,7 @@ No additional configuration is needed.
  You can just run `nerdctl run --gpus=all`, with root or without root.
  You do not need to run the `nvidia-ctk` command mentioned above for Kubernetes.

- See also the [nerdctl documentation](https://github.com/containerd/nerdctl/blob/main/docs/gpu.md).
+ Refer to the [nerdctl documentation](https://github.com/containerd/nerdctl/blob/main/docs/gpu.md) for more information.

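A complete invocation following that pattern might look like the sketch below; the image tag is a placeholder:

```console
$ nerdctl run -it --rm --gpus all nvidia/cuda:12.3.1-base-ubuntu22.04 nvidia-smi
```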
  ### Configuring CRI-O

@@ -211,8 +225,8 @@ See also the [nerdctl documentation](https://github.com/containerd/nerdctl/blob/
     $ sudo nvidia-ctk runtime configure --runtime=crio
     ```

-    The `nvidia-ctk` command modifies the `/etc/crio/crio.conf` file on the host.
-    The file is updated so that CRI-O can use the NVIDIA Container Runtime.
+    By default, the `nvidia-ctk` command creates a `/etc/crio/conf.d/99-nvidia.toml`
+    drop-in config file. The drop-in file ensures that CRI-O can use the NVIDIA Container Runtime.

  1. Restart the CRI-O daemon:

@@ -229,7 +243,6 @@ See also the [nerdctl documentation](https://github.com/containerd/nerdctl/blob/

  For Podman, NVIDIA recommends using [CDI](./cdi-support.md) for accessing NVIDIA devices in containers.

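A minimal sketch of that recommendation, assuming a CDI specification has already been generated as described in `cdi-support.md` (the image is a placeholder):

```console
$ podman run --rm --device nvidia.com/gpu=all ubuntu:22.04 nvidia-smi -L
```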
-
  ## Next Steps

  - [](./sample-workload.md)
