|
1 | 1 | % Date: November 11 2022 |
2 | 2 |
|
3 | | -% Author: elezar |
| 3 | +% Author: elezar ( [email protected]) |
| 4 | +% Author: ArangoGutierrez ( [email protected]) |
4 | 5 |
|
5 | 6 | % headings (h1/h2/h3/h4/h5) are # * = - |
6 | 7 |
|
@@ -29,54 +30,134 @@ CDI also improves the compatibility of the NVIDIA container stack with certain f |
29 | 30 |
|
30 | 31 | - You installed an NVIDIA GPU Driver. |
31 | 32 |
|
32 | | -### Procedure |
| 33 | +### Automatic CDI Specification Generation |
33 | 34 |
|
34 | | -Two common locations for CDI specifications are `/etc/cdi/` and `/var/run/cdi/`. |
35 | | -The contents of the `/var/run/cdi/` directory are cleared on boot. |
| 35 | +As of NVIDIA Container Toolkit `v1.18.0`, the CDI specification is automatically generated and updated by a systemd service called `nvidia-cdi-refresh`. This service: |
36 | 36 |
|
37 | | -However, the path to create and use can depend on the container engine that you use. |
| 37 | +- Automatically generates the CDI specification at `/var/run/cdi/nvidia.yaml` when: |
| 38 | + - The NVIDIA Container Toolkit is installed or upgraded |
| 39 | + - The NVIDIA GPU drivers are installed or upgraded |
| 40 | + - The system is rebooted |
38 | 41 |
|
39 | | -1. Generate the CDI specification file: |
| 42 | +This ensures that the CDI specifications are up to date for the current driver |
| 43 | +and device configuration and that CDI devices defined in these specifications are |
| 44 | +available when using native CDI support in container engines such as Docker or Podman. |
40 | 45 |
|
41 | | - ```console |
42 | | - $ sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml |
43 | | - ``` |
44 | | - |
45 | | - The sample command uses `sudo` to ensure that the file at `/etc/cdi/nvidia.yaml` is created. |
46 | | - You can omit the `--output` argument to print the generated specification to `STDOUT`. |
| 46 | +Run the following command to list the available CDI devices: |
| 47 | +```console |
| 48 | +$ nvidia-ctk cdi list |
| 49 | +``` |
47 | 50 |
|
48 | | - *Example Output* |
| 51 | +#### Known Limitations |
| 52 | +The `nvidia-cdi-refresh` service does not currently handle the following situations: |
49 | 53 |
|
50 | | - ```output |
51 | | - INFO[0000] Auto-detected mode as "nvml" |
52 | | - INFO[0000] Selecting /dev/nvidia0 as /dev/nvidia0 |
53 | | - INFO[0000] Selecting /dev/dri/card1 as /dev/dri/card1 |
54 | | - INFO[0000] Selecting /dev/dri/renderD128 as /dev/dri/renderD128 |
55 | | - INFO[0000] Using driver version xxx.xxx.xx |
56 | | - ... |
57 | | - ``` |
| 54 | +- The removal of NVIDIA GPU drivers |
| 55 | +- The reconfiguration of MIG devices |
58 | 56 |
|
59 | | -1. (Optional) Check the names of the generated devices: |
| 57 | +For these scenarios, the regeneration of CDI specifications must be [manually triggered](#manual-cdi-specification-generation). |
60 | 58 |
|
61 | | - ```console |
62 | | - $ nvidia-ctk cdi list |
63 | | - ``` |
| 59 | +#### Customizing the Automatic CDI Refresh Service |
| 60 | +The behavior of the `nvidia-cdi-refresh` service can be customized by adding |
| 61 | +environment variables to `/etc/nvidia-container-toolkit/cdi-refresh.env` to |
| 62 | +affect the behavior of the `nvidia-ctk cdi generate` command. |
64 | 63 |
|
65 | | - The following example output is for a machine with a single GPU that does not support MIG. |
| 64 | +As an example, to enable debug logging the configuration file should be updated |
| 65 | +as follows: |
| 66 | +```bash |
| 67 | +# /etc/nvidia-container-toolkit/cdi-refresh.env |
| 68 | +NVIDIA_CTK_DEBUG=1 |
| 69 | +``` |
66 | 70 |
|
67 | | - ```output |
68 | | - INFO[0000] Found 9 CDI devices |
69 | | - nvidia.com/gpu=all |
70 | | - nvidia.com/gpu=0 |
71 | | - ``` |
| 71 | +For a complete list of available environment variables, run `nvidia-ctk cdi generate --help` to see the command's documentation. |
72 | 72 |
|
73 | 73 | ```{important} |
74 | | -You must generate a new CDI specification after any of the following changes: |
| 74 | +Modifications to the environment file require a systemd reload and a restart of the |
| 75 | +service to take effect. |
| 76 | +``` |
| 77 | + |
| 78 | +```console |
| 79 | +$ sudo systemctl daemon-reload |
| 80 | +$ sudo systemctl restart nvidia-cdi-refresh.service |
| 81 | +``` |
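As a rough sketch of the update step described above, the following shell function sets a variable in an environment file without adding a duplicate entry if the variable is already present. The `set_env_var` helper is hypothetical (it is not part of the NVIDIA Container Toolkit), and it assumes GNU `sed` and values that contain no `|`, `&`, or `\` characters.

```bash
# set_env_var FILE KEY VALUE
# Hypothetical helper: set KEY=VALUE in FILE, replacing an existing
# assignment for KEY or appending a new line if there is none.
# Assumes GNU sed and a VALUE without '|', '&', or '\' characters.
set_env_var() {
    local file="$1" key="$2" value="$3"
    # Create the file if it does not exist yet.
    touch "$file"
    if grep -q "^${key}=" "$file"; then
        # Replace the existing assignment in place.
        sed -i "s|^${key}=.*|${key}=${value}|" "$file"
    else
        # No assignment yet: append one.
        echo "${key}=${value}" >> "$file"
    fi
}
```

For example, running `set_env_var /etc/nvidia-container-toolkit/cdi-refresh.env NVIDIA_CTK_DEBUG 1` as root, followed by the reload and restart commands shown above, would enable debug logging for the service.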
| 82 | + |
| 83 | +#### Managing the CDI Refresh Service |
| 84 | + |
| 85 | +The `nvidia-cdi-refresh` service consists of two systemd units: |
| 86 | + |
| 87 | +- `nvidia-cdi-refresh.path`: Monitors for changes to the system and triggers the service. |
| 88 | +- `nvidia-cdi-refresh.service`: Generates the CDI specifications for all available devices based on |
| 89 | + the default configuration and any overrides in the environment file. |
| 90 | + |
| 91 | +These units can be managed using standard systemd commands. |
| 92 | + |
| 93 | +When working as expected, the `nvidia-cdi-refresh.path` unit will be enabled and active, and the |
| 94 | +`nvidia-cdi-refresh.service` unit will be enabled and have run at least once. For example: |
| 95 | + |
| 96 | +```console |
| 97 | +$ sudo systemctl status nvidia-cdi-refresh.path |
| 98 | +● nvidia-cdi-refresh.path - Trigger CDI refresh on NVIDIA driver install / uninstall events |
| 99 | + Loaded: loaded (/etc/systemd/system/nvidia-cdi-refresh.path; enabled; preset: enabled) |
| 100 | + Active: active (waiting) since Fri 2025-06-27 06:04:54 EDT; 1h 47min ago |
| 101 | + Triggers: ● nvidia-cdi-refresh.service |
| 102 | +``` |
| 103 | + |
| 104 | +```console |
| 105 | +$ sudo systemctl status nvidia-cdi-refresh.service |
| 106 | +○ nvidia-cdi-refresh.service - Refresh NVIDIA CDI specification file |
| 107 | + Loaded: loaded (/etc/systemd/system/nvidia-cdi-refresh.service; enabled; preset: enabled) |
| 108 | + Active: inactive (dead) since Fri 2025-06-27 07:17:26 EDT; 34min ago |
| 109 | +TriggeredBy: ● nvidia-cdi-refresh.path |
| 110 | + Process: 1317511 ExecStart=/usr/bin/nvidia-ctk cdi generate --output=/var/run/cdi/nvidia.yaml (code=exited, status=0/SUCCESS) |
| 111 | + Main PID: 1317511 (code=exited, status=0/SUCCESS) |
| 112 | + CPU: 562ms |
| 113 | +... |
| 114 | +``` |
| 115 | + |
| 116 | +If these are not enabled as expected, they can be enabled by running: |
| 117 | + |
| 118 | +```console |
| 119 | +$ sudo systemctl enable --now nvidia-cdi-refresh.path |
| 120 | +$ sudo systemctl enable --now nvidia-cdi-refresh.service |
| 121 | +``` |
| 122 | + |
| 123 | +#### Troubleshooting CDI Specification Generation and Resolution |
| 124 | + |
| 125 | +If CDI specifications for available devices are not generated or updated as expected, |
| 126 | +check the logs of the `nvidia-cdi-refresh.service` unit. This can be |
| 127 | +done by running: |
| 128 | + |
| 129 | +```console |
| 130 | +$ sudo journalctl -u nvidia-cdi-refresh.service |
| 131 | +``` |
| 132 | + |
| 133 | +In most cases, restarting the service should be sufficient to trigger the (re)generation |
| 134 | +of CDI specifications: |
| 135 | + |
| 136 | +```console |
| 137 | +$ sudo systemctl restart nvidia-cdi-refresh.service |
| 138 | +``` |
75 | 139 |
|
76 | | -- You change the device or CUDA driver configuration. |
77 | | -- You use a location such as `/var/run/cdi` that is cleared on boot. |
| 140 | +Running: |
78 | 141 |
|
79 | | -A configuration change can occur when MIG devices are created or removed, or when the driver is upgraded. |
| 142 | +```console |
| 143 | +$ nvidia-ctk --debug cdi list |
| 144 | +``` |
| 145 | +will show a list of available CDI Devices as well as any errors that may have |
| 146 | +occurred when loading CDI Specifications from `/etc/cdi` or `/var/run/cdi`. |
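
Before digging into the logs, it can help to confirm that a specification file exists at all. The following is a minimal sketch, not an official tool: `list_cdi_specs` is a hypothetical helper, and it assumes that specification files in these directories use a `.yaml` or `.json` extension.

```bash
# list_cdi_specs DIR...
# Hypothetical helper: print every CDI specification file found in the
# given directories. Returns 0 if at least one file was found, 1 otherwise.
# Assumes specification files end in .yaml or .json.
list_cdi_specs() {
    local dir file found=1
    for dir in "$@"; do
        # Skip directories that do not exist (for example, before first use).
        [ -d "$dir" ] || continue
        for file in "$dir"/*.yaml "$dir"/*.json; do
            # An unmatched glob stays literal, so confirm the file exists.
            [ -f "$file" ] || continue
            echo "$file"
            found=0
        done
    done
    return "$found"
}

# Check the two directories named in this section.
if ! list_cdi_specs /etc/cdi /var/run/cdi; then
    echo "no CDI specification files found" >&2
fi
```

This only checks that files are present. The `nvidia-ctk --debug cdi list` command above remains the way to see whether those files load without errors.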
| 147 | + |
| 148 | +### Manual CDI Specification Generation |
| 149 | + |
| 150 | +As of NVIDIA Container Toolkit `v1.18.0`, the recommended mechanism to regenerate CDI specifications is to restart the `nvidia-cdi-refresh.service` unit: |
| 151 | + |
| 152 | +```console |
| 153 | +$ sudo systemctl restart nvidia-cdi-refresh.service |
| 154 | +``` |
| 155 | + |
| 156 | +If this does not work, or more flexibility is required, the `nvidia-ctk cdi generate` command |
| 157 | +can be used directly: |
| 158 | + |
| 159 | +```console |
| 160 | +$ sudo nvidia-ctk cdi generate --output=/var/run/cdi/nvidia.yaml |
80 | 161 | ``` |
81 | 162 |
|
82 | 163 | ## Running a Workload with CDI |
|