This is a Prometheus exporter for NVIDIA GPU metrics. It uses the Go bindings for the NVIDIA Management Library (NVML), a C-based API for monitoring NVIDIA GPU devices. Unlike some other similar exporters, it does not call the nvidia-smi binary.

GPU stats monitoring (https://github.com/swiftdiaries/nvidia_gpu_prometheus_exporter) is an important feature for training and serving. This document describes the design and implementation of the components needed.
Monitoring is a crucial component that adds visibility into the infrastructure and its (near) real-time performance. GPU stats, in conjunction with TensorBoard, give strong insight into the training process and, separately, into serving.
The required components are:
- Prometheus, which is widely used for monitoring Kubernetes clusters.
- A Prometheus exporter, so that Prometheus can pull GPU stats.
- A dashboard to visualize the Prometheus data.
The repository includes nvml.h, so there are no special requirements from the
build environment. go get should be able to build the exporter binary.
go get github.com/mindprince/nvidia_gpu_prometheus_exporter
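To try the exporter outside the cluster on a machine with NVIDIA drivers installed, you can run the binary that go get installs (typically $GOPATH/bin/nvidia_gpu_prometheus_exporter; the exact path depends on your Go setup) and, from another terminal, check the default metrics endpoint on port 9445:
$ $GOPATH/bin/nvidia_gpu_prometheus_exporter
$ curl -s localhost:9445/metrics | grep -i "gpu"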
kubectl create -f https://raw.githubusercontent.com/swiftdiaries/nvidia_gpu_prometheus_exporter/master/nvidia-exporter.yaml
Note: Ensure nvidia-docker is installed.
$ sudo docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi
Reference: https://github.com/NVIDIA/nvidia-docker (build and run Docker containers leveraging NVIDIA GPUs)
$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.11/nvidia-device-plugin.yml
Note: It takes a couple of minutes for the drivers to install.
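To confirm that the device plugin is running and that the nodes now advertise GPUs (pod and namespace names may differ between plugin versions), something like the following can be used:
$ kubectl get pods -n kube-system | grep -i nvidia
$ kubectl describe nodes | grep -i "nvidia.com/gpu"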
$ kubectl apply --filename https://raw.githubusercontent.com/giantswarm/kubernetes-prometheus/master/manifests-all.yaml
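To check that the Prometheus components came up (the namespace depends on the manifests used):
$ kubectl get pods --all-namespaces | grep -i prometheus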
wget https://raw.githubusercontent.com/swiftdiaries/nvidia_gpu_prometheus_exporter/master/Prometheus-GPU-stats-1533769198014.json
Import this JSON to Grafana.
Note: Excuse the flat duty cycle.
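One way to reach the Grafana UI for the import is to find its service and port-forward to it; the service name and namespace below are placeholders and depend on the Prometheus manifests used:
$ kubectl get svc --all-namespaces | grep -i grafana
$ kubectl port-forward -n <namespace> svc/<grafana-service> 3000:3000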
- Reduce size of image used for exporter.
- Simpler / manageable YAML for Prometheus.
- ksonnet app for easy deployments / integration with Kubeflow.
Note: these planned improvements are not necessarily listed in priority order.
$ make build
$ docker run -p 9445:9445 --rm --runtime=nvidia swiftdiaries/gpu_prom_metrics
Make changes, build, iterate.
Verify:
$ curl -s localhost:9445/metrics | grep -i "gpu"
Sample output:
# HELP nvidia_gpu_duty_cycle Percent of time over the past sample period during which one or more kernels were executing on the GPU device
# TYPE nvidia_gpu_duty_cycle gauge
nvidia_gpu_duty_cycle{minor_number="0",name="GeForce GTX 950",uuid="GPU-6e7a0fa1-0770-c210-1a5c-8710bc09ce00"} 0
# HELP nvidia_gpu_fanspeed_percent Fanspeed of the GPU device as a percent of its maximum
# TYPE nvidia_gpu_fanspeed_percent gauge
nvidia_gpu_fanspeed_percent{minor_number="0",name="GeForce GTX 950",uuid="GPU-6e7a0fa1-0770-c210-1a5c-8710bc09ce00"} 0
# HELP nvidia_gpu_memory_total_bytes Total memory of the GPU device in bytes
# TYPE nvidia_gpu_memory_total_bytes gauge
nvidia_gpu_memory_total_bytes{minor_number="0",name="GeForce GTX 950",uuid="GPU-6e7a0fa1-0770-c210-1a5c-8710bc09ce00"} 2.092171264e+09
# HELP nvidia_gpu_memory_used_bytes Memory used by the GPU device in bytes
# TYPE nvidia_gpu_memory_used_bytes gauge
nvidia_gpu_memory_used_bytes{minor_number="0",name="GeForce GTX 950",uuid="GPU-6e7a0fa1-0770-c210-1a5c-8710bc09ce00"} 1.048576e+06
# HELP nvidia_gpu_num_devices Number of GPU devices
# TYPE nvidia_gpu_num_devices gauge
nvidia_gpu_num_devices 1
# HELP nvidia_gpu_power_usage_milliwatts Power usage of the GPU device in milliwatts
# TYPE nvidia_gpu_power_usage_milliwatts gauge
nvidia_gpu_power_usage_milliwatts{minor_number="0",name="GeForce GTX 950",uuid="GPU-6e7a0fa1-0770-c210-1a5c-8710bc09ce00"} 13240
# HELP nvidia_gpu_temperature_celsius Temperature of the GPU device in celsius
# TYPE nvidia_gpu_temperature_celsius gauge
nvidia_gpu_temperature_celsius{minor_number="0",name="GeForce GTX 950",uuid="GPU-6e7a0fa1-0770-c210-1a5c-8710bc09ce00"} 34
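Once Prometheus is scraping the exporter, the same metrics can be queried through the Prometheus HTTP API. For example, assuming Prometheus has been port-forwarded to localhost:9090:
$ curl -s 'http://localhost:9090/api/v1/query?query=nvidia_gpu_duty_cycle'
A derived expression such as nvidia_gpu_memory_used_bytes / nvidia_gpu_memory_total_bytes gives the fraction of GPU memory currently in use.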
The exporter requires the following:
- access to the NVML library (libnvidia-ml.so.1).
- access to the GPU devices.
To make sure that the exporter can access the NVML library, either add it to the search path for shared libraries or set LD_LIBRARY_PATH to point to its location.
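To check whether the library is already on the shared-library search path, and to point the exporter at it explicitly if it is not (the path below is only an example; it varies by driver installation):
$ ldconfig -p | grep libnvidia-ml
$ LD_LIBRARY_PATH=/usr/lib/nvidia ./nvidia_gpu_prometheus_exporter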
By default, the metrics are exposed on port 9445. This can be changed with the -web.listen-address flag.
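For example, to expose the metrics on a different port (9446 here is arbitrary):
$ ./nvidia_gpu_prometheus_exporter -web.listen-address=":9446"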