
Prometheus Slurm Exporter

Prometheus collector and exporter for metrics extracted from the Slurm resource scheduling system.

Exported Metrics

State of the CPUs

  • Allocated: CPUs which have been allocated to a job.
  • Idle: CPUs not allocated to a job and thus available for use.
  • Other: CPUs which are unavailable for use at the moment.
  • Total: total number of CPUs.
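
A quick way to eyeball these values on a running exporter (the slurm_cpus_* metric prefix and the listen address are assumptions; compare with your own /metrics output):

curl -s http://localhost:8080/metrics | grep '^slurm_cpus'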

State of the GPUs

  • Allocated: GPUs which have been allocated to a job.
  • Other: GPUs which are unavailable for use at the moment.
  • Total: total number of GPUs.
  • Utilization: total GPU utilization on the cluster.

NOTE: GPU accounting is disabled by default. Since version 0.19, you must enable it explicitly by passing the --gpus-acct flag when starting the exporter; without this flag, GPU-related metrics are not collected or exported.
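
Once the exporter is running with --gpus-acct (see "Commands to Start the Exporter" below), you can confirm the GPU metrics are present; the slurm_gpus_* prefix is an assumption, so check your /metrics output for the exact names:

curl -s http://localhost:8080/metrics | grep '^slurm_gpus'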

State of the Nodes

  • Allocated: nodes which have been allocated to one or more jobs.
  • Completing: all jobs associated with these nodes are in the process of being completed.
  • Down: nodes which are unavailable for use.
  • Drain: nodes in a drained or draining state.
  • Fail: nodes expected to fail soon and unavailable for use.
  • Error: nodes in an error state and incapable of running jobs.
  • Idle: nodes not allocated to any jobs.
  • Maint: nodes marked for maintenance.
  • Mixed: nodes with some CPUs allocated and others idle.
  • Planned: nodes held for a multi-node job launch.
  • Resv: nodes in an advanced reservation.

Information extracted from the SLURM sinfo command.
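
For a raw view of the same data, you can count nodes per state with standard sinfo options (-h drops the header, -N lists one line per node, %T prints the node state):

sinfo -h -N -o '%T' | sort | uniq -c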

Additional node usage info

Since version 0.18, CPU and memory usage (allocated, idle, and total) is also extracted for every node known to Slurm, with labels such as the node's hostname and status.
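
The underlying per-node figures can be inspected with standard sinfo format codes (%n hostname, %T state, %C allocated/idle/other/total CPUs, %m total memory in MB); the exporter's exact invocation may differ:

sinfo -h -N -o '%n %T %C %m'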

Status of the Jobs

  • PENDING: Jobs awaiting resource allocation.
  • PENDING_DEPENDENCY: Jobs awaiting job dependency resolution.
  • RUNNING: Jobs currently allocated resources.
  • SUSPENDED: Jobs with suspended execution.
  • CANCELLED: Jobs cancelled by a user or administrator.
  • COMPLETING: Jobs in the process of completion.
  • COMPLETED: Jobs that terminated with an exit code of zero.
  • CONFIGURING: Jobs waiting for resources to become ready.
  • FAILED: Jobs that terminated with a non-zero exit code.
  • TIMEOUT: Jobs terminated upon reaching their time limit.
  • PREEMPTED: Jobs terminated due to preemption.
  • NODE_FAIL: Jobs terminated due to node failure.

Information extracted from the SLURM squeue command.
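
A comparable breakdown straight from Slurm, using standard squeue options (%T prints the job state):

squeue -h -o '%T' | sort | uniq -c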

State of the Partitions

  • Running/suspended jobs per partition, divided by Slurm account and user.
  • Total/allocated/idle CPUs per partition and per user ID.
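
To see the per-partition CPU counts Slurm reports (%R is the partition name, %C the allocated/idle/other/total CPUs):

sinfo -h -o '%R %C'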

Jobs Information per Account and User

Information about running, pending, and suspended jobs per Slurm account and user is also extracted using squeue.
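
For a raw per-account and per-user breakdown (standard squeue format codes: %a account, %u user, %T state):

squeue -h -o '%a %u %T' | sort | uniq -c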

Scheduler Information

  • Server thread count: Number of active slurmctld threads.
  • Queue size: Length of the scheduler queue.
  • DBD Agent queue size: Length of the SlurmDBD message queue.
  • Last cycle: Time for the last scheduling cycle (microseconds).
  • Mean cycle: Mean scheduling cycle time since last reset.
  • Cycles per minute: Number of scheduling executions per minute.
  • Backfill metrics: Metrics related to backfilling jobs, including cycle times, depth mean, and total backfilled jobs.

Information extracted from the SLURM sdiag command.
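
To cross-check, sdiag can be run directly on the controller node:

sdiag   # prints slurmctld statistics: thread count, agent queue sizes, scheduling cycle times, backfill stats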

TLS and Basic Authentication

The Prometheus Slurm Exporter supports TLS and Basic Authentication by using a configuration file. To enable these features, you need to specify the path to a configuration file via the --web.config.file flag. For more information on how to configure TLS or Basic Auth, refer to the Exporter Toolkit documentation.

Example:

./slurm_exporter --web.config.file=/path/to/web-config.yml

An example web-config.yml file:

tls_server_config:
  cert_file: /path/to/cert.crt
  key_file: /path/to/cert.key
basic_auth_users:
  admin: $2y$12$EXAMPLE_ENCRYPTED_PASSWORD_HASH
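
To generate a bcrypt hash for basic_auth_users, one option is Apache's htpasswd utility (-B selects bcrypt, -C sets the cost, -n prints to stdout):

htpasswd -nBC 12 '' | tr -d ':\n'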

For more details, see the Exporter Toolkit documentation.

Installation

  1. Build the exporter as described in DEVELOPMENT.md and copy the executable bin/slurm_exporter to a node with access to the Slurm CLI.
  2. A systemd unit file is provided in lib/systemd/prometheus-slurm-exporter.service.
  3. Optionally, you can package the exporter as a Snap. See packages/snap/README.md for details.
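
As a sketch, installing the binary and unit file might look like this (run as root; the install path is an assumption, so adjust it to match the ExecStart line in the unit file):

cp bin/slurm_exporter /usr/local/bin/
cp lib/systemd/prometheus-slurm-exporter.service /etc/systemd/system/
systemctl daemon-reload
systemctl enable --now prometheus-slurm-exporter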

Commands to Start the Exporter

Here are the different ways to start the exporter based on your needs:

  1. Basic launch without GPU accounting:

./slurm_exporter --web.listen-address=:8080

  2. Launch with GPU accounting enabled:

./slurm_exporter --web.listen-address=:8080 --gpus-acct

  3. Launch with TLS and Basic Authentication:

./slurm_exporter --web.listen-address=:8080 --web.config.file=/path/to/web-config.yml

For more details on TLS and Basic Authentication configuration, refer to the Exporter Toolkit documentation.

Prometheus Configuration

Configure Prometheus to scrape the Slurm exporter:

scrape_configs:
  - job_name: 'my_slurm_exporter'
    scrape_interval: 30s
    scrape_timeout: 30s
    static_configs:
      - targets: ['slurm_host.fqdn:8080']

  • scrape_interval: Set to 30 seconds to prevent overloading the Slurm master.
  • scrape_timeout: Ensure a reasonable timeout to avoid "context deadline exceeded" errors on a busy Slurm master.

Check the Prometheus configuration before reloading:

$ promtool check config prometheus.yml

Grafana Dashboard

A Grafana dashboard is available to visualize the exported metrics, with views for Node Status, Job Status, and Scheduler Info.

License

This project is licensed under the GNU General Public License, version 3 or later.