76 commits
ad09bd7  feat: GPU telemetry milestone 1 (ilana-n, Sep 17, 2025)
8123deb  fix: address copilot comments (ilana-n, Sep 17, 2025)
4087aa7  feat: addressing review comments (ilana-n, Sep 19, 2025)
4024fc4  feat: add universal constants file (ilana-n, Sep 19, 2025)
82b524f  fix: minor review comments (ilana-n, Sep 19, 2025)
06e41ae  fix: address review comments, especially about setting up metric unit… (ilana-n, Sep 19, 2025)
035226e  feat: address @acasagrande review comments (ilana-n, Sep 23, 2025)
4bc748c  feat: update unit tests and add new metric class types (ilana-n, Sep 23, 2025)
5e0a900  feat: clean up comments and make sure TelemetryManager callback for r… (ilana-n, Sep 23, 2025)
b67d23c  fix: formatting and linting (ilana-n, Sep 23, 2025)
0aeb065  fix: add prometheous client import to pyproject.toml so that tests on… (ilana-n, Sep 23, 2025)
e3ab6a9  fix: address github copilot review comments (ilana-n, Sep 23, 2025)
2885a12  feat: end-to-end functionality (ilana-n, Sep 24, 2025)
030f22d  Resolve merge conflicts in system_controller.py (ilana-n, Sep 26, 2025)
0df4180  Cleanup: simplify telemetry results initialization (ilana-n, Sep 29, 2025)
a72f431  feat: GPU Telemetry Milestone 1 (#274) (ilana-n, Sep 24, 2025)
98fa3aa  address copilot and coderabbit comments (ilana-n, Sep 30, 2025)
c8b571c  address coderabbit nitpicks (ilana-n, Sep 30, 2025)
7af1c4c  Merge feature/gpu-telemetry into gpu-telemetry-cli (ilana-n, Sep 30, 2025)
25bf7de  feat: GPU Telemetry Integration - Milestone 2 (#316) (ilana-n, Oct 7, 2025)
9e3ad8d  fix: add hack to support --gpu-telemetry without any arguments (ajcasagrande, Oct 7, 2025)
f2c16bb  feat: GPU telemetry milestone 1 (ilana-n, Sep 17, 2025)
cbed4e2  fix: address copilot comments (ilana-n, Sep 17, 2025)
158b54e  feat: addressing review comments (ilana-n, Sep 19, 2025)
a1f4e8a  feat: add universal constants file (ilana-n, Sep 19, 2025)
ea971ea  fix: minor review comments (ilana-n, Sep 19, 2025)
3633be9  fix: address review comments, especially about setting up metric unit… (ilana-n, Sep 19, 2025)
b4274cd  feat: address @acasagrande review comments (ilana-n, Sep 23, 2025)
0d052ec  feat: update unit tests and add new metric class types (ilana-n, Sep 23, 2025)
47a1875  feat: clean up comments and make sure TelemetryManager callback for r… (ilana-n, Sep 23, 2025)
9c92435  fix: formatting and linting (ilana-n, Sep 23, 2025)
dbd886d  fix: add prometheous client import to pyproject.toml so that tests on… (ilana-n, Sep 23, 2025)
24ff8af  fix: address github copilot review comments (ilana-n, Sep 23, 2025)
16ecf7a  feat: end-to-end functionality (ilana-n, Sep 24, 2025)
6598a1a  Resolve merge conflicts in system_controller.py (ilana-n, Sep 26, 2025)
b1fcdea  Cleanup: simplify telemetry results initialization (ilana-n, Sep 29, 2025)
c68f30d  address copilot and coderabbit comments (ilana-n, Sep 30, 2025)
e330347  address coderabbit nitpicks (ilana-n, Sep 30, 2025)
e67188a  feat: GPU Telemetry Milestone 1 (#274) (ilana-n, Sep 24, 2025)
2631368  feat: GPU Telemetry Integration - Milestone 2 (#316) (ilana-n, Oct 7, 2025)
2221280  fix: add hack to support --gpu-telemetry without any arguments (ajcasagrande, Oct 7, 2025)
b891b3f  fix: minor push v pull (ilana-n, Oct 7, 2025)
03ee7b3  merge pull fix (ilana-n, Oct 7, 2025)
99fdac2  feat: add final essential metrics (ilana-n, Oct 7, 2025)
28c562b  feat: add user documentation for gpu telemetry feature (#332) (ilana-n, Oct 8, 2025)
b238cba  fix: remove extra info logging in telemetry manager and clean up csv … (ilana-n, Oct 8, 2025)
c979f68  Merge branch 'feature/gpu-telemetry' of https://github.com/ai-dynamo/… (ilana-n, Oct 8, 2025)
469eb18  fix: deduplicate identical url endpoints (ilana-n, Oct 8, 2025)
ccb8e26  fix: copilot nits and add example to documentation (ilana-n, Oct 8, 2025)
120ec76  fix: coderabbit comments and remove documentation to put in its own pr (ilana-n, Oct 8, 2025)
1ef1ed3  fix: clear error counts across profiling runs for telemetry data (ilana-n, Oct 8, 2025)
5688ea0  Merge branch 'main' into feature/gpu-telemetry (ilana-n, Oct 8, 2025)
b7c5d21  fix: telemetry storage group by uuid (ilana-n, Oct 8, 2025)
6e769bf  fix: telemetry storage group by uuid (ilana-n, Oct 8, 2025)
700cdd1  Merge branch 'feature/gpu-telemetry' of https://github.com/ai-dynamo/… (ilana-n, Oct 8, 2025)
79927f5  Apply suggestion from @coderabbitai[bot] (ilana-n, Oct 8, 2025)
1ffff3a  fix: increase unit test code coverage (ilana-n, Oct 8, 2025)
800b742  Merge branch 'feature/gpu-telemetry' of https://github.com/ai-dynamo/… (ilana-n, Oct 8, 2025)
52cc83d  Update tests/data_exporters/test_json_exporter.py (ilana-n, Oct 8, 2025)
3685f0e  fix: increase unit test code coverage (ilana-n, Oct 8, 2025)
ec34766  fix: make telemetry metrics configurable (ilana-n, Oct 9, 2025)
ca70f06  fix: remove docs from this commit/pr (ilana-n, Oct 9, 2025)
df47bea  fix: revert unwanted doc and ci test changes in this branch (ilana-n, Oct 9, 2025)
2c7516d  fix: unit tests (ilana-n, Oct 9, 2025)
2e9bfbe  fix: autoregister (ilana-n, Oct 10, 2025)
58e4338  fix: telemetry manager add 5s wait in between status message and shut… (ilana-n, Oct 10, 2025)
41953d0  Merge branch 'main' into feature/gpu-telemetry (ilana-n, Oct 10, 2025)
66a8259  fix: separate metrics processors and telemetry processors in records … (ilana-n, Oct 10, 2025)
791b60f  fix: add more unit test coverage for Telemetry Manager (ilana-n, Oct 10, 2025)
379dc0c  fix: address code rabbit unit test comments (ilana-n, Oct 10, 2025)
a21b7fe  Merge branch 'main' into feature/gpu-telemetry (ilana-n, Oct 10, 2025)
540ce2e  Merge remote-tracking branch 'origin/main' into feature/gpu-telemetry (ilana-n, Oct 13, 2025)
875bf6d  fix: import and unit tests for csv and json exporters (ilana-n, Oct 13, 2025)
f5459af  feat: gpu telemetry ci end-to-end tests (ilana-n, Oct 13, 2025)
2983c5b  fix: verification output dir (ilana-n, Oct 13, 2025)
ef13407  fix: use docker (ilana-n, Oct 13, 2025)
23 changes: 23 additions & 0 deletions aiperf/__main__.py
@@ -0,0 +1,23 @@
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

import sys

from aiperf.cli import app
from aiperf.gpu_telemetry.constants import DEFAULT_DCGM_ENDPOINT


def main() -> int:
    # TODO: HACK: Remove this once we can upgrade to v4 of cyclopts
    # This is a hack to allow the --gpu-telemetry flag to be used without a value
    # and it will be set to the default endpoint, which will inform the telemetry
    # exporter to print the telemetry to the console
    if "--gpu-telemetry" in sys.argv:
        idx = sys.argv.index("--gpu-telemetry")
        if idx >= len(sys.argv) - 1 or sys.argv[idx + 1].startswith("-"):
            sys.argv.insert(idx + 1, DEFAULT_DCGM_ENDPOINT)
    return app(sys.argv[1:])


if __name__ == "__main__":
    sys.exit(main())
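Note: the snippet below is a minimal, illustrative sketch (not part of the diff) of how the argv patch above behaves. The concrete value of DEFAULT_DCGM_ENDPOINT is an assumption here; the real constant lives in aiperf.gpu_telemetry.constants.

# Illustrative sketch only: mirrors the argv-patching logic above in isolation.
DEFAULT_DCGM_ENDPOINT = "http://localhost:9401/metrics"  # assumed value

def patch_gpu_telemetry_flag(argv: list[str]) -> list[str]:
    """Insert the default DCGM endpoint when --gpu-telemetry is passed with no value."""
    argv = list(argv)
    if "--gpu-telemetry" in argv:
        idx = argv.index("--gpu-telemetry")
        # Flag is last, or is followed by another option: give it the default endpoint.
        if idx >= len(argv) - 1 or argv[idx + 1].startswith("-"):
            argv.insert(idx + 1, DEFAULT_DCGM_ENDPOINT)
    return argv

# A bare flag gets the default endpoint inserted after it ("gpt2" is a made-up model name).
assert patch_gpu_telemetry_flag(["profile", "--gpu-telemetry", "--model", "gpt2"]) == [
    "profile", "--gpu-telemetry", DEFAULT_DCGM_ENDPOINT, "--model", "gpt2"
]
# An explicitly supplied URL is left untouched.
assert patch_gpu_telemetry_flag(["profile", "--gpu-telemetry", "http://node1:9401/metrics"]) == [
    "profile", "--gpu-telemetry", "http://node1:9401/metrics"
]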
7 changes: 0 additions & 7 deletions aiperf/cli.py
@@ -8,8 +8,6 @@
# will cause a performance penalty during this process.
################################################################################

import sys

from cyclopts import App

from aiperf.cli_utils import exit_on_error
@@ -34,9 +32,4 @@ def profile(
    from aiperf.common.config import load_service_config

    service_config = service_config or load_service_config()

    run_system_controller(user_config, service_config)


if __name__ == "__main__":
    sys.exit(app())
1 change: 1 addition & 0 deletions aiperf/common/config/groups.py
@@ -22,6 +22,7 @@ class Groups:
    AUDIO_INPUT = Group.create_ordered("Audio Input")
    IMAGE_INPUT = Group.create_ordered("Image Input")
    SERVICE = Group.create_ordered("Service")
    TELEMETRY = Group.create_ordered("Telemetry")
    UI = Group.create_ordered("UI")
    WORKERS = Group.create_ordered("Workers")
    DEVELOPER = Group.create_ordered("Developer")
21 changes: 18 additions & 3 deletions aiperf/common/config/user_config.py
@@ -6,14 +6,15 @@
from typing import Annotated, Any

from orjson import JSONDecodeError
from pydantic import Field, model_validator
from pydantic import BeforeValidator, Field, model_validator
from typing_extensions import Self

from aiperf.common.aiperf_logger import AIPerfLogger
from aiperf.common.config.base_config import BaseConfig
from aiperf.common.config.cli_parameter import DisableCLI
from aiperf.common.config.config_validators import coerce_value
from aiperf.common.config.cli_parameter import CLIParameter, DisableCLI
from aiperf.common.config.config_validators import coerce_value, parse_str_or_list
from aiperf.common.config.endpoint_config import EndpointConfig
from aiperf.common.config.groups import Groups
from aiperf.common.config.input_config import InputConfig
from aiperf.common.config.loadgen_config import LoadGeneratorConfig
from aiperf.common.config.output_config import OutputConfig
@@ -210,6 +211,20 @@ def _count_dataset_entries(self) -> int:
        DisableCLI(reason="This is automatically set by the CLI"),
    ] = None

    gpu_telemetry: Annotated[
        list[str] | None,
        Field(
            default=None,
            description="Enable GPU telemetry console display and optionally specify custom DCGM exporter URLs (e.g., http://node1:9401/metrics http://node2:9401/metrics). Default localhost:9401 is always attempted",
        ),
        BeforeValidator(parse_str_or_list),
        CLIParameter(
            name=("--gpu-telemetry",),
            consume_multiple=True,
            group=Groups.TELEMETRY,
        ),
    ]

    @model_validator(mode="after")
    def _compute_config(self) -> Self:
        """Compute additional configuration.
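Note: the sketch below (not part of the diff) shows the kind of coercion a BeforeValidator such as parse_str_or_list is presumably applying to this field, so that a single URL and a list of URLs both end up as list[str]. The real implementation lives in aiperf.common.config.config_validators and is not shown in this diff.

# Hypothetical stand-in for parse_str_or_list; assumed behavior only.
def parse_str_or_list_sketch(value: str | list[str] | None) -> list[str] | None:
    """Wrap a lone string in a list so downstream code always sees list[str] (or None)."""
    if value is None or isinstance(value, list):
        return value
    return [value]

assert parse_str_or_list_sketch("http://node1:9401/metrics") == ["http://node1:9401/metrics"]
assert parse_str_or_list_sketch(
    ["http://node1:9401/metrics", "http://node2:9401/metrics"]
) == ["http://node1:9401/metrics", "http://node2:9401/metrics"]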
16 changes: 16 additions & 0 deletions aiperf/common/enums/__init__.py
@@ -53,6 +53,10 @@
from aiperf.common.enums.metric_enums import (
    BaseMetricUnit,
    BaseMetricUnitInfo,
    EnergyMetricUnit,
    EnergyMetricUnitInfo,
    FrequencyMetricUnit,
    FrequencyMetricUnitInfo,
    GenericMetricUnit,
    MetricFlags,
    MetricOverTimeUnit,
@@ -65,6 +69,10 @@
    MetricValueType,
    MetricValueTypeInfo,
    MetricValueTypeVarT,
    PowerMetricUnit,
    PowerMetricUnitInfo,
    TemperatureMetricUnit,
    TemperatureMetricUnitInfo,
)
from aiperf.common.enums.model_enums import (
    ModelSelectionStrategy,
@@ -122,7 +130,11 @@
    "EndpointServiceKind",
    "EndpointType",
    "EndpointTypeInfo",
    "EnergyMetricUnit",
    "EnergyMetricUnitInfo",
    "ExportLevel",
    "FrequencyMetricUnit",
    "FrequencyMetricUnitInfo",
    "GenericMetricUnit",
    "ImageFormat",
    "LifecycleState",
@@ -141,6 +153,8 @@
    "MetricValueTypeVarT",
    "ModelSelectionStrategy",
    "OpenAIObjectType",
    "PowerMetricUnit",
    "PowerMetricUnitInfo",
    "PromptSource",
    "PublicDatasetType",
    "RecordProcessorType",
@@ -151,6 +165,8 @@
    "ServiceRunType",
    "ServiceType",
    "SystemState",
    "TemperatureMetricUnit",
    "TemperatureMetricUnitInfo",
    "TimingMode",
    "WorkerStatus",
    "ZMQProxyType",
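Note: because the new unit enums are re-exported here, they can be imported from the package namespace just like the existing metric enums; a minimal usage sketch (the enum members themselves are defined in metric_enums and are not shown in this diff):

from aiperf.common.enums import (
    EnergyMetricUnit,
    FrequencyMetricUnit,
    PowerMetricUnit,
    TemperatureMetricUnit,
)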
1 change: 1 addition & 0 deletions aiperf/common/enums/data_exporter_enums.py
@@ -9,6 +9,7 @@ class ConsoleExporterType(CaseInsensitiveStrEnum):
    EXPERIMENTAL_METRICS = "experimental_metrics"
    INTERNAL_METRICS = "internal_metrics"
    METRICS = "metrics"
    TELEMETRY = "telemetry"


class DataExporterType(CaseInsensitiveStrEnum):
3 changes: 3 additions & 0 deletions aiperf/common/enums/message_enums.py
@@ -37,11 +37,14 @@ class MessageType(CaseInsensitiveStrEnum):
    PARSED_INFERENCE_RESULTS = "parsed_inference_results"
    PROCESSING_STATS = "processing_stats"
    PROCESS_RECORDS_RESULT = "process_records_result"
    PROCESS_TELEMETRY_RESULT = "process_telemetry_result"
    PROFILE_PROGRESS = "profile_progress"
    PROFILE_RESULTS = "profile_results"
    REALTIME_METRICS = "realtime_metrics"
    REGISTRATION = "registration"
    SERVICE_ERROR = "service_error"
    STATUS = "status"
    TELEMETRY_RECORDS = "telemetry_records"
    TELEMETRY_STATUS = "telemetry_status"
    WORKER_HEALTH = "worker_health"
    WORKER_STATUS_SUMMARY = "worker_status_summary"