Skip to content

Conversation

@ajcasagrande
Copy link
Contributor

@ajcasagrande ajcasagrande commented Oct 23, 2025

Summary by CodeRabbit

Release Notes

  • New Features

    • System now supports environment variable configuration for all major settings including timeouts, concurrency limits, and developer mode options.
  • Refactor

    • Migrated configuration system from hardcoded constants to centralized environment-based management for improved runtime customization.
    • Simplified developer mode and internal configuration handling.

@coderabbitai
Copy link

coderabbitai bot commented Oct 23, 2025

Walkthrough

This PR establishes a centralized, hierarchical configuration system via Environment using Pydantic BaseSettings, replacing scattered constants and config classes throughout the codebase. It removes developer-only classes (DeveloperConfig, DeveloperOnlyCLI, DevDefaults), converts ServiceConfig from BaseSettings to BaseConfig, and updates ~70 files to source configuration from Environment.

Changes

Cohort / File(s) Summary
New Environment Configuration System
src/aiperf/common/environment.py
Introduces comprehensive hierarchical environment configuration with nested subsystems (DATASET, DEV, GPU, HTTP, LOGGING, METRICS, RECORD, SERVICE, UI, WORKER, ZMQ), each with typed fields, defaults, and environment variable prefixes. Exposes singleton Environment instance with post-validation hooks for DEV constraints.
Configuration Module Restructuring
src/aiperf/common/config/__init__.py, config/dev_config.py, config/cli_parameter.py, config/config_defaults.py, config/service_config.py
Removes DeveloperConfig, DeveloperOnlyCLI, and DevDefaults classes from public API. ServiceConfig now inherits from BaseConfig instead of BaseSettings; eliminates model_config and developer field. Removes developer-mode initialization and import-time side effects.
Removed Public Constants
src/aiperf/common/constants.py
Removes ~30 default timeouts and configuration constants (e.g., GRACEFUL_SHUTDOWN_TIMEOUT, DEFAULT_HEARTBEAT_INTERVAL, DEFAULT_MAX_WORKERS_CAP, AIPERF_DEV_MODE). Retains fundamental constants (nanoseconds/milliseconds conversions, BYTES_PER_MIB, STAT_KEYS).
Import Path Updates
src/aiperf/common/config/config_validators.py
Updates ServiceType import from aiperf.common.enums to aiperf.common.enums.service_enums.
Core Service/Component Updates
src/aiperf/common/base_component_service.py, bootstrap.py, logging.py
Replace hardcoded constants with Environment equivalents for heartbeat intervals, registration attempts/timeouts, uvloop control, and logging queue sizes. Updated conditional logic to use Environment flags (DEV.ENABLE_YAPPI, DEV.TRACE_SERVICES).
Command & Communication Handlers
src/aiperf/common/mixins/command_handler_mixin.py, message_bus_mixin.py, task_manager_mixin.py, protocols.py
Replace timeout defaults (DEFAULT_COMMAND_RESPONSE_TIMEOUT, DEFAULT_CONNECTION_PROBE_*, TASK_CANCEL_TIMEOUT_SHORT) with Environment.SERVICE equivalents in method signatures.
Service Manager Updates
src/aiperf/controller/base_service_manager.py, kubernetes_service_manager.py, multiprocess_service_manager.py, system_controller.py
Replace DEFAULT_SERVICE_REGISTRATION_TIMEOUT, DEFAULT_SERVICE_START_TIMEOUT with Environment.SERVICE equivalents. system_controller now prints developer-mode warning based on Environment.DEV.MODE.
Proxy & ZMQ Management
src/aiperf/controller/proxy_manager.py, records/record_processor_service.py, zmq/dealer_request_client.py, zmq/pull_client.py, zmq/push_client.py, zmq/zmq_defaults.py
Replace DEFAULT_ZMQ_CONTEXT_TERM_TIMEOUT, DEFAULT_PULL_CLIENT_MAX_CONCURRENCY, DEFAULT_COMMS_REQUEST_TIMEOUT, and ZMQ retry/delay constants with Environment.ZMQ and Environment.SERVICE equivalents.
GPU Telemetry Configuration
src/aiperf/gpu_telemetry/__init__.py, constants.py, telemetry_data_collector.py, telemetry_manager.py, src/aiperf/__main__.py
Removes DEFAULT_DCGM_ENDPOINTS, DEFAULT_COLLECTION_INTERVAL, URL_REACHABILITY_TIMEOUT, THREAD_JOIN_TIMEOUT from public surface. TelemetryDataCollector signature changed to accept `collection_interval: float
Dataset & Export Management
src/aiperf/dataset/__init__.py, dataset_manager.py, exporters/experimental_metrics_console_exporter.py, exporters/internal_metrics_console_exporter.py
Replace DATASET_CONFIGURATION_TIMEOUT with Environment.DATASET.CONFIGURATION_TIMEOUT. Update dev-mode flags from AIPERF_DEV_MODE and service_config to Environment.DEV.MODE and Environment.DEV.SHOW_*_METRICS.
Records & Metrics Processing
src/aiperf/records/records_manager.py, src/aiperf/metrics/metric_dicts.py, post_processors/record_export_results_processor.py, timing/credit_issuing_strategy.py
Replace DEFAULT_PULL_CLIENT_MAX_CONCURRENCY, DEFAULT_REALTIME_METRICS_INTERVAL, DEFAULT_RECORDS_PROGRESS_REPORT_INTERVAL, DEFAULT_RECORD_EXPORT_BATCH_SIZE, DEFAULT_CREDIT_PROGRESS_REPORT_INTERVAL with Environment equivalents. MetricArray.init updated with Environment.METRICS.ARRAY_INITIAL_CAPACITY.
HTTP Transport & Socket Defaults
src/aiperf/transports/aiohttp_client.py, http_defaults.py, tests/clients/http/test_tcp_connector.py
Removes hardcoded socket option constants from SocketDefaults (TCP_KEEPIDLE, TCP_KEEPINTVL, SO_RCVBUF, SO_SNDTIMEO, etc.). AioHttpDefaults gains new get_default_kwargs() classmethod. All socket/connection settings now sourced from Environment.HTTP.
CLI & Error Handling
src/aiperf/cli_utils.py
Adds new print_developer_mode_warning() function with styled Rich Panel. Enhances error handling to print detailed exception traceback via console.print_exception with locals before error panel.
UI Dashboard Updates
src/aiperf/ui/dashboard/aiperf_textual_app.py, progress_dashboard.py, realtime_metrics_dashboard.py, rich_log_viewer.py, src/aiperf/ui/tqdm_ui.py
Replace NOTIFICATION_TIMEOUT, SPINNER_REFRESH_RATE, LOG_REFRESH_INTERVAL, DEFAULT_UI_MIN_UPDATE_PERCENT with Environment.UI equivalents. Update dev-mode checks to use Environment.DEV.MODE and Environment.DEV.SHOW_INTERNAL_METRICS.
Worker Management
src/aiperf/workers/worker.py, worker_manager.py
Replace AIPERF_HTTP_CONNECTION_LIMIT, DEFAULT_WORKER_HEALTH_CHECK_INTERVAL, and numerous DEFAULT_WORKER_* constants with Environment.HTTP and Environment.WORKER equivalents for health checks, recovery timeouts, and scaling.
ZMQ Package Updates
src/aiperf/zmq/__init__.py
Removes MAX_PUSH_RETRIES and RETRY_DELAY_INTERVAL_SEC from public exports.
Test Updates
tests/common/conftest.py, tests/config/test_dev_mode.py, tests/data_exporters/test_console_exporter.py, tests/gpu_telemetry/test_telemetry_manager.py, tests/post_processors/test_record_export_results_processor.py, tests/test_main.py, tests/ui/test_realtime_metrics_dashboard.py, tests/workers/test_worker_manager.py
Update fixtures and tests to use Environment instead of constants/DeveloperConfig. Replace capsys-based output validation with Environment flag assertions. Use monkeypatch to override Environment values in tests.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~70 minutes

Rationale: Sweeping refactor affecting 70+ files with significant scope, but organized into coherent, mostly repetitive patterns. Primary complexity stems from (1) new hierarchical Environment system design with nested subsystems and validation hooks, (2) removal of old config architecture (ServiceConfig inheritance change, DeveloperConfig elimination), (3) heterogeneous changes across service layers (command handlers, ZMQ, GPU telemetry, workers), and (4) transport/socket defaults restructuring. While many individual changes follow a straightforward constant-replacement pattern, the interdependencies between Environment subsystems, post-validation logic, and the architectural shift from BaseSettings to BaseConfig require careful reasoning. Logic density is moderate; change variety spans configuration plumbing, protocol defaults, and feature-flag migration.

Poem

🐰 Whiskers twitch with glee,
Constants tamed to harmony,
Environment now leads the way,
Configuration flows each day,
Seventy files sing in tune—
The refactor's sweetest boon! 🌟

Pre-merge checks

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 72.32% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The pull request title "feat: move constants to environment pydantic settings" accurately reflects the primary and most significant change in this changeset. The raw summary demonstrates extensive refactoring across the entire codebase where hardcoded constants previously defined in src/aiperf/common/constants.py have been systematically replaced with environment-based configuration values accessed through a new centralized Environment system using Pydantic BaseSettings. The title is specific, clear, and concise—a teammate reviewing the git history would immediately understand that this PR consolidates configuration management from scattered hardcoded constants into an environment-driven approach.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/aiperf/controller/multiprocess_service_manager.py (1)

217-226: Timeout parameter not enforced.

The timeout_seconds parameter is defined but never used in the method body. The warning message states this feature "is not implemented for multiprocessing," but the parameter signature suggests it should work.

Either implement the timeout or remove the parameter to clarify that timeouts are unsupported in multiprocessing mode.

Option 1: Remove the unused parameter

     async def wait_for_all_services_start(
         self,
         stop_event: asyncio.Event,
-        timeout_seconds: float = Environment.SERVICE.START_TIMEOUT,
     ) -> None:
         """Wait for all required services to be started."""
         self.debug("Waiting for all required services to start...")
         self.warning(
-            "Waiting for all required services to start is not implemented for multiprocessing"
+            "Timeout enforcement for service start is not implemented for multiprocessing"
         )

Option 2: Implement the timeout (if feasible)

     async def wait_for_all_services_start(
         self,
         stop_event: asyncio.Event,
         timeout_seconds: float = Environment.SERVICE.START_TIMEOUT,
     ) -> None:
         """Wait for all required services to be started."""
         self.debug("Waiting for all required services to start...")
-        self.warning(
-            "Waiting for all required services to start is not implemented for multiprocessing"
-        )
+        # TODO: Implement actual wait logic with timeout
+        await asyncio.sleep(0)  # Placeholder
src/aiperf/common/mixins/task_manager_mixin.py (1)

38-51: Critical: timeout parameter is declared but never used.

The timeout parameter is defined with a default value but is never applied when cancelling tasks. After calling task.cancel() on line 51, the method should wait for tasks to complete or timeout.

Apply this diff to properly implement the timeout:

 async def cancel_all_tasks(
         self, timeout: float = Environment.SERVICE.TASK_CANCEL_TIMEOUT_SHORT
     ) -> None:
         """Cancel all tasks in the set and wait for up to timeout seconds for them to complete.
 
         Args:
             timeout: The timeout to wait for the tasks to complete.
         """
         if not self.tasks:
             return
 
         task_list = list(self.tasks)
         for task in task_list:
             task.cancel()
+
+        try:
+            await asyncio.wait_for(
+                asyncio.gather(*task_list, return_exceptions=True),
+                timeout=timeout
+            )
+        except asyncio.TimeoutError:
+            self.warning(f"Task cancellation timed out after {timeout}s")
🧹 Nitpick comments (21)
src/aiperf/common/mixins/message_bus_mixin.py (1)

90-92: Add guards/constraints for probe timings to avoid misconfig pitfalls.
If CONNECTION_PROBE_INTERVAL <= 0, inner wait_for immediately times out and the loop can spin; if CONNECTION_PROBE_TIMEOUT <= 0, outer wait_for cancels instantly. Recommend enforcing gt>0 via pydantic Field constraints in Environment.SERVICE and optionally clamp locally.

Also applies to: 99-101

src/aiperf/common/environment.py (4)

481-487: Tighten BaseSettings extras handling.
Root model_config uses extra="allow". This can silently accept misspelled env-derived keys. Prefer extra="ignore" (or "forbid" if feasible) to fail fast on typos.

-    model_config = SettingsConfigDict(
-        env_prefix="AIPERF_",
-        env_file=".env",
-        env_file_encoding="utf-8",
-        extra="allow",
-    )
+    model_config = SettingsConfigDict(
+        env_prefix="AIPERF_",
+        env_file=".env",
+        env_file_encoding="utf-8",
+        extra="ignore",
+    )

277-334: Add numeric constraints to SERVICE timings to prevent invalid configs.
Several intervals/timeouts should be > 0 to avoid spins or instant timeouts (e.g., CONNECTION_PROBE_INTERVAL, HEARTBEAT_INTERVAL). Enforce with gt>0 (and sensible max if desired).

-    COMMAND_RESPONSE_TIMEOUT: float = Field(
+    COMMAND_RESPONSE_TIMEOUT: float = Field(
+        gt=0,
         default=30.0,
         description="Timeout in seconds for command responses",
     )
-    COMMS_REQUEST_TIMEOUT: float = Field(
+    COMMS_REQUEST_TIMEOUT: float = Field(
+        gt=0,
         default=90.0,
         description="Timeout in seconds for requests from req_clients to rep_clients",
     )
-    CONNECTION_PROBE_INTERVAL: float = Field(
+    CONNECTION_PROBE_INTERVAL: float = Field(
+        gt=0,
         default=0.1,
         description="Interval in seconds for connection probes while waiting for initial connection to the zmq message bus",
     )
-    CONNECTION_PROBE_TIMEOUT: float = Field(
+    CONNECTION_PROBE_TIMEOUT: float = Field(
+        gt=0,
         default=90.0,
         description="Maximum time in seconds to wait for connection probe response while waiting for initial connection to the zmq message bus",
     )
-    HEARTBEAT_INTERVAL: float = Field(
+    HEARTBEAT_INTERVAL: float = Field(
+        gt=0,
         default=5.0,
         description="Interval in seconds between heartbeat messages for component services",
     )
-    REGISTRATION_INTERVAL: float = Field(
+    REGISTRATION_INTERVAL: float = Field(
+        gt=0,
         default=1.0,
         description="Interval in seconds between registration attempts for component services",
     )
-    START_TIMEOUT: float = Field(
+    START_TIMEOUT: float = Field(
+        gt=0,
         default=30.0,
         description="Timeout in seconds for service start operations",
     )
-    TASK_CANCEL_TIMEOUT_SHORT: float = Field(
+    TASK_CANCEL_TIMEOUT_SHORT: float = Field(
+        gt=0,
         default=2.0,
         description="Maximum time in seconds to wait for simple tasks to complete when cancelling",
     )

126-133: Normalize DEFAULT_DCGM_ENDPOINTS to list[str].
parse_str_or_csv_list may return list[Any]; type here is str | list[str]. To avoid downstream branching/coercion, consider:

  • Make the field Annotated[list[str], BeforeValidator(parse_str_or_csv_list)] (always a list).
  • Adjust parse_str_or_csv_list to cast all elements to str.

Based on learnings (see src/aiperf/common/config/config_validators.py lines 48-73).


552-553: Consider a .reload() utility for test/runtime overrides.
A helper that rebuilds the singleton from env (and revalidates) avoids ad-hoc attribute mutation outside tests.

def reload_environment() -> None:
    global Environment
    Environment = _Environment()
src/aiperf/timing/credit_issuing_strategy.py (1)

343-343: Guard against zero/negative progress interval.
If CREDIT_PROGRESS_REPORT_INTERVAL <= 0, this loop can starve the event loop. Clamp or enforce gt>0 in Environment.SERVICE (preferred).

-            await asyncio.sleep(Environment.SERVICE.CREDIT_PROGRESS_REPORT_INTERVAL)
+            await asyncio.sleep(max(0.01, Environment.SERVICE.CREDIT_PROGRESS_REPORT_INTERVAL))
src/aiperf/dataset/dataset_manager.py (1)

356-358: Handle timeout with a clearer service error.
If configuration times out, asyncio.TimeoutError bubbles up. Convert to a service error with context.

-            await asyncio.wait_for(
-                self.dataset_configured.wait(),
-                timeout=Environment.DATASET.CONFIGURATION_TIMEOUT,
-            )
+            try:
+                await asyncio.wait_for(
+                    self.dataset_configured.wait(),
+                    timeout=Environment.DATASET.CONFIGURATION_TIMEOUT,
+                )
+            except asyncio.TimeoutError as e:
+                raise self._service_error(
+                    f"Dataset configuration timed out after {Environment.DATASET.CONFIGURATION_TIMEOUT}s"
+                ) from e
tests/common/conftest.py (1)

39-46: Implementation is sound; verify the fixture naming is intentional.

The verification confirms bootstrap.py properly reads Environment.SERVICE.DISABLE_UVLOOP at lines 122–127 and toggles between uvloop and asyncio accordingly. The monkeypatch approach in the fixture correctly sets this flag for tests.

Consider whether the fixture name service_config_no_uvloop accurately reflects that it's toggling an environment variable rather than modifying the config object itself—a name like disable_uvloop_env may better communicate intent.

src/aiperf/common/config/worker_config.py (1)

30-34: Avoid binding help text to runtime env at import time

The Field description f-string snapshots Environment.WORKER.MAX_WORKERS_CAP at import; if env differs between build and run, the help text can mislead. Consider a static reference (no f-string) or compute help text lazily in CLI/help generation.

src/aiperf/controller/proxy_manager.py (1)

55-63: Confirm timeout units and narrow exception scope

  • Ensure Environment.ZMQ.CONTEXT_TERM_TIMEOUT is a finite number in seconds (float/int). If it’s 0/None/negative, wait_for will misbehave. Add validation or a floor.
  • Avoid catching BaseException; it swallows asyncio.CancelledError. Prefer Exception, and optionally handle asyncio.TimeoutError explicitly.
-        except BaseException as e:
+        except asyncio.TimeoutError as e:
+            self.warning(f"Timed out terminating ZMQ context after {Environment.ZMQ.CONTEXT_TERM_TIMEOUT}s: {e}")
+        except Exception as e:
             self.warning(f"Error terminating ZMQ context: {e}")
tests/clients/http/test_tcp_connector.py (1)

117-147: Patch Linux TCP symbols before invoking the factory

In test_socket_factory_linux_specific_options, the patch of socket. happens after socket_factory(addr_info) runs, so it doesn’t influence the hasattr checks inside apply_to_socket. Move the patch before invoking socket_factory (or construct socket_factory after patching) to deterministically exercise those branches.

-                socket_factory = mock_connector_class.call_args[1]["socket_factory"]
-                with patch("socket.socket") as mock_socket_class:
+                socket_factory = mock_connector_class.call_args[1]["socket_factory"]
+                # Patch OS symbol before creating/using the socket to ensure the branch is taken
+                if has_attribute and expected_value is not None:
+                    with patch.object(socket, attribute_name, tcp_option, create=True):
+                        pass
+                with patch("socket.socket") as mock_socket_class:
                     ...
-                    socket_factory(addr_info)
+                    socket_factory(addr_info)
-                    if has_attribute and expected_value is not None:
-                        with patch.object(socket, attribute_name, expected_value, create=True):
-                            mock_socket.setsockopt.assert_any_call(
-                                socket.SOL_TCP, tcp_option, expected_value
-                            )
+                    if has_attribute and expected_value is not None:
+                        mock_socket.setsockopt.assert_any_call(
+                            socket.SOL_TCP, tcp_option, expected_value
+                        )
src/aiperf/workers/worker_manager.py (3)

71-79: Auto max-workers formula is fine; consider edge-proofing

The min(int(cpu * factor) - 1, MAX_WORKERS_CAP) then max(..., 1) is sensible. Recommend logging inputs (cpu_count, factor, cap) at trace level to aid tuning and diagnosing odd env values.


91-95: Validate min/max consistency

If users set workers.min > workers.max, max is silently raised to min here. Consider validating and warning (or erroring) early to surface misconfig.

-        self.max_workers = max(
+        if (self.service_config.workers.min or 1) > self.max_workers:
+            self.warning("workers.min exceeds workers.max; raising max to min")
+        self.max_workers = max(
             self.max_workers,
             self.service_config.workers.min or 1,
         )

149-160: Status timing math OK; minor readability tweak

The repeated (time.time_ns() - (ts or 0)) / NANOS_PER_SECOND appears multiple times. Consider factoring to a helper for clarity and to avoid subtle unit mixups.

src/aiperf/metrics/metric_dicts.py (1)

71-76: Docstring out of sync with parameters

to_display_dict now has show_internal and show_experimental, but docstring only mentions show_internal. Update for accuracy.

-        Args:
-            registry: MetricRegistry class for looking up metric definitions
-            show_internal: If True, include experimental/internal metrics
+        Args:
+            registry: MetricRegistry class for metric definitions.
+            show_internal: Include metrics flagged INTERNAL when True.
+            show_experimental: Include metrics flagged EXPERIMENTAL when True.
src/aiperf/transports/http_defaults.py (2)

33-48: Validate env-provided socket values

Environment.HTTP.TCP_KEEPIDLE/INTVL/KEEPCNT/SO_RCVBUF/SO_SNDBUF/TCP_USER_TIMEOUT should be validated (positive, sensible ranges). Bad env values can raise or degrade performance.

Consider guarding with max(..., 1) where appropriate and logging when clamping.


71-73: Clarify limit_per_host semantics

Comment says “0 will set to LIMIT,” but in aiohttp 3.x, limit_per_host=0 means “no per‑host limit,” not “inherit LIMIT.” Suggest updating the comment to avoid confusion.

-    LIMIT_PER_HOST = (
-        0  # Maximum number of concurrent connections per host (0 will set to LIMIT)
-    )
+    LIMIT_PER_HOST = 0  # Per-host limit; 0 disables per-host limiting (uses global LIMIT only)
tests/test_main.py (1)

78-82: Consider simplifying assertion logic for clarity.

The ternary expression is functional but hard to read. A more explicit assertion would improve maintainability:

-        assert (
-            captured_user_config.gpu_telemetry == expected_gpu_telemetry
-            if captured_user_config
-            else []
-        )
+        assert captured_user_config is not None, "Mock was not called"
+        assert captured_user_config.gpu_telemetry == expected_gpu_telemetry
src/aiperf/common/bootstrap.py (1)

122-127: Note: Double negative in condition.

The condition not Environment.SERVICE.DISABLE_UVLOOP works correctly but uses a double negative. Consider renaming to Environment.SERVICE.ENABLE_UVLOOP in a future refactor for improved clarity.

src/aiperf/zmq/push_client.py (1)

69-84: Consider using direct default for consistency and simplicity.

The lazy resolution pattern (accepting None and resolving at runtime) adds complexity compared to using Environment.ZMQ.PUSH_MAX_RETRIES directly as the default value. This is inconsistent with Line 99, which directly references Environment.ZMQ.PUSH_RETRY_DELAY without lazy resolution.

Consider simplifying to:

     async def _push_message(
         self,
         message: Message,
         retry_count: int = 0,
-        max_retries: int | None = None,
+        max_retries: int = Environment.ZMQ.PUSH_MAX_RETRIES,
     ) -> None:
         """Push a message to the socket. Will retry up to max_retries times.
 
         Args:
             message: Message to be sent must be a Message object
             retry_count: Current retry count
-            max_retries: Maximum number of times to retry pushing the message (defaults to Environment.ZMQ.PUSH_MAX_RETRIES)
+            max_retries: Maximum number of times to retry pushing the message
         """
-        if max_retries is None:
-            max_retries = Environment.ZMQ.PUSH_MAX_RETRIES
-
         try:

This matches the pattern used in command_handler_mixin.py (lines 169-172, 194) and reduces the maintenance burden.

src/aiperf/gpu_telemetry/telemetry_data_collector.py (1)

46-61: Consider using direct default for consistency and simplicity.

Similar to push_client.py, this lazy resolution pattern (accepting None and resolving in __init__) adds complexity compared to using Environment.GPU.COLLECTION_INTERVAL directly as the default. This is inconsistent with Lines 76 and 122, which directly reference Environment.GPU.REACHABILITY_TIMEOUT without lazy resolution.

Consider simplifying to:

     def __init__(
         self,
         dcgm_url: str,
-        collection_interval: float | None = None,
+        collection_interval: float = Environment.GPU.COLLECTION_INTERVAL,
         record_callback: Callable[[list[TelemetryRecord], str], Awaitable[None] | None]
         | None = None,
         error_callback: Callable[[ErrorDetails, str], Awaitable[None] | None]
         | None = None,
         collector_id: str = "telemetry_collector",
     ) -> None:
         self._dcgm_url = dcgm_url
-        self._collection_interval = (
-            collection_interval
-            if collection_interval is not None
-            else Environment.GPU.COLLECTION_INTERVAL
-        )
+        self._collection_interval = collection_interval

This matches the simpler pattern used in command_handler_mixin.py and reduces maintenance burden.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7ff91a7 and dbb8b1e.

📒 Files selected for processing (59)
  • src/aiperf/__main__.py (2 hunks)
  • src/aiperf/cli_utils.py (2 hunks)
  • src/aiperf/common/base_component_service.py (3 hunks)
  • src/aiperf/common/bootstrap.py (3 hunks)
  • src/aiperf/common/config/__init__.py (0 hunks)
  • src/aiperf/common/config/cli_parameter.py (0 hunks)
  • src/aiperf/common/config/config_defaults.py (0 hunks)
  • src/aiperf/common/config/config_validators.py (1 hunks)
  • src/aiperf/common/config/dev_config.py (0 hunks)
  • src/aiperf/common/config/service_config.py (2 hunks)
  • src/aiperf/common/config/worker_config.py (2 hunks)
  • src/aiperf/common/constants.py (0 hunks)
  • src/aiperf/common/environment.py (1 hunks)
  • src/aiperf/common/logging.py (3 hunks)
  • src/aiperf/common/mixins/command_handler_mixin.py (3 hunks)
  • src/aiperf/common/mixins/message_bus_mixin.py (3 hunks)
  • src/aiperf/common/mixins/task_manager_mixin.py (2 hunks)
  • src/aiperf/common/protocols.py (3 hunks)
  • src/aiperf/controller/base_service_manager.py (2 hunks)
  • src/aiperf/controller/kubernetes_service_manager.py (3 hunks)
  • src/aiperf/controller/multiprocess_service_manager.py (4 hunks)
  • src/aiperf/controller/proxy_manager.py (2 hunks)
  • src/aiperf/controller/system_controller.py (6 hunks)
  • src/aiperf/dataset/__init__.py (0 hunks)
  • src/aiperf/dataset/dataset_manager.py (2 hunks)
  • src/aiperf/exporters/experimental_metrics_console_exporter.py (2 hunks)
  • src/aiperf/exporters/internal_metrics_console_exporter.py (2 hunks)
  • src/aiperf/gpu_telemetry/__init__.py (0 hunks)
  • src/aiperf/gpu_telemetry/constants.py (0 hunks)
  • src/aiperf/gpu_telemetry/telemetry_data_collector.py (4 hunks)
  • src/aiperf/gpu_telemetry/telemetry_manager.py (4 hunks)
  • src/aiperf/metrics/metric_dicts.py (4 hunks)
  • src/aiperf/post_processors/record_export_results_processor.py (3 hunks)
  • src/aiperf/records/record_processor_service.py (2 hunks)
  • src/aiperf/records/records_manager.py (5 hunks)
  • src/aiperf/timing/credit_issuing_strategy.py (2 hunks)
  • src/aiperf/transports/aiohttp_client.py (1 hunks)
  • src/aiperf/transports/http_defaults.py (3 hunks)
  • src/aiperf/ui/dashboard/aiperf_textual_app.py (3 hunks)
  • src/aiperf/ui/dashboard/progress_dashboard.py (2 hunks)
  • src/aiperf/ui/dashboard/realtime_metrics_dashboard.py (2 hunks)
  • src/aiperf/ui/dashboard/rich_log_viewer.py (2 hunks)
  • src/aiperf/ui/tqdm_ui.py (2 hunks)
  • src/aiperf/workers/worker.py (2 hunks)
  • src/aiperf/workers/worker_manager.py (4 hunks)
  • src/aiperf/zmq/__init__.py (0 hunks)
  • src/aiperf/zmq/dealer_request_client.py (2 hunks)
  • src/aiperf/zmq/pull_client.py (2 hunks)
  • src/aiperf/zmq/push_client.py (3 hunks)
  • src/aiperf/zmq/zmq_defaults.py (2 hunks)
  • tests/clients/http/test_tcp_connector.py (4 hunks)
  • tests/common/conftest.py (1 hunks)
  • tests/config/test_dev_mode.py (1 hunks)
  • tests/data_exporters/test_console_exporter.py (1 hunks)
  • tests/gpu_telemetry/test_telemetry_manager.py (17 hunks)
  • tests/post_processors/test_record_export_results_processor.py (4 hunks)
  • tests/test_main.py (3 hunks)
  • tests/ui/test_realtime_metrics_dashboard.py (2 hunks)
  • tests/workers/test_worker_manager.py (3 hunks)
💤 Files with no reviewable changes (9)
  • src/aiperf/common/config/cli_parameter.py
  • src/aiperf/common/config/config_defaults.py
  • src/aiperf/dataset/init.py
  • src/aiperf/zmq/init.py
  • src/aiperf/common/config/init.py
  • src/aiperf/gpu_telemetry/init.py
  • src/aiperf/gpu_telemetry/constants.py
  • src/aiperf/common/constants.py
  • src/aiperf/common/config/dev_config.py
🧰 Additional context used
🧬 Code graph analysis (15)
src/aiperf/transports/aiohttp_client.py (1)
src/aiperf/transports/http_defaults.py (2)
  • AioHttpDefaults (64-94)
  • get_default_kwargs (82-94)
tests/data_exporters/test_console_exporter.py (3)
tests/post_processors/test_record_export_results_processor.py (1)
  • service_config (56-58)
tests/conftest.py (1)
  • service_config (186-190)
src/aiperf/common/config/service_config.py (1)
  • ServiceConfig (28-160)
src/aiperf/timing/credit_issuing_strategy.py (1)
tests/utils/time_traveler.py (1)
  • sleep (37-43)
src/aiperf/workers/worker_manager.py (1)
tests/utils/time_traveler.py (2)
  • time (48-49)
  • time_ns (45-46)
tests/post_processors/test_record_export_results_processor.py (1)
tests/conftest.py (1)
  • mock_aiofiles_stringio (354-382)
src/aiperf/common/protocols.py (3)
src/aiperf/controller/base_service_manager.py (1)
  • wait_for_all_services_start (112-117)
src/aiperf/controller/kubernetes_service_manager.py (1)
  • wait_for_all_services_start (81-93)
src/aiperf/controller/multiprocess_service_manager.py (1)
  • wait_for_all_services_start (217-226)
src/aiperf/gpu_telemetry/telemetry_manager.py (1)
tests/utils/time_traveler.py (1)
  • sleep (37-43)
tests/common/conftest.py (2)
tests/conftest.py (1)
  • service_config (186-190)
src/aiperf/common/config/service_config.py (1)
  • ServiceConfig (28-160)
src/aiperf/common/environment.py (1)
src/aiperf/common/config/config_validators.py (2)
  • parse_service_types (77-86)
  • parse_str_or_csv_list (49-74)
src/aiperf/zmq/push_client.py (1)
tests/utils/time_traveler.py (1)
  • sleep (37-43)
src/aiperf/records/records_manager.py (1)
tests/utils/time_traveler.py (1)
  • sleep (37-43)
src/aiperf/controller/system_controller.py (2)
src/aiperf/cli_utils.py (1)
  • print_developer_mode_warning (127-144)
src/aiperf/common/config/config_defaults.py (1)
  • OutputDefaults (129-137)
src/aiperf/controller/base_service_manager.py (3)
src/aiperf/common/protocols.py (1)
  • wait_for_all_services_start (465-469)
src/aiperf/controller/kubernetes_service_manager.py (1)
  • wait_for_all_services_start (81-93)
src/aiperf/controller/multiprocess_service_manager.py (1)
  • wait_for_all_services_start (217-226)
tests/ui/test_realtime_metrics_dashboard.py (2)
src/aiperf/common/config/service_config.py (1)
  • ServiceConfig (28-160)
src/aiperf/ui/dashboard/realtime_metrics_dashboard.py (1)
  • _should_skip (63-73)
src/aiperf/common/base_component_service.py (1)
src/aiperf/common/mixins/command_handler_mixin.py (1)
  • send_command_and_wait_for_response (169-188)
🪛 Ruff (0.14.1)
src/aiperf/common/mixins/task_manager_mixin.py

39-39: Unused method argument: timeout

(ARG002)

src/aiperf/controller/multiprocess_service_manager.py

220-220: Unused method argument: timeout_seconds

(ARG002)

src/aiperf/controller/kubernetes_service_manager.py

70-70: Unused method argument: timeout_seconds

(ARG002)


84-84: Unused method argument: timeout_seconds

(ARG002)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: integration-tests (ubuntu-latest, 3.12)
  • GitHub Check: integration-tests (ubuntu-latest, 3.11)
  • GitHub Check: integration-tests (ubuntu-latest, 3.10)
  • GitHub Check: build (ubuntu-latest, 3.10)
  • GitHub Check: build (macos-latest, 3.10)
  • GitHub Check: build (ubuntu-latest, 3.11)
  • GitHub Check: build (macos-latest, 3.11)
  • GitHub Check: build (ubuntu-latest, 3.12)
  • GitHub Check: build (macos-latest, 3.12)
🔇 Additional comments (60)
src/aiperf/common/config/config_validators.py (1)

12-12: LGTM! More specific import improves clarity.

The import path change from a general module to a specific submodule makes the dependency more explicit and aligns with the PR's organizational improvements.

src/aiperf/common/mixins/message_bus_mixin.py (1)

13-13: LGTM: switch to Environment-driven config.
Consistent with the PR goals; import is correct.

src/aiperf/common/environment.py (1)

535-549: Validator behavior is sound.
Silently downgrading DEV SHOW_* flags when DEV.MODE is false with a warning is a good safeguard.

src/aiperf/timing/credit_issuing_strategy.py (1)

8-11: LGTM: replaced constant with Environment and kept NANOS_PER_SECOND import.
Matches the new configuration approach.

src/aiperf/dataset/dataset_manager.py (1)

22-22: LGTM: move dataset timeout source to Environment.
Keeps behavior configurable without code changes.

tests/workers/test_worker_manager.py (2)

15-15: Good switch to env-driven thresholds

Tests now follow Environment.WORKER.*; keeps expectations aligned with runtime config.


190-205: Boundary and recovery checks match implementation semantics

  • Using cpu_usage == Environment.WORKER.HIGH_LOAD_CPU_USAGE to assert “no warning” matches the code’s strict “>” comparison.
  • Recovery parametrization using Environment.WORKER.HIGH_LOAD_RECOVERY_TIME validates both sides of the threshold.

If HIGH_LOAD_CPU_USAGE becomes configurable per test run, consider parametrizing it once via a fixture to avoid repetition.

tests/clients/http/test_tcp_connector.py (1)

38-38: Defaults assertion looks good

Verifying limit against Environment.HTTP.CONNECTION_LIMIT keeps the test resilient to env changes.

src/aiperf/workers/worker.py (1)

86-87: LGTM on env-driven health interval

Reading Environment.WORKER.HEALTH_CHECK_INTERVAL centralizes tuning.

src/aiperf/metrics/metric_dicts.py (1)

149-151: LGTM on env-driven initial capacity

MetricArray now sources initial_capacity from Environment.METRICS.ARRAY_INITIAL_CAPACITY and validates > 0. Looks good.

src/aiperf/transports/http_defaults.py (1)

81-94: New helper API looks good

AioHttpDefaults.get_default_kwargs centralizes aiohttp defaults. Ensure create_tcp_connector uses this to keep tests and runtime in sync.

src/aiperf/transports/aiohttp_client.py (1)

267-268: LGTM! Centralized configuration approach.

The change successfully moves default TCP connector configuration to AioHttpDefaults.get_default_kwargs(), aligning with the PR's environment-based configuration strategy while preserving the socket_factory injection logic.

src/aiperf/ui/dashboard/aiperf_textual_app.py (2)

85-86: LGTM! Environment-based developer mode detection.

The switch to Environment.DEV.MODE centralizes developer mode configuration and aligns with the PR's objectives.


142-145: Good documentation improvement.

The added comments clarify the importance of signal forwarding to prevent process hangs.

src/aiperf/ui/dashboard/realtime_metrics_dashboard.py (1)

72-73: LGTM! Centralized internal metrics visibility control.

The change to Environment.DEV.SHOW_INTERNAL_METRICS aligns with the environment-based configuration approach and removes the dependency on per-service developer flags.

src/aiperf/ui/tqdm_ui.py (1)

42-42: LGTM! Environment-based UI update threshold.

The change to Environment.UI.MIN_UPDATE_PERCENT enables runtime configuration of progress bar update frequency.

src/aiperf/zmq/pull_client.py (1)

81-82: LGTM! Environment-based concurrency configuration.

The change to Environment.ZMQ.PULL_MAX_CONCURRENCY enables runtime configuration of pull client concurrency limits.

tests/test_main.py (1)

28-41: LGTM! Environment-based test expectations.

The test parameters correctly use Environment.GPU.DEFAULT_DCGM_ENDPOINTS for default GPU telemetry endpoints, aligning with the environment-based configuration approach.

src/aiperf/exporters/internal_metrics_console_exporter.py (1)

30-32: LGTM! Environment-based internal metrics control.

The change to Environment.DEV.MODE and Environment.DEV.SHOW_INTERNAL_METRICS centralizes developer-mode feature flags and removes dependency on service-level configuration.

src/aiperf/common/base_component_service.py (2)

56-56: LGTM! Environment-based heartbeat interval.

The change to Environment.SERVICE.HEARTBEAT_INTERVAL enables runtime configuration of heartbeat frequency.


82-87: LGTM! Environment-based registration configuration.

The changes to Environment.SERVICE.REGISTRATION_MAX_ATTEMPTS and Environment.SERVICE.REGISTRATION_INTERVAL provide consistent, centralized control over service registration behavior.

src/aiperf/records/record_processor_service.py (1)

13-13: LGTM! Clean migration to environment-based configuration.

The replacement of DEFAULT_PULL_CLIENT_MAX_CONCURRENCY with Environment.ZMQ.PULL_MAX_CONCURRENCY aligns with the PR's goal of centralizing configuration via the Environment singleton.

Also applies to: 62-62

src/aiperf/cli_utils.py (1)

127-144: LGTM! Well-structured developer mode warning.

The new print_developer_mode_warning() function provides clear, styled feedback to users when developer mode is active.

src/aiperf/exporters/experimental_metrics_console_exporter.py (1)

10-10: LGTM! Clean migration to environment-based feature flags.

The gating logic now correctly uses Environment.DEV.MODE and Environment.DEV.SHOW_EXPERIMENTAL_METRICS, replacing the previous scattered configuration approach.

Also applies to: 29-31

tests/post_processors/test_record_export_results_processor.py (1)

18-18: LGTM! Test correctly updated for environment-based configuration.

The test now patches Environment.DEV.* attributes and uses Environment.RECORD.EXPORT_BATCH_SIZE, aligning with the centralized configuration approach.

Also applies to: 183-186, 600-600, 613-613, 618-618

src/aiperf/controller/multiprocess_service_manager.py (1)

15-15: LGTM! Clean migration to environment-based timeouts.

The timeout values are now sourced from Environment.SERVICE.*, providing centralized configuration for service lifecycle management.

Also applies to: 138-138, 206-206

tests/ui/test_realtime_metrics_dashboard.py (1)

4-4: LGTM! Clean test refactoring to environment-based configuration.

The test now uses patch.object to control Environment.DEV.SHOW_INTERNAL_METRICS, replacing the fixture-based approach with DeveloperConfig. This aligns with the centralized environment configuration pattern.

Also applies to: 9-9, 51-62

tests/data_exporters/test_console_exporter.py (1)

130-130: LGTM! Simplified test configuration.

The test now relies on default ServiceConfig() behavior, which sources developer mode settings from the Environment singleton rather than explicit configuration.

src/aiperf/__main__.py (1)

7-7: LGTM! Clean migration to environment-based GPU configuration.

The DCGM endpoint defaults are now sourced from Environment.GPU.DEFAULT_DCGM_ENDPOINTS, consistent with the centralized configuration approach.

Also applies to: 18-18

src/aiperf/common/logging.py (2)

34-36: LGTM: Queue size now environment-configurable.

The migration from a hardcoded constant to Environment.LOGGING.QUEUE_MAXSIZE provides better configurability for the logging queue.


102-109: LGTM: Debug/trace service filtering now environment-driven.

The migration to Environment.DEV.TRACE_SERVICES and Environment.DEV.DEBUG_SERVICES aligns with the centralized configuration approach.

src/aiperf/ui/dashboard/progress_dashboard.py (1)

81-81: LGTM: UI refresh rate now environment-configurable.

Replacing the hardcoded SPINNER_REFRESH_RATE with Environment.UI.SPINNER_REFRESH_RATE enables runtime configuration of UI timing.

src/aiperf/common/bootstrap.py (1)

56-57: LGTM: YAPPI profiling now environment-controlled.

The migration to Environment.DEV.ENABLE_YAPPI provides a cleaner way to control profiling behavior.

src/aiperf/post_processors/record_export_results_processor.py (2)

46-51: LGTM: Metric visibility flags now environment-driven.

The migration to Environment.DEV.SHOW_INTERNAL_METRICS and introduction of Environment.DEV.SHOW_EXPERIMENTAL_METRICS provides flexible control over metric visibility in exports.


58-58: LGTM: Export batch size now configurable.

Using Environment.RECORD.EXPORT_BATCH_SIZE enables tuning of batch processing performance.

src/aiperf/ui/dashboard/rich_log_viewer.py (1)

109-109: LGTM: Log refresh interval now environment-configurable.

The migration to Environment.UI.LOG_REFRESH_INTERVAL enables tuning of log consumption frequency.

src/aiperf/zmq/dealer_request_client.py (1)

130-130: LGTM: Request timeout now environment-configurable.

The migration to Environment.SERVICE.COMMS_REQUEST_TIMEOUT enables runtime configuration of communication timeouts. This is a public API change, but should be transparent if the environment default matches the previous constant value.

tests/gpu_telemetry/test_telemetry_manager.py (1)

53-53: LGTM: Test suite updated for environment-based configuration.

All test assertions now reference Environment.GPU.DEFAULT_DCGM_ENDPOINTS, ensuring tests validate the centralized configuration system.

src/aiperf/gpu_telemetry/telemetry_manager.py (4)

100-111: LGTM: Endpoint aggregation now environment-driven.

The migration to Environment.GPU.DEFAULT_DCGM_ENDPOINTS for default endpoint handling maintains the existing deduplication and ordering logic while enabling centralized configuration.


113-113: LGTM: Collection interval now configurable.

Using Environment.GPU.COLLECTION_INTERVAL enables runtime tuning of telemetry collection frequency.


206-210: LGTM: Reachable defaults computation updated.

Correctly references Environment.GPU.DEFAULT_DCGM_ENDPOINTS for filtering reachable default endpoints.


295-295: LGTM: Shutdown delay now configurable.

The migration to Environment.GPU.SHUTDOWN_DELAY provides flexibility for adjusting the delay before service shutdown.

src/aiperf/zmq/zmq_defaults.py (1)

4-4: LGTM! Clean migration to Environment-based configuration.

The ZMQ socket defaults are now sourced from the centralized Environment configuration, allowing runtime tunability. The updated docstring clearly documents this behavior change.

Also applies to: 32-42

src/aiperf/controller/system_controller.py (4)

84-86: LGTM! Developer mode warning at initialization.

The developer mode check now uses Environment.DEV.MODE, providing a centralized way to gate developer-only features.


211-211: LGTM! Profile timeouts now environment-driven.

The profile configuration and start timeouts are now sourced from Environment.SERVICE, enabling runtime configuration.

Also applies to: 225-225


412-414: LGTM! Record processor scaling now environment-driven.

The scaling factor is sourced from Environment.RECORD.PROCESSOR_SCALE_FACTOR, making it configurable at runtime.


564-566: LGTM! Developer mode warning on exit.

Consistent with the initialization warning, this ensures developers are reminded that developer mode is active throughout the execution lifecycle.

src/aiperf/controller/base_service_manager.py (1)

8-8: LGTM! Service manager timeouts now environment-driven.

The abstract method signatures now use Environment.SERVICE.REGISTRATION_TIMEOUT and Environment.SERVICE.START_TIMEOUT, ensuring consistent timeout defaults across all service manager implementations.

Also applies to: 104-116

src/aiperf/common/protocols.py (1)

13-13: LGTM! Protocol timeouts now environment-driven.

The protocol definitions for RequestClientProtocol and ServiceManagerProtocol now use Environment-based timeout defaults, ensuring all implementing classes inherit consistent, configurable behavior.

Also applies to: 170-170, 462-462, 468-468

src/aiperf/controller/kubernetes_service_manager.py (1)

10-10: LGTM! Kubernetes service manager aligned with Environment-based configuration.

The method signatures now use Environment.SERVICE.REGISTRATION_TIMEOUT and Environment.SERVICE.START_TIMEOUT. The unused parameter warnings from static analysis are expected since these are stub methods that raise NotImplementedError.

Also applies to: 67-93

tests/config/test_dev_mode.py (1)

9-25: LGTM! Tests updated for Environment-based configuration.

The tests now properly reload the environment module after setting environment variables via monkeypatch, then assert against Environment.DEV.MODE. This correctly validates the environment-driven developer mode configuration.

src/aiperf/common/config/service_config.py (1)

9-9: LGTM! ServiceConfig migrated from BaseSettings to BaseConfig.

The change from BaseSettings to BaseConfig aligns with the shift to centralized environment-driven configuration via the Environment singleton, removing the need for per-config environment variable loading.

Also applies to: 28-28

src/aiperf/zmq/push_client.py (1)

99-99: LGTM!

The direct use of Environment.ZMQ.PUSH_RETRY_DELAY is correct and follows best practices for environment-driven configuration.

src/aiperf/records/records_manager.py (3)

82-82: LGTM!

The direct use of Environment.ZMQ.PULL_MAX_CONCURRENCY properly centralizes the concurrency configuration.


378-380: LGTM!

Using Environment.RECORD.PROGRESS_REPORT_INTERVAL for the background task interval correctly externalizes the reporting frequency configuration.


426-426: LGTM!

The use of Environment.UI.REALTIME_METRICS_INTERVAL correctly makes the realtime metrics update interval configurable via environment settings.

src/aiperf/common/mixins/command_handler_mixin.py (2)

169-172: LGTM! Excellent pattern.

The direct default timeout: float = Environment.SERVICE.COMMAND_RESPONSE_TIMEOUT is clean and straightforward. This pattern avoids the complexity of lazy resolution while achieving the same goal of environment-driven configuration.


194-194: LGTM!

Consistent use of Environment.SERVICE.COMMAND_RESPONSE_TIMEOUT as the default timeout value.

src/aiperf/gpu_telemetry/telemetry_data_collector.py (2)

76-76: LGTM!

The direct use of Environment.GPU.REACHABILITY_TIMEOUT correctly centralizes the timeout configuration for the HTTP client.


122-122: LGTM!

Consistent use of Environment.GPU.REACHABILITY_TIMEOUT for the temporary session timeout.

@ajcasagrande ajcasagrande self-assigned this Oct 23, 2025
@ilana-n
Copy link
Contributor

ilana-n commented Oct 24, 2025

nice! I like the consolidation, makes it easier to find constants in the future.

@ajcasagrande ajcasagrande merged commit 282f61e into main Oct 24, 2025
12 checks passed
@ajcasagrande ajcasagrande deleted the ajc/env branch October 24, 2025 20:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants