feat: move constants to environment pydantic settings #390
Conversation
Walkthrough

This PR establishes a centralized, hierarchical configuration system via Environment using Pydantic BaseSettings, replacing scattered constants and config classes throughout the codebase. It removes developer-only classes (DeveloperConfig, DeveloperOnlyCLI, DevDefaults), converts ServiceConfig from BaseSettings to BaseConfig, and updates ~70 files to source configuration from Environment.
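A minimal sketch of the resulting pattern (the nested subsystem class shown here is illustrative; the `AIPERF_` prefix, `.env` loading, and module-level `Environment = _Environment()` singleton are taken from the diff):

```python
from pydantic import Field
from pydantic_settings import BaseSettings, SettingsConfigDict


class _ServiceSettings(BaseSettings):
    # One of several nested subsystems (SERVICE, ZMQ, GPU, UI, DEV, ...)
    HEARTBEAT_INTERVAL: float = Field(
        default=5.0,
        description="Interval in seconds between heartbeat messages for component services",
    )


class _Environment(BaseSettings):
    model_config = SettingsConfigDict(
        env_prefix="AIPERF_",
        env_file=".env",
        env_file_encoding="utf-8",
    )
    SERVICE: _ServiceSettings = Field(default_factory=_ServiceSettings)


# Module-level singleton, consumed as Environment.SERVICE.HEARTBEAT_INTERVAL
Environment = _Environment()
```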
Estimated code review effort

🎯 4 (Complex) | ⏱️ ~70 minutes

Rationale: Sweeping refactor affecting 70+ files with significant scope, but organized into coherent, mostly repetitive patterns. Primary complexity stems from (1) the new hierarchical Environment system design with nested subsystems and validation hooks, (2) removal of the old config architecture (ServiceConfig inheritance change, DeveloperConfig elimination), (3) heterogeneous changes across service layers (command handlers, ZMQ, GPU telemetry, workers), and (4) transport/socket defaults restructuring. While many individual changes follow a straightforward constant-replacement pattern, the interdependencies between Environment subsystems, post-validation logic, and the architectural shift from BaseSettings to BaseConfig require careful reasoning. Logic density is moderate; change variety spans configuration plumbing, protocol defaults, and feature-flag migration.
Pre-merge checks

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/aiperf/controller/multiprocess_service_manager.py (1)
217-226: Timeout parameter not enforced.

The `timeout_seconds` parameter is defined but never used in the method body. The warning message states this feature "is not implemented for multiprocessing," but the parameter signature suggests it should work. Either implement the timeout or remove the parameter to clarify that timeouts are unsupported in multiprocessing mode.

Option 1: Remove the unused parameter

```diff
 async def wait_for_all_services_start(
     self,
     stop_event: asyncio.Event,
-    timeout_seconds: float = Environment.SERVICE.START_TIMEOUT,
 ) -> None:
     """Wait for all required services to be started."""
     self.debug("Waiting for all required services to start...")
     self.warning(
-        "Waiting for all required services to start is not implemented for multiprocessing"
+        "Timeout enforcement for service start is not implemented for multiprocessing"
     )
```

Option 2: Implement the timeout (if feasible)

```diff
 async def wait_for_all_services_start(
     self,
     stop_event: asyncio.Event,
     timeout_seconds: float = Environment.SERVICE.START_TIMEOUT,
 ) -> None:
     """Wait for all required services to be started."""
     self.debug("Waiting for all required services to start...")
-    self.warning(
-        "Waiting for all required services to start is not implemented for multiprocessing"
-    )
+    # TODO: Implement actual wait logic with timeout
+    await asyncio.sleep(0)  # Placeholder
```

src/aiperf/common/mixins/task_manager_mixin.py (1)
38-51: Critical: timeout parameter is declared but never used.

The `timeout` parameter is defined with a default value but is never applied when cancelling tasks. After calling `task.cancel()` on line 51, the method should wait for tasks to complete or time out. Apply this diff to properly implement the timeout:

```diff
 async def cancel_all_tasks(
     self, timeout: float = Environment.SERVICE.TASK_CANCEL_TIMEOUT_SHORT
 ) -> None:
     """Cancel all tasks in the set and wait for up to timeout seconds for them to complete.

     Args:
         timeout: The timeout to wait for the tasks to complete.
     """
     if not self.tasks:
         return

     task_list = list(self.tasks)
     for task in task_list:
         task.cancel()
+
+    try:
+        await asyncio.wait_for(
+            asyncio.gather(*task_list, return_exceptions=True),
+            timeout=timeout,
+        )
+    except asyncio.TimeoutError:
+        self.warning(f"Task cancellation timed out after {timeout}s")
```
🧹 Nitpick comments (21)
src/aiperf/common/mixins/message_bus_mixin.py (1)
90-92: Add guards/constraints for probe timings to avoid misconfig pitfalls.
If CONNECTION_PROBE_INTERVAL <= 0, the inner wait_for immediately times out and the loop can spin; if CONNECTION_PROBE_TIMEOUT <= 0, the outer wait_for cancels instantly. Recommend enforcing gt=0 via pydantic Field constraints in Environment.SERVICE and optionally clamping locally.

Also applies to: 99-101
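A local clamp, as a sketch only (the preferred fix remains a gt=0 constraint on the setting itself):

```python
import asyncio

from aiperf.common.environment import Environment


async def _probe_sleep() -> None:
    # Illustrative fallback: never sleep for a non-positive interval, even if the env is misconfigured.
    await asyncio.sleep(max(Environment.SERVICE.CONNECTION_PROBE_INTERVAL, 0.001))
```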
src/aiperf/common/environment.py (4)
481-487: Tighten BaseSettings extras handling.
Root model_config uses extra="allow". This can silently accept misspelled env-derived keys. Prefer extra="ignore" (or "forbid" if feasible) to fail fast on typos.

```diff
-    model_config = SettingsConfigDict(
-        env_prefix="AIPERF_",
-        env_file=".env",
-        env_file_encoding="utf-8",
-        extra="allow",
-    )
+    model_config = SettingsConfigDict(
+        env_prefix="AIPERF_",
+        env_file=".env",
+        env_file_encoding="utf-8",
+        extra="ignore",
+    )
```
277-334: Add numeric constraints to SERVICE timings to prevent invalid configs.
Several intervals/timeouts should be > 0 to avoid spins or instant timeouts (e.g., CONNECTION_PROBE_INTERVAL, HEARTBEAT_INTERVAL). Enforce with gt=0 Field constraints (and a sensible max if desired).

```diff
-    COMMAND_RESPONSE_TIMEOUT: float = Field(
+    COMMAND_RESPONSE_TIMEOUT: float = Field(
+        gt=0,
         default=30.0,
         description="Timeout in seconds for command responses",
     )
-    COMMS_REQUEST_TIMEOUT: float = Field(
+    COMMS_REQUEST_TIMEOUT: float = Field(
+        gt=0,
         default=90.0,
         description="Timeout in seconds for requests from req_clients to rep_clients",
     )
-    CONNECTION_PROBE_INTERVAL: float = Field(
+    CONNECTION_PROBE_INTERVAL: float = Field(
+        gt=0,
         default=0.1,
         description="Interval in seconds for connection probes while waiting for initial connection to the zmq message bus",
     )
-    CONNECTION_PROBE_TIMEOUT: float = Field(
+    CONNECTION_PROBE_TIMEOUT: float = Field(
+        gt=0,
         default=90.0,
         description="Maximum time in seconds to wait for connection probe response while waiting for initial connection to the zmq message bus",
     )
-    HEARTBEAT_INTERVAL: float = Field(
+    HEARTBEAT_INTERVAL: float = Field(
+        gt=0,
         default=5.0,
         description="Interval in seconds between heartbeat messages for component services",
     )
-    REGISTRATION_INTERVAL: float = Field(
+    REGISTRATION_INTERVAL: float = Field(
+        gt=0,
         default=1.0,
         description="Interval in seconds between registration attempts for component services",
     )
-    START_TIMEOUT: float = Field(
+    START_TIMEOUT: float = Field(
+        gt=0,
         default=30.0,
         description="Timeout in seconds for service start operations",
     )
-    TASK_CANCEL_TIMEOUT_SHORT: float = Field(
+    TASK_CANCEL_TIMEOUT_SHORT: float = Field(
+        gt=0,
         default=2.0,
         description="Maximum time in seconds to wait for simple tasks to complete when cancelling",
     )
```
126-133: Normalize DEFAULT_DCGM_ENDPOINTS to list[str].
parse_str_or_csv_list may return list[Any]; type here is str | list[str]. To avoid downstream branching/coercion, consider:
- Make the field Annotated[list[str], BeforeValidator(parse_str_or_csv_list)] (always a list), as sketched below.
- Adjust parse_str_or_csv_list to cast all elements to str.
Based on learnings (see src/aiperf/common/config/config_validators.py lines 48-73).
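A hedged sketch of the first option (the field default and the standalone parser shown here are placeholders, not the repo's exact code):

```python
from typing import Annotated, Any

from pydantic import BeforeValidator, Field
from pydantic_settings import BaseSettings


def parse_str_or_csv_list(value: Any) -> list[str]:
    """Stand-in for the repo's validator: accept a CSV string or a sequence, return list[str]."""
    if isinstance(value, str):
        return [item.strip() for item in value.split(",") if item.strip()]
    return [str(item) for item in value]


class _GPUSettings(BaseSettings):
    DEFAULT_DCGM_ENDPOINTS: Annotated[
        list[str], BeforeValidator(parse_str_or_csv_list)
    ] = Field(default_factory=list, description="Default DCGM exporter endpoints")
```

With this shape, downstream code can always iterate the field as a list without branching on str vs list.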
552-553: Consider a `.reload()` utility for test/runtime overrides.

A helper that rebuilds the singleton from env (and revalidates) avoids ad-hoc attribute mutation outside tests.

```python
def reload_environment() -> None:
    global Environment
    Environment = _Environment()
```

src/aiperf/timing/credit_issuing_strategy.py (1)
343-343: Guard against zero/negative progress interval.
If CREDIT_PROGRESS_REPORT_INTERVAL <= 0, this loop can starve the event loop. Clamp or enforce gt=0 in Environment.SERVICE (preferred).

```diff
-    await asyncio.sleep(Environment.SERVICE.CREDIT_PROGRESS_REPORT_INTERVAL)
+    await asyncio.sleep(max(0.01, Environment.SERVICE.CREDIT_PROGRESS_REPORT_INTERVAL))
```

src/aiperf/dataset/dataset_manager.py (1)
356-358: Handle timeout with a clearer service error.
If configuration times out, asyncio.TimeoutError bubbles up. Convert it to a service error with context.

```diff
-    await asyncio.wait_for(
-        self.dataset_configured.wait(),
-        timeout=Environment.DATASET.CONFIGURATION_TIMEOUT,
-    )
+    try:
+        await asyncio.wait_for(
+            self.dataset_configured.wait(),
+            timeout=Environment.DATASET.CONFIGURATION_TIMEOUT,
+        )
+    except asyncio.TimeoutError as e:
+        raise self._service_error(
+            f"Dataset configuration timed out after {Environment.DATASET.CONFIGURATION_TIMEOUT}s"
+        ) from e
```

tests/common/conftest.py (1)
39-46: Implementation is sound; verify the fixture naming is intentional.

The verification confirms bootstrap.py properly reads `Environment.SERVICE.DISABLE_UVLOOP` at lines 122-127 and toggles between uvloop and asyncio accordingly. The monkeypatch approach in the fixture correctly sets this flag for tests.

Consider whether the fixture name `service_config_no_uvloop` accurately reflects that it is toggling an environment variable rather than modifying the config object itself; a name like `disable_uvloop_env` may better communicate intent.
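A sketch of the suggested rename (the monkeypatch target mirrors what the review describes; treat the exact attribute path as an assumption):

```python
import pytest

from aiperf.common.environment import Environment


@pytest.fixture
def disable_uvloop_env(monkeypatch: pytest.MonkeyPatch):
    """Illustrative fixture: flips the env-backed flag rather than mutating a ServiceConfig object."""
    monkeypatch.setattr(Environment.SERVICE, "DISABLE_UVLOOP", True)
    yield
```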
src/aiperf/common/config/worker_config.py (1)

30-34: Avoid binding help text to runtime env at import time.

The Field description f-string snapshots Environment.WORKER.MAX_WORKERS_CAP at import; if the env differs between build and run, the help text can mislead. Consider a static reference (no f-string) or compute the help text lazily in CLI/help generation.
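One way to defer the lookup, as a sketch (hypothetical helper name; the real CLI wiring may differ):

```python
from aiperf.common.environment import Environment


def max_workers_help() -> str:
    """Build the help string when the CLI is constructed instead of at import time."""
    return (
        "Maximum number of workers to spawn "
        f"(capped at {Environment.WORKER.MAX_WORKERS_CAP})"
    )
```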
src/aiperf/controller/proxy_manager.py (1)
55-63: Confirm timeout units and narrow exception scope
- Ensure Environment.ZMQ.CONTEXT_TERM_TIMEOUT is a finite number in seconds (float/int). If it’s 0/None/negative, wait_for will misbehave. Add validation or a floor.
- Avoid catching BaseException; it swallows asyncio.CancelledError. Prefer Exception, and optionally handle asyncio.TimeoutError explicitly.
```diff
-        except BaseException as e:
+        except asyncio.TimeoutError as e:
+            self.warning(f"Timed out terminating ZMQ context after {Environment.ZMQ.CONTEXT_TERM_TIMEOUT}s: {e}")
+        except Exception as e:
             self.warning(f"Error terminating ZMQ context: {e}")
```
117-147: Patch Linux TCP symbols before invoking the factory.

In test_socket_factory_linux_specific_options, the `socket` attribute patch happens after `socket_factory(addr_info)` runs, so it doesn't influence the hasattr checks inside apply_to_socket. Move the patch before invoking socket_factory (or construct socket_factory after patching) to deterministically exercise those branches.

```diff
-    socket_factory = mock_connector_class.call_args[1]["socket_factory"]
-    with patch("socket.socket") as mock_socket_class:
+    socket_factory = mock_connector_class.call_args[1]["socket_factory"]
+    # Patch OS symbol before creating/using the socket to ensure the branch is taken
+    if has_attribute and expected_value is not None:
+        with patch.object(socket, attribute_name, tcp_option, create=True):
+            pass
+    with patch("socket.socket") as mock_socket_class:
         ...
-        socket_factory(addr_info)
+        socket_factory(addr_info)

-    if has_attribute and expected_value is not None:
-        with patch.object(socket, attribute_name, expected_value, create=True):
-            mock_socket.setsockopt.assert_any_call(
-                socket.SOL_TCP, tcp_option, expected_value
-            )
+    if has_attribute and expected_value is not None:
+        mock_socket.setsockopt.assert_any_call(
+            socket.SOL_TCP, tcp_option, expected_value
+        )
```

src/aiperf/workers/worker_manager.py (3)
71-79: Auto max-workers formula is fine; consider edge-proofing.

The min(int(cpu * factor) - 1, MAX_WORKERS_CAP) then max(..., 1) is sensible. Recommend logging the inputs (cpu_count, factor, cap) at trace level to aid tuning and diagnosing odd env values.
91-95: Validate min/max consistency.

If users set workers.min > workers.max, max is silently raised to min here. Consider validating and warning (or erroring) early to surface the misconfig.

```diff
-        self.max_workers = max(
+        if (self.service_config.workers.min or 1) > self.max_workers:
+            self.warning("workers.min exceeds workers.max; raising max to min")
+        self.max_workers = max(
             self.max_workers,
             self.service_config.workers.min or 1,
         )
```
149-160: Status timing math OK; minor readability tweak.

The repeated (time.time_ns() - (ts or 0)) / NANOS_PER_SECOND appears multiple times. Consider factoring it into a helper for clarity and to avoid subtle unit mixups.
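For example, a small helper along these lines (hypothetical name; it simply wraps the expression quoted above):

```python
import time

NANOS_PER_SECOND = 1_000_000_000  # mirrors the constant the module already imports


def seconds_since(ts_ns: int | None) -> float:
    """Elapsed wall-clock seconds since a nanosecond timestamp (None/0 treated as epoch)."""
    return (time.time_ns() - (ts_ns or 0)) / NANOS_PER_SECOND
```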
src/aiperf/metrics/metric_dicts.py (1)
71-76: Docstring out of sync with parameters.

to_display_dict now has show_internal and show_experimental, but the docstring only mentions show_internal. Update for accuracy.

```diff
-    Args:
-        registry: MetricRegistry class for looking up metric definitions
-        show_internal: If True, include experimental/internal metrics
+    Args:
+        registry: MetricRegistry class for metric definitions.
+        show_internal: Include metrics flagged INTERNAL when True.
+        show_experimental: Include metrics flagged EXPERIMENTAL when True.
```
33-48: Validate env-provided socket values.

Environment.HTTP.TCP_KEEPIDLE/INTVL/KEEPCNT/SO_RCVBUF/SO_SNDBUF/TCP_USER_TIMEOUT should be validated (positive, sensible ranges). Bad env values can raise or degrade performance.
Consider guarding with max(..., 1) where appropriate and logging when clamping.
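A sketch of the clamp-and-log idea (hypothetical helper, not the repo's code):

```python
import logging

logger = logging.getLogger(__name__)


def clamped_positive(name: str, value: int, minimum: int = 1) -> int:
    """Return value clamped to a sane minimum, warning when an env override is out of range."""
    if value < minimum:
        logger.warning("%s=%d is below %d; clamping", name, value, minimum)
        return minimum
    return value


# e.g. keepidle = clamped_positive("TCP_KEEPIDLE", Environment.HTTP.TCP_KEEPIDLE)
```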
71-73: Clarify limit_per_host semantics.

The comment says "0 will set to LIMIT," but in aiohttp 3.x, limit_per_host=0 means "no per-host limit," not "inherit LIMIT." Suggest updating the comment to avoid confusion.

```diff
-    LIMIT_PER_HOST = (
-        0  # Maximum number of concurrent connections per host (0 will set to LIMIT)
-    )
+    LIMIT_PER_HOST = 0  # Per-host limit; 0 disables per-host limiting (uses global LIMIT only)
```
78-82: Consider simplifying assertion logic for clarity.

The ternary expression is functional but hard to read. A more explicit assertion would improve maintainability:

```diff
-    assert (
-        captured_user_config.gpu_telemetry == expected_gpu_telemetry
-        if captured_user_config
-        else []
-    )
+    assert captured_user_config is not None, "Mock was not called"
+    assert captured_user_config.gpu_telemetry == expected_gpu_telemetry
```
122-127: Note: double negative in condition.

The condition `not Environment.SERVICE.DISABLE_UVLOOP` works correctly but uses a double negative. Consider renaming to `Environment.SERVICE.ENABLE_UVLOOP` in a future refactor for improved clarity.
69-84: Consider using direct default for consistency and simplicity.

The lazy resolution pattern (accepting `None` and resolving at runtime) adds complexity compared to using `Environment.ZMQ.PUSH_MAX_RETRIES` directly as the default value. This is inconsistent with line 99, which directly references `Environment.ZMQ.PUSH_RETRY_DELAY` without lazy resolution.

Consider simplifying to:

```diff
 async def _push_message(
     self,
     message: Message,
     retry_count: int = 0,
-    max_retries: int | None = None,
+    max_retries: int = Environment.ZMQ.PUSH_MAX_RETRIES,
 ) -> None:
     """Push a message to the socket. Will retry up to max_retries times.

     Args:
         message: Message to be sent must be a Message object
         retry_count: Current retry count
-        max_retries: Maximum number of times to retry pushing the message (defaults to Environment.ZMQ.PUSH_MAX_RETRIES)
+        max_retries: Maximum number of times to retry pushing the message
     """
-    if max_retries is None:
-        max_retries = Environment.ZMQ.PUSH_MAX_RETRIES
-
     try:
```

This matches the pattern used in command_handler_mixin.py (lines 169-172, 194) and reduces the maintenance burden.

src/aiperf/gpu_telemetry/telemetry_data_collector.py (1)
46-61: Consider using direct default for consistency and simplicity.

Similar to push_client.py, this lazy resolution pattern (accepting `None` and resolving in `__init__`) adds complexity compared to using `Environment.GPU.COLLECTION_INTERVAL` directly as the default. This is inconsistent with lines 76 and 122, which directly reference `Environment.GPU.REACHABILITY_TIMEOUT` without lazy resolution.

Consider simplifying to:

```diff
 def __init__(
     self,
     dcgm_url: str,
-    collection_interval: float | None = None,
+    collection_interval: float = Environment.GPU.COLLECTION_INTERVAL,
     record_callback: Callable[[list[TelemetryRecord], str], Awaitable[None] | None]
     | None = None,
     error_callback: Callable[[ErrorDetails, str], Awaitable[None] | None] | None = None,
     collector_id: str = "telemetry_collector",
 ) -> None:
     self._dcgm_url = dcgm_url
-    self._collection_interval = (
-        collection_interval
-        if collection_interval is not None
-        else Environment.GPU.COLLECTION_INTERVAL
-    )
+    self._collection_interval = collection_interval
```

This matches the simpler pattern used in command_handler_mixin.py and reduces the maintenance burden.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (59)
- src/aiperf/__main__.py (2 hunks)
- src/aiperf/cli_utils.py (2 hunks)
- src/aiperf/common/base_component_service.py (3 hunks)
- src/aiperf/common/bootstrap.py (3 hunks)
- src/aiperf/common/config/__init__.py (0 hunks)
- src/aiperf/common/config/cli_parameter.py (0 hunks)
- src/aiperf/common/config/config_defaults.py (0 hunks)
- src/aiperf/common/config/config_validators.py (1 hunks)
- src/aiperf/common/config/dev_config.py (0 hunks)
- src/aiperf/common/config/service_config.py (2 hunks)
- src/aiperf/common/config/worker_config.py (2 hunks)
- src/aiperf/common/constants.py (0 hunks)
- src/aiperf/common/environment.py (1 hunks)
- src/aiperf/common/logging.py (3 hunks)
- src/aiperf/common/mixins/command_handler_mixin.py (3 hunks)
- src/aiperf/common/mixins/message_bus_mixin.py (3 hunks)
- src/aiperf/common/mixins/task_manager_mixin.py (2 hunks)
- src/aiperf/common/protocols.py (3 hunks)
- src/aiperf/controller/base_service_manager.py (2 hunks)
- src/aiperf/controller/kubernetes_service_manager.py (3 hunks)
- src/aiperf/controller/multiprocess_service_manager.py (4 hunks)
- src/aiperf/controller/proxy_manager.py (2 hunks)
- src/aiperf/controller/system_controller.py (6 hunks)
- src/aiperf/dataset/__init__.py (0 hunks)
- src/aiperf/dataset/dataset_manager.py (2 hunks)
- src/aiperf/exporters/experimental_metrics_console_exporter.py (2 hunks)
- src/aiperf/exporters/internal_metrics_console_exporter.py (2 hunks)
- src/aiperf/gpu_telemetry/__init__.py (0 hunks)
- src/aiperf/gpu_telemetry/constants.py (0 hunks)
- src/aiperf/gpu_telemetry/telemetry_data_collector.py (4 hunks)
- src/aiperf/gpu_telemetry/telemetry_manager.py (4 hunks)
- src/aiperf/metrics/metric_dicts.py (4 hunks)
- src/aiperf/post_processors/record_export_results_processor.py (3 hunks)
- src/aiperf/records/record_processor_service.py (2 hunks)
- src/aiperf/records/records_manager.py (5 hunks)
- src/aiperf/timing/credit_issuing_strategy.py (2 hunks)
- src/aiperf/transports/aiohttp_client.py (1 hunks)
- src/aiperf/transports/http_defaults.py (3 hunks)
- src/aiperf/ui/dashboard/aiperf_textual_app.py (3 hunks)
- src/aiperf/ui/dashboard/progress_dashboard.py (2 hunks)
- src/aiperf/ui/dashboard/realtime_metrics_dashboard.py (2 hunks)
- src/aiperf/ui/dashboard/rich_log_viewer.py (2 hunks)
- src/aiperf/ui/tqdm_ui.py (2 hunks)
- src/aiperf/workers/worker.py (2 hunks)
- src/aiperf/workers/worker_manager.py (4 hunks)
- src/aiperf/zmq/__init__.py (0 hunks)
- src/aiperf/zmq/dealer_request_client.py (2 hunks)
- src/aiperf/zmq/pull_client.py (2 hunks)
- src/aiperf/zmq/push_client.py (3 hunks)
- src/aiperf/zmq/zmq_defaults.py (2 hunks)
- tests/clients/http/test_tcp_connector.py (4 hunks)
- tests/common/conftest.py (1 hunks)
- tests/config/test_dev_mode.py (1 hunks)
- tests/data_exporters/test_console_exporter.py (1 hunks)
- tests/gpu_telemetry/test_telemetry_manager.py (17 hunks)
- tests/post_processors/test_record_export_results_processor.py (4 hunks)
- tests/test_main.py (3 hunks)
- tests/ui/test_realtime_metrics_dashboard.py (2 hunks)
- tests/workers/test_worker_manager.py (3 hunks)
💤 Files with no reviewable changes (9)
- src/aiperf/common/config/cli_parameter.py
- src/aiperf/common/config/config_defaults.py
- src/aiperf/dataset/__init__.py
- src/aiperf/zmq/__init__.py
- src/aiperf/common/config/__init__.py
- src/aiperf/gpu_telemetry/__init__.py
- src/aiperf/gpu_telemetry/constants.py
- src/aiperf/common/constants.py
- src/aiperf/common/config/dev_config.py
🧰 Additional context used
🧬 Code graph analysis (15)
src/aiperf/transports/aiohttp_client.py (1)
src/aiperf/transports/http_defaults.py (2)
AioHttpDefaults (64-94)
get_default_kwargs (82-94)
tests/data_exporters/test_console_exporter.py (3)
tests/post_processors/test_record_export_results_processor.py (1)
service_config (56-58)
tests/conftest.py (1)
service_config (186-190)
src/aiperf/common/config/service_config.py (1)
ServiceConfig(28-160)
src/aiperf/timing/credit_issuing_strategy.py (1)
tests/utils/time_traveler.py (1)
sleep(37-43)
src/aiperf/workers/worker_manager.py (1)
tests/utils/time_traveler.py (2)
time (48-49)
time_ns (45-46)
tests/post_processors/test_record_export_results_processor.py (1)
tests/conftest.py (1)
mock_aiofiles_stringio(354-382)
src/aiperf/common/protocols.py (3)
src/aiperf/controller/base_service_manager.py (1)
wait_for_all_services_start (112-117)
src/aiperf/controller/kubernetes_service_manager.py (1)
wait_for_all_services_start (81-93)
src/aiperf/controller/multiprocess_service_manager.py (1)
wait_for_all_services_start(217-226)
src/aiperf/gpu_telemetry/telemetry_manager.py (1)
tests/utils/time_traveler.py (1)
sleep(37-43)
tests/common/conftest.py (2)
tests/conftest.py (1)
service_config (186-190)
src/aiperf/common/config/service_config.py (1)
ServiceConfig(28-160)
src/aiperf/common/environment.py (1)
src/aiperf/common/config/config_validators.py (2)
parse_service_types (77-86)
parse_str_or_csv_list (49-74)
src/aiperf/zmq/push_client.py (1)
tests/utils/time_traveler.py (1)
sleep(37-43)
src/aiperf/records/records_manager.py (1)
tests/utils/time_traveler.py (1)
sleep(37-43)
src/aiperf/controller/system_controller.py (2)
src/aiperf/cli_utils.py (1)
print_developer_mode_warning (127-144)
src/aiperf/common/config/config_defaults.py (1)
OutputDefaults(129-137)
src/aiperf/controller/base_service_manager.py (3)
src/aiperf/common/protocols.py (1)
wait_for_all_services_start (465-469)
src/aiperf/controller/kubernetes_service_manager.py (1)
wait_for_all_services_start (81-93)
src/aiperf/controller/multiprocess_service_manager.py (1)
wait_for_all_services_start(217-226)
tests/ui/test_realtime_metrics_dashboard.py (2)
src/aiperf/common/config/service_config.py (1)
ServiceConfig (28-160)
src/aiperf/ui/dashboard/realtime_metrics_dashboard.py (1)
_should_skip(63-73)
src/aiperf/common/base_component_service.py (1)
src/aiperf/common/mixins/command_handler_mixin.py (1)
send_command_and_wait_for_response(169-188)
🪛 Ruff (0.14.1)
src/aiperf/common/mixins/task_manager_mixin.py
39-39: Unused method argument: timeout
(ARG002)
src/aiperf/controller/multiprocess_service_manager.py
220-220: Unused method argument: timeout_seconds
(ARG002)
src/aiperf/controller/kubernetes_service_manager.py
70-70: Unused method argument: timeout_seconds
(ARG002)
84-84: Unused method argument: timeout_seconds
(ARG002)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
- GitHub Check: integration-tests (ubuntu-latest, 3.12)
- GitHub Check: integration-tests (ubuntu-latest, 3.11)
- GitHub Check: integration-tests (ubuntu-latest, 3.10)
- GitHub Check: build (ubuntu-latest, 3.10)
- GitHub Check: build (macos-latest, 3.10)
- GitHub Check: build (ubuntu-latest, 3.11)
- GitHub Check: build (macos-latest, 3.11)
- GitHub Check: build (ubuntu-latest, 3.12)
- GitHub Check: build (macos-latest, 3.12)
🔇 Additional comments (60)
src/aiperf/common/config/config_validators.py (1)
12-12: LGTM! More specific import improves clarity.
The import path change from a general module to a specific submodule makes the dependency more explicit and aligns with the PR's organizational improvements.
src/aiperf/common/mixins/message_bus_mixin.py (1)
13-13: LGTM: switch to Environment-driven config.
Consistent with the PR goals; import is correct.
src/aiperf/common/environment.py (1)
535-549: Validator behavior is sound.
Silently downgrading DEV SHOW_* flags when DEV.MODE is false with a warning is a good safeguard.
src/aiperf/timing/credit_issuing_strategy.py (1)
8-11: LGTM: replaced constant with Environment and kept NANOS_PER_SECOND import.
Matches the new configuration approach.
src/aiperf/dataset/dataset_manager.py (1)
22-22: LGTM: move dataset timeout source to Environment.
Keeps behavior configurable without code changes.
tests/workers/test_worker_manager.py (2)
15-15: Good switch to env-driven thresholds.
Tests now follow Environment.WORKER.*; keeps expectations aligned with runtime config.
190-205: Boundary and recovery checks match implementation semantics
- Using cpu_usage == Environment.WORKER.HIGH_LOAD_CPU_USAGE to assert “no warning” matches the code’s strict “>” comparison.
- Recovery parametrization using Environment.WORKER.HIGH_LOAD_RECOVERY_TIME validates both sides of the threshold.
If HIGH_LOAD_CPU_USAGE becomes configurable per test run, consider parametrizing it once via a fixture to avoid repetition.
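For instance, a shared fixture (illustrative) so the threshold is read in one place:

```python
import pytest

from aiperf.common.environment import Environment


@pytest.fixture
def high_load_cpu_threshold() -> float:
    """Single source for the CPU threshold used by the boundary and recovery tests."""
    return Environment.WORKER.HIGH_LOAD_CPU_USAGE
```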
tests/clients/http/test_tcp_connector.py (1)
38-38: Defaults assertion looks good.
Verifying limit against Environment.HTTP.CONNECTION_LIMIT keeps the test resilient to env changes.
src/aiperf/workers/worker.py (1)
86-87: LGTM on env-driven health interval.
Reading Environment.WORKER.HEALTH_CHECK_INTERVAL centralizes tuning.
src/aiperf/metrics/metric_dicts.py (1)
149-151: LGTM on env-driven initial capacity.
MetricArray now sources initial_capacity from Environment.METRICS.ARRAY_INITIAL_CAPACITY and validates > 0. Looks good.
src/aiperf/transports/http_defaults.py (1)
81-94: New helper API looks good.
AioHttpDefaults.get_default_kwargs centralizes aiohttp defaults. Ensure create_tcp_connector uses this to keep tests and runtime in sync.
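A hedged usage sketch (the create_tcp_connector signature and the assumption that get_default_kwargs returns TCPConnector keyword arguments are not taken from the repo):

```python
import aiohttp

from aiperf.transports.http_defaults import AioHttpDefaults


def create_tcp_connector(**overrides) -> aiohttp.TCPConnector:
    """Build a connector from the centralized defaults, allowing per-call overrides."""
    kwargs = AioHttpDefaults.get_default_kwargs()
    kwargs.update(overrides)
    return aiohttp.TCPConnector(**kwargs)
```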
src/aiperf/transports/aiohttp_client.py (1)
267-268: LGTM! Centralized configuration approach.
The change successfully moves default TCP connector configuration to AioHttpDefaults.get_default_kwargs(), aligning with the PR's environment-based configuration strategy while preserving the socket_factory injection logic.
src/aiperf/ui/dashboard/aiperf_textual_app.py (2)
85-86: LGTM! Environment-based developer mode detection.
The switch to Environment.DEV.MODE centralizes developer mode configuration and aligns with the PR's objectives.
142-145: Good documentation improvement.
The added comments clarify the importance of signal forwarding to prevent process hangs.
src/aiperf/ui/dashboard/realtime_metrics_dashboard.py (1)
72-73: LGTM! Centralized internal metrics visibility control.
The change to Environment.DEV.SHOW_INTERNAL_METRICS aligns with the environment-based configuration approach and removes the dependency on per-service developer flags.
src/aiperf/ui/tqdm_ui.py (1)
42-42: LGTM! Environment-based UI update threshold.
The change to Environment.UI.MIN_UPDATE_PERCENT enables runtime configuration of progress bar update frequency.
src/aiperf/zmq/pull_client.py (1)
81-82: LGTM! Environment-based concurrency configuration.
The change to Environment.ZMQ.PULL_MAX_CONCURRENCY enables runtime configuration of pull client concurrency limits.
tests/test_main.py (1)
28-41: LGTM! Environment-based test expectations.
The test parameters correctly use Environment.GPU.DEFAULT_DCGM_ENDPOINTS for default GPU telemetry endpoints, aligning with the environment-based configuration approach.
src/aiperf/exporters/internal_metrics_console_exporter.py (1)
30-32: LGTM! Environment-based internal metrics control.
The change to Environment.DEV.MODE and Environment.DEV.SHOW_INTERNAL_METRICS centralizes developer-mode feature flags and removes the dependency on service-level configuration.
src/aiperf/common/base_component_service.py (2)
56-56: LGTM! Environment-based heartbeat interval.
The change to Environment.SERVICE.HEARTBEAT_INTERVAL enables runtime configuration of heartbeat frequency.
82-87: LGTM! Environment-based registration configuration.
The changes to Environment.SERVICE.REGISTRATION_MAX_ATTEMPTS and Environment.SERVICE.REGISTRATION_INTERVAL provide consistent, centralized control over service registration behavior.
src/aiperf/records/record_processor_service.py (1)
13-13: LGTM! Clean migration to environment-based configuration.
The replacement of DEFAULT_PULL_CLIENT_MAX_CONCURRENCY with Environment.ZMQ.PULL_MAX_CONCURRENCY aligns with the PR's goal of centralizing configuration via the Environment singleton.
Also applies to: 62-62
src/aiperf/cli_utils.py (1)
127-144: LGTM! Well-structured developer mode warning.
The new print_developer_mode_warning() function provides clear, styled feedback to users when developer mode is active.
src/aiperf/exporters/experimental_metrics_console_exporter.py (1)
10-10: LGTM! Clean migration to environment-based feature flags.
The gating logic now correctly uses Environment.DEV.MODE and Environment.DEV.SHOW_EXPERIMENTAL_METRICS, replacing the previous scattered configuration approach.
Also applies to: 29-31
tests/post_processors/test_record_export_results_processor.py (1)
18-18: LGTM! Test correctly updated for environment-based configuration.
The test now patches Environment.DEV.* attributes and uses Environment.RECORD.EXPORT_BATCH_SIZE, aligning with the centralized configuration approach.
Also applies to: 183-186, 600-600, 613-613, 618-618
src/aiperf/controller/multiprocess_service_manager.py (1)
15-15: LGTM! Clean migration to environment-based timeouts.
The timeout values are now sourced from Environment.SERVICE.*, providing centralized configuration for service lifecycle management.
Also applies to: 138-138, 206-206
tests/ui/test_realtime_metrics_dashboard.py (1)
4-4: LGTM! Clean test refactoring to environment-based configuration.
The test now uses patch.object to control Environment.DEV.SHOW_INTERNAL_METRICS, replacing the fixture-based approach with DeveloperConfig. This aligns with the centralized environment configuration pattern.
Also applies to: 9-9, 51-62
tests/data_exporters/test_console_exporter.py (1)
130-130: LGTM! Simplified test configuration.
The test now relies on default ServiceConfig() behavior, which sources developer mode settings from the Environment singleton rather than explicit configuration.
src/aiperf/__main__.py (1)
7-7: LGTM! Clean migration to environment-based GPU configuration.
The DCGM endpoint defaults are now sourced from Environment.GPU.DEFAULT_DCGM_ENDPOINTS, consistent with the centralized configuration approach.
Also applies to: 18-18
src/aiperf/common/logging.py (2)
34-36: LGTM: Queue size now environment-configurable.
The migration from a hardcoded constant to Environment.LOGGING.QUEUE_MAXSIZE provides better configurability for the logging queue.
102-109: LGTM: Debug/trace service filtering now environment-driven.
The migration to Environment.DEV.TRACE_SERVICES and Environment.DEV.DEBUG_SERVICES aligns with the centralized configuration approach.
src/aiperf/ui/dashboard/progress_dashboard.py (1)
81-81: LGTM: UI refresh rate now environment-configurable.
Replacing the hardcoded SPINNER_REFRESH_RATE with Environment.UI.SPINNER_REFRESH_RATE enables runtime configuration of UI timing.
src/aiperf/common/bootstrap.py (1)
56-57: LGTM: YAPPI profiling now environment-controlled.
The migration to Environment.DEV.ENABLE_YAPPI provides a cleaner way to control profiling behavior.
src/aiperf/post_processors/record_export_results_processor.py (2)
46-51: LGTM: Metric visibility flags now environment-driven.
The migration to Environment.DEV.SHOW_INTERNAL_METRICS and the introduction of Environment.DEV.SHOW_EXPERIMENTAL_METRICS provide flexible control over metric visibility in exports.
58-58: LGTM: Export batch size now configurable.
Using Environment.RECORD.EXPORT_BATCH_SIZE enables tuning of batch processing performance.
src/aiperf/ui/dashboard/rich_log_viewer.py (1)
109-109: LGTM: Log refresh interval now environment-configurable.
The migration to Environment.UI.LOG_REFRESH_INTERVAL enables tuning of log consumption frequency.
src/aiperf/zmq/dealer_request_client.py (1)
130-130: LGTM: Request timeout now environment-configurable.
The migration to Environment.SERVICE.COMMS_REQUEST_TIMEOUT enables runtime configuration of communication timeouts. This is a public API change, but it should be transparent if the environment default matches the previous constant value.
tests/gpu_telemetry/test_telemetry_manager.py (1)
53-53: LGTM: Test suite updated for environment-based configuration.
All test assertions now reference Environment.GPU.DEFAULT_DCGM_ENDPOINTS, ensuring tests validate the centralized configuration system.
src/aiperf/gpu_telemetry/telemetry_manager.py (4)
100-111: LGTM: Endpoint aggregation now environment-driven.
The migration to Environment.GPU.DEFAULT_DCGM_ENDPOINTS for default endpoint handling maintains the existing deduplication and ordering logic while enabling centralized configuration.
113-113: LGTM: Collection interval now configurable.
Using Environment.GPU.COLLECTION_INTERVAL enables runtime tuning of telemetry collection frequency.
206-210: LGTM: Reachable defaults computation updated.
Correctly references Environment.GPU.DEFAULT_DCGM_ENDPOINTS for filtering reachable default endpoints.
295-295: LGTM: Shutdown delay now configurable.
The migration to Environment.GPU.SHUTDOWN_DELAY provides flexibility for adjusting the delay before service shutdown.
src/aiperf/zmq/zmq_defaults.py (1)
4-4: LGTM! Clean migration to Environment-based configuration.
The ZMQ socket defaults are now sourced from the centralized Environment configuration, allowing runtime tunability. The updated docstring clearly documents this behavior change.
Also applies to: 32-42
src/aiperf/controller/system_controller.py (4)
84-86: LGTM! Developer mode warning at initialization.
The developer mode check now uses Environment.DEV.MODE, providing a centralized way to gate developer-only features.
211-211: LGTM! Profile timeouts now environment-driven.
The profile configuration and start timeouts are now sourced from Environment.SERVICE, enabling runtime configuration.
Also applies to: 225-225
412-414: LGTM! Record processor scaling now environment-driven.
The scaling factor is sourced from Environment.RECORD.PROCESSOR_SCALE_FACTOR, making it configurable at runtime.
564-566: LGTM! Developer mode warning on exit.
Consistent with the initialization warning, this ensures developers are reminded that developer mode is active throughout the execution lifecycle.
src/aiperf/controller/base_service_manager.py (1)
8-8: LGTM! Service manager timeouts now environment-driven.
The abstract method signatures now use Environment.SERVICE.REGISTRATION_TIMEOUT and Environment.SERVICE.START_TIMEOUT, ensuring consistent timeout defaults across all service manager implementations.
Also applies to: 104-116
src/aiperf/common/protocols.py (1)
13-13: LGTM! Protocol timeouts now environment-driven.
The protocol definitions for RequestClientProtocol and ServiceManagerProtocol now use Environment-based timeout defaults, ensuring all implementing classes inherit consistent, configurable behavior.
Also applies to: 170-170, 462-462, 468-468
src/aiperf/controller/kubernetes_service_manager.py (1)
10-10: LGTM! Kubernetes service manager aligned with Environment-based configuration.
The method signatures now use Environment.SERVICE.REGISTRATION_TIMEOUT and Environment.SERVICE.START_TIMEOUT. The unused-parameter warnings from static analysis are expected since these are stub methods that raise NotImplementedError.
Also applies to: 67-93
tests/config/test_dev_mode.py (1)
9-25: LGTM! Tests updated for Environment-based configuration.
The tests now properly reload the environment module after setting environment variables via monkeypatch, then assert against Environment.DEV.MODE. This correctly validates the environment-driven developer mode configuration.
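A sketch of that pattern (the environment-variable name below is an assumption; the module path and Environment.DEV.MODE come from the diff):

```python
import importlib

import pytest


def test_dev_mode_enabled(monkeypatch: pytest.MonkeyPatch) -> None:
    monkeypatch.setenv("AIPERF_DEV_MODE", "1")  # exact variable name is an assumption
    import aiperf.common.environment as environment

    importlib.reload(environment)  # rebuild the module-level singleton from the patched env
    assert environment.Environment.DEV.MODE is True
```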
9-9: LGTM! ServiceConfig migrated from BaseSettings to BaseConfig.The change from
BaseSettingstoBaseConfigaligns with the shift to centralized environment-driven configuration via theEnvironmentsingleton, removing the need for per-config environment variable loading.Also applies to: 28-28
src/aiperf/zmq/push_client.py (1)
99-99: LGTM!The direct use of
Environment.ZMQ.PUSH_RETRY_DELAYis correct and follows best practices for environment-driven configuration.src/aiperf/records/records_manager.py (3)
82-82: LGTM!The direct use of
Environment.ZMQ.PULL_MAX_CONCURRENCYproperly centralizes the concurrency configuration.
378-380: LGTM!Using
Environment.RECORD.PROGRESS_REPORT_INTERVALfor the background task interval correctly externalizes the reporting frequency configuration.
426-426: LGTM!The use of
Environment.UI.REALTIME_METRICS_INTERVALcorrectly makes the realtime metrics update interval configurable via environment settings.src/aiperf/common/mixins/command_handler_mixin.py (2)
169-172: LGTM! Excellent pattern.The direct default
timeout: float = Environment.SERVICE.COMMAND_RESPONSE_TIMEOUTis clean and straightforward. This pattern avoids the complexity of lazy resolution while achieving the same goal of environment-driven configuration.
194-194: LGTM!Consistent use of
Environment.SERVICE.COMMAND_RESPONSE_TIMEOUTas the default timeout value.src/aiperf/gpu_telemetry/telemetry_data_collector.py (2)
76-76: LGTM!The direct use of
Environment.GPU.REACHABILITY_TIMEOUTcorrectly centralizes the timeout configuration for the HTTP client.
122-122: LGTM!Consistent use of
Environment.GPU.REACHABILITY_TIMEOUTfor the temporary session timeout.
|
nice! I like the consolidation, makes it easier to find constants in the future.
Summary by CodeRabbit
Release Notes
New Features
Refactor