
Commit a98a644

Add over saturation constraint (#438)
## Summary

This PR adds over-saturation stopping to the GuideLLM CLI. It's based on the OSD (Over-Saturation Detection) algorithm we developed and evaluated at Jounce. Use `--stop-over-saturated` or `--stop-osd` to enable.

## Details

This PR adds:

- [x] Over-saturation stopping (`--stop-over-saturated`)
- [x] Comprehensive OSD unit tests

## Test Plan

- [x] Currently, only unit tests
- [x] When #440 lands, we'll enable its over-saturation stopping E2E test

---

- [x] "I certify that all code in this PR is my own, except as noted below."

## Use of AI

- [x] Includes AI-assisted code completion
- [x] Includes code generated by an AI application
- [x] Includes AI-generated tests (NOTE: AI written tests should have a docstring that includes `## WRITTEN BY AI ##`)
2 parents 76e1da0 + 913a601 commit a98a644

24 files changed (+3859, -1171 lines)

README.md

Lines changed: 2 additions & 0 deletions
@@ -234,6 +234,7 @@ guidellm benchmark \
   --warmup 0.1 \
   --cooldown 0.1 \
   --max-errors 5
+  --detect-saturation
 ```
 
 **Key parameters:**
@@ -243,6 +244,7 @@ guidellm benchmark \
 - `--max-seconds`: Maximum duration in seconds for each benchmark before automatic termination
 - `--max-requests`: Maximum number of requests per benchmark before automatic termination
 - `--max-errors`: Maximum number of individual errors before stopping the benchmark entirely
+- `--detect-saturation`: Enable over-saturation detection to automatically stop benchmarks when the model becomes over-saturated (see also `--over-saturation` for more advanced control)
 
 ## Development and Contribution
 

docs/guides/index.md

Lines changed: 8 additions & 0 deletions
@@ -60,4 +60,12 @@ Whether you're interested in understanding the system architecture, exploring su
 
   [:octicons-arrow-right-24: SLO Guide](service_level_objectives.md)
 
+- :material-stop-circle-outline:{ .lg .middle } Over-Saturation Stopping
+
+  ______________________________________________________________________
+
+  Automatically detect and stop benchmarks when models become over-saturated to prevent wasted compute resources and ensure valid results.
+
+  [:octicons-arrow-right-24: Over-Saturation Guide](over_saturation_stopping.md)
+
 </div>
docs/guides/over_saturation_stopping.md

Lines changed: 138 additions & 0 deletions

# Over-Saturation Stopping

GuideLLM provides over-saturation detection (OSD) to automatically stop benchmarks when a model becomes over-saturated. This feature helps prevent wasted compute resources and ensures that benchmark results remain valid by detecting when the response rate can no longer keep up with the request rate.

## What is Over-Saturation?

Over-saturation occurs when an LLM inference server receives requests faster than it can process them, causing a queue to build up. As the queue grows, the server takes progressively longer to start handling each request, degrading performance metrics. When a benchmarking tool over-saturates an LLM inference server, the metrics it measures become so skewed that they are useless.

Think of it like a cashier getting flustered during a sudden rush: as the line (the load) grows, the cashier falls further behind, the line keeps lengthening, and there is no room for additional customers. For a benchmark, the equivalent is costly machine time spent producing invalid results; automatically detecting over-saturation and stopping the run prevents that waste.
## How It Works

GuideLLM's Over-Saturation Detection (OSD) algorithm uses statistical slope detection to identify when a model becomes over-saturated. The algorithm tracks two key metrics over time:

1. **Concurrent Requests**: The number of requests being processed simultaneously
2. **Time-to-First-Token (TTFT)**: The latency for the first token of each response

For each metric, the algorithm:

- Maintains a sliding window of recent data points
- Calculates the linear regression slope using online statistics
- Computes the margin of error (MOE) using t-distribution confidence intervals
- Detects positive slopes with low MOE, indicating degradation

Over-saturation is detected when:

- Both concurrent requests and TTFT show statistically significant positive slopes
- The minimum duration threshold has been met
- Sufficient data points are available for reliable slope estimation

When over-saturation is detected, the constraint automatically stops request queuing and optionally stops processing of existing requests, preventing further resource waste.
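The slope test above can be sketched in plain Python. The following is an illustrative simplification, not GuideLLM's implementation: it recomputes the regression over the whole window rather than using online statistics, it approximates the t critical value with a normal quantile, and the way `moe_threshold` enters the comparison is one plausible reading of the description above.

```python
from statistics import NormalDist


def slope_with_moe(points, confidence=0.95):
    """Least-squares slope of (time, value) points and its margin of error.

    Uses a normal approximation to the t critical value (an assumption;
    the t-distribution matters most for small windows).
    """
    n = len(points)
    if n < 3:
        return None  # not enough data for a slope and its standard error
    xs, ys = zip(*points)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    # Residual variance of the fit -> standard error of the slope estimate
    resid = sum((y - (y_mean + slope * (x - x_mean))) ** 2 for x, y in zip(xs, ys))
    se = (resid / (n - 2) / sxx) ** 0.5
    crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    return slope, crit * se


def looks_over_saturated(concurrent, ttft, moe_threshold=2.0):
    """Fire only when BOTH series show a confidently positive slope."""
    for series in (concurrent, ttft):
        est = slope_with_moe(series)
        if est is None:
            return False
        slope, moe = est
        # Lower moe_threshold makes this easier to satisfy (more sensitive)
        if slope <= 0 or slope <= moe_threshold * moe:
            return False
    return True
```

With steadily rising concurrency and TTFT series, `looks_over_saturated` returns `True`; a flat trend in either metric keeps it `False`.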
## Usage

### Basic Usage

Enable over-saturation detection with default settings:

```bash
guidellm benchmark \
  --target http://localhost:8000 \
  --profile throughput \
  --rate 10 \
  --detect-saturation
```

### Advanced Configuration

Configure detection parameters using a JSON dictionary:

```bash
guidellm benchmark \
  --target http://localhost:8000 \
  --profile concurrent \
  --rate 16 \
  --over-saturation '{"enabled": true, "min_seconds": 60, "max_window_seconds": 300, "moe_threshold": 1.5}'
```
## Configuration Options

The following parameters can be configured when enabling over-saturation detection:

- **`enabled`** (bool, default: `True`): Whether to stop the benchmark if over-saturation is detected
- **`min_seconds`** (float, default: `30.0`): Minimum seconds before checking for over-saturation. This prevents false positives during the initial warm-up phase.
- **`max_window_seconds`** (float, default: `120.0`): Maximum time window in seconds for data retention. Older data points are automatically pruned to maintain bounded memory usage.
- **`moe_threshold`** (float, default: `2.0`): Margin of error threshold for slope detection. Lower values make detection more sensitive to degradation.
- **`minimum_ttft`** (float, default: `2.5`): Minimum TTFT threshold in seconds for violation counting. Only TTFT values above this threshold are counted as violations.
- **`maximum_window_ratio`** (float, default: `0.75`): Maximum window size as a ratio of total requests. Limits memory usage by capping the number of tracked requests.
- **`minimum_window_size`** (int, default: `5`): Minimum data points required for slope estimation. Ensures statistical reliability before making detection decisions.
- **`confidence`** (float, default: `0.95`): Statistical confidence level for t-distribution calculations (0-1). Higher values require stronger evidence before detecting over-saturation.
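Because `--over-saturation` takes a raw JSON string, scripted benchmark matrices can benefit from a small helper that merges overrides into the documented defaults and rejects typo'd keys. A minimal sketch (the helper and its name are our own, not part of GuideLLM; the default values come from the table above):

```python
import json

# Documented defaults for over-saturation detection (from the table above)
OSD_DEFAULTS = {
    "enabled": True,
    "min_seconds": 30.0,
    "max_window_seconds": 120.0,
    "moe_threshold": 2.0,
    "minimum_ttft": 2.5,
    "maximum_window_ratio": 0.75,
    "minimum_window_size": 5,
    "confidence": 0.95,
}


def over_saturation_arg(**overrides):
    """Build the JSON value for --over-saturation, rejecting unknown keys."""
    unknown = set(overrides) - set(OSD_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown over-saturation option(s): {sorted(unknown)}")
    return json.dumps({**OSD_DEFAULTS, **overrides})


print(over_saturation_arg(min_seconds=60, moe_threshold=1.5))
```

The printed string can be passed directly as the value of `--over-saturation` in a shell script or subprocess call.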
## Use Cases

Over-saturation detection is particularly useful in the following scenarios:

### Stress Testing and Capacity Planning

When testing how your system handles increasing load, over-saturation detection automatically stops benchmarks once the system can no longer keep up, preventing wasted compute time on invalid results.

```bash
guidellm benchmark \
  --target http://localhost:8000 \
  --profile sweep \
  --rate 5 \
  --detect-saturation
```

### Cost-Effective Benchmarking

When running large-scale benchmark matrices across multiple models, GPUs, and configurations, over-saturation detection can significantly reduce costs by stopping invalid runs early.

### Finding Safe Operating Ranges

Use over-saturation detection to identify the maximum sustainable throughput for your deployment, helping you set appropriate rate limits and capacity planning targets.
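The intuition behind finding a safe operating range can be shown with a toy simulation (our own illustration, not GuideLLM's detector): step through candidate request rates and stop at the first one where queued work grows without bound, which is exactly the signal over-saturation detection automates during a sweep.

```python
def simulate_concurrency(rate, capacity, seconds=120):
    """Queue depth per second for a server completing `capacity` req/s."""
    queued = 0.0
    series = []
    for _ in range(seconds):
        queued = max(0.0, queued + rate - capacity)  # arrivals minus completions
        series.append(queued)
    return series


def is_saturated(series, window=30):
    """A rising tail means the queue is growing without bound."""
    recent = series[-window:]
    return recent[-1] > recent[0]


def max_sustainable_rate(capacity, rates):
    """Highest tested rate that does not saturate the toy server."""
    best = None
    for rate in rates:
        if is_saturated(simulate_concurrency(rate, capacity)):
            break
        best = rate
    return best


print(max_sustainable_rate(capacity=10.0, rates=[4, 6, 8, 10, 12, 14]))
```

In the toy model, any rate above the service capacity makes the queue climb linearly, so the sweep stops at the first saturating rate.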
## Interpreting Results

When over-saturation detection is enabled, the benchmark output includes metadata about the detection state. This metadata is available in the scheduler action metadata and includes:

- **`is_over_saturated`** (bool): Whether over-saturation was detected at the time of evaluation
- **`concurrent_slope`** (float): The calculated slope for concurrent requests
- **`concurrent_slope_moe`** (float): The margin of error for the concurrent requests slope
- **`concurrent_n`** (int): The number of data points used for concurrent requests slope calculation
- **`ttft_slope`** (float): The calculated slope for TTFT
- **`ttft_slope_moe`** (float): The margin of error for the TTFT slope
- **`ttft_n`** (int): The number of data points used for TTFT slope calculation
- **`ttft_violations`** (int): The count of TTFT values exceeding the minimum threshold

These metrics can help you understand why over-saturation was detected and fine-tune the detection parameters if needed.
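When tuning parameters, it can help to condense the raw metadata into a one-line verdict. A sketch, assuming the metadata is available as a plain dict keyed by the field names listed above (the helper itself is our own):

```python
def summarize_osd(meta):
    """One-line summary of the detector state from its metadata dict."""
    verdict = "OVER-SATURATED" if meta["is_over_saturated"] else "healthy"
    return (
        f"{verdict}: concurrency slope {meta['concurrent_slope']:+.3f} "
        f"(MOE {meta['concurrent_slope_moe']:.3f}, n={meta['concurrent_n']}), "
        f"TTFT slope {meta['ttft_slope']:+.3f} "
        f"(MOE {meta['ttft_slope_moe']:.3f}, n={meta['ttft_n']}), "
        f"{meta['ttft_violations']} TTFT violations"
    )


# Hypothetical metadata values for illustration only
example = {
    "is_over_saturated": True,
    "concurrent_slope": 0.42, "concurrent_slope_moe": 0.05, "concurrent_n": 120,
    "ttft_slope": 0.013, "ttft_slope_moe": 0.004, "ttft_n": 120,
    "ttft_violations": 37,
}
print(summarize_osd(example))
```

A slope whose MOE is small relative to its magnitude, in both series, is the signature of genuine over-saturation rather than transient noise.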
## Example: Complete Benchmark with Over-Saturation Detection

```bash
guidellm benchmark \
  --target http://localhost:8000 \
  --profile concurrent \
  --rate 16 \
  --data "prompt_tokens=256,output_tokens=128" \
  --max-seconds 300 \
  --over-saturation '{"enabled": true, "min_seconds": 30, "max_window_seconds": 120}' \
  --outputs json,html
```

This example:

- Runs a concurrent benchmark with 16 simultaneous requests
- Uses synthetic data with 256 prompt tokens and 128 output tokens
- Enables over-saturation detection with custom timing parameters
- Sets a maximum duration of 300 seconds (as a fallback)
- Outputs results in both JSON and HTML formats
## Additional Resources

For more in-depth information about over-saturation detection, including the algorithm development, evaluation metrics, and implementation details, see the following Red Hat Developer blog posts:

- [Reduce LLM benchmarking costs with oversaturation detection](https://developers.redhat.com/articles/2025/11/18/reduce-llm-benchmarking-costs-oversaturation-detection) - An introduction to the problem of over-saturation and why it matters for LLM benchmarking
- [Defining success: Evaluation metrics and data augmentation for oversaturation detection](https://developers.redhat.com/articles/2025/11/20/oversaturation-detection-evaluation-metrics) - How to evaluate the performance of an OSD algorithm through custom metrics, dataset labeling, and load augmentation techniques
- [Building an oversaturation detector with iterative error analysis](https://developers.redhat.com/articles/2025/11/24/building-oversaturation-detector-iterative-error-analysis) - A detailed walkthrough of how the OSD algorithm was built

src/guidellm/__main__.py

Lines changed: 21 additions & 1 deletion
@@ -384,7 +384,27 @@ def benchmark():
     default=BenchmarkGenerativeTextArgs.get_default("max_global_error_rate"),
     help="Maximum global error rate across all benchmarks.",
 )
-def run(**kwargs):
+@click.option(
+    "--over-saturation",
+    "over_saturation",
+    callback=cli_tools.parse_json,
+    default=None,
+    help=(
+        "Enable over-saturation detection. "
+        "Pass a JSON dict with configuration "
+        '(e.g., \'{"enabled": true, "min_seconds": 30}\'). '
+        "Defaults to None (disabled)."
+    ),
+)
+@click.option(
+    "--detect-saturation",
+    "--default-over-saturation",
+    "over_saturation",
+    callback=cli_tools.parse_json,
+    flag_value='{"enabled": true}',
+    help="Enable over-saturation detection with default settings.",
+)
+def run(**kwargs):  # noqa: C901
     # Only set CLI args that differ from click defaults
     kwargs = cli_tools.set_if_not_default(click.get_current_context(), **kwargs)

src/guidellm/benchmark/entrypoints.py

Lines changed: 4 additions & 0 deletions
@@ -323,6 +323,7 @@ async def resolve_profile(
     max_errors: int | None,
     max_error_rate: float | None,
     max_global_error_rate: float | None,
+    over_saturation: dict[str, Any] | None = None,
     console: Console | None = None,
 ) -> Profile:
     """
@@ -343,6 +344,7 @@ async def resolve_profile(
     :param max_errors: Maximum number of errors before stopping
     :param max_error_rate: Maximum error rate threshold before stopping
     :param max_global_error_rate: Maximum global error rate threshold before stopping
+    :param over_saturation: Over-saturation detection configuration (dict)
     :param console: Console instance for progress reporting, or None
     :return: Configured Profile instance ready for benchmarking
     :raises ValueError: If constraints are provided with a pre-configured Profile
@@ -359,6 +361,7 @@ async def resolve_profile(
         "max_errors": max_errors,
         "max_error_rate": max_error_rate,
         "max_global_error_rate": max_global_error_rate,
+        "over_saturation": over_saturation,
     }.items():
         if val is not None:
             constraints[key] = val
@@ -500,6 +503,7 @@ async def benchmark_generative_text(
         max_errors=args.max_errors,
         max_error_rate=args.max_error_rate,
         max_global_error_rate=args.max_global_error_rate,
+        over_saturation=args.over_saturation,
         console=console,
     )
     output_formats = await resolve_output_formats(

src/guidellm/benchmark/progress.py

Lines changed: 2 additions & 3 deletions
@@ -12,7 +12,6 @@
 
 from abc import ABC, abstractmethod
 from dataclasses import dataclass
-from datetime import datetime
 from typing import Any, Generic, Literal
 
 from rich.console import Group
@@ -37,7 +36,7 @@
     GenerativeBenchmarkAccumulator,
 )
 from guidellm.scheduler import SchedulerState, SchedulingStrategy
-from guidellm.utils import Colors, format_value_display
+from guidellm.utils import Colors, format_value_display, safe_format_timestamp
 
 __all__ = ["BenchmarkerProgress", "GenerativeConsoleBenchmarkerProgress"]
 
@@ -390,7 +389,7 @@ def formatted_start_time(self) -> str:
         if self.start_time < 0.0:
             return "--:--:--"
 
-        return datetime.fromtimestamp(self.start_time).strftime("%H:%M:%S")
+        return safe_format_timestamp(self.start_time, format_="%H:%M:%S")
 
     @property
     def formatted_progress_status(self) -> str:

src/guidellm/benchmark/schemas/generative/entrypoints.py

Lines changed: 8 additions & 0 deletions
@@ -283,6 +283,14 @@ def get_default(cls: type[BenchmarkGenerativeTextArgs], field: str) -> Any:
     max_global_error_rate: float | None = Field(
         default=None, description="Maximum global error rate (0-1) before stopping"
     )
+    over_saturation: dict[str, Any] | None = Field(
+        default=None,
+        description=(
+            "Over-saturation detection configuration. A dict with configuration "
+            "parameters (enabled, min_seconds, max_window_seconds, "
+            "moe_threshold, etc.)."
+        ),
+    )
 
     @field_validator("data", "data_args", "rate", mode="wrap")
     @classmethod

src/guidellm/scheduler/__init__.py

Lines changed: 4 additions & 0 deletions
@@ -19,6 +19,8 @@
     MaxErrorsConstraint,
     MaxGlobalErrorRateConstraint,
     MaxNumberConstraint,
+    OverSaturationConstraint,
+    OverSaturationConstraintInitializer,
     PydanticConstraintInitializer,
     SerializableConstraintInitializer,
     UnserializableConstraintInitializer,
@@ -66,6 +68,8 @@
     "MaxNumberConstraint",
     "MultiTurnRequestT",
     "NonDistributedEnvironment",
+    "OverSaturationConstraint",
+    "OverSaturationConstraintInitializer",
     "PydanticConstraintInitializer",
     "RequestT",
     "ResponseT",
