Commit 0cffffa

juliayakovlev authored and fruch committed

improvement(sct utility): enhance the analysis of HDR files

Adding a new SCT utility to enhance the analysis of HDR files by running build_histograms_summary_with_interval with a shorter interval. This will help identify the specific process causing high latency.

(cherry picked from commit 968cb0d)

1 parent 5eb8c94 · commit 0cffffa

File tree

4 files changed: +194 −13 lines changed

- README.md
- docs/performance-tests.md
- sct.py
- sdcm/utils/common.py

README.md

Lines changed: 1 addition & 11 deletions

````diff
@@ -165,14 +165,4 @@ the available configuration options are listed in [configuration_options](./docs
 ### Manager Tests (TODO: write explanation for them)
 ### [K8S Functional Tests](./docs/k8s-functional-test.md)
 ### [Microbenchmarking Tests](./docs/microbenchmarking.md)
-
-# Local developer setup
-
-* Requires [uv](https://github.com/astral-sh/uv#installation)
-
-```bash
-uv python install
-uv venv
-export UV_PROJECT_ENVIRONMENT=.venv
-uv sync
-```
+### [Performance Tests](./docs/performance-tests.md)
````

docs/performance-tests.md

Lines changed: 40 additions & 0 deletions

````diff
@@ -0,0 +1,40 @@
+# Performance Tests
+
+## HDR investigate utility
+
+The hdr_investigate utility is essential for performance analysis because it allows users to scan HDR (High Dynamic Range) histogram files
+with fine-grained time intervals, rather than only looking at overall or coarse-grained metrics.
+By analyzing latency metrics (such as P99) in smaller intervals, the tool helps pinpoint the exact time windows where latency spikes occur.
+This makes it possible to correlate these spikes with specific events or Scylla processes, enabling users to identify which Scylla process
+or operation is causing performance problems.
+This targeted approach greatly improves the efficiency and accuracy of performance troubleshooting in distributed database environments.
+
+Key features:
+- Supports multiple stress tools and operations (READ/WRITE).
+- Can fetch HDR files from Argus by test ID or use a local folder.
+- Allows specifying the time window and scan interval for analysis.
+- Reports intervals where P99 latency exceeds a user-defined threshold.
+
+Usage example:
+
+```bash
+hydra hdr-investigate \
+  --stress-operation READ \
+  --throttled-load true \
+  --test-id 8732ecb1-7e1f-44e7-b109-6d789b15f4b5 \
+  --start-time "2025-09-14\ 20:45:18" \
+  --duration-from-start-min 30
+```
+
+Main options:
+- --test-id: Test run identifier (fetches logs from Argus if --hdr-folder is not provided).
+- --stress-tool: Name of the stress tool (cassandra-stress, scylla-bench, or latte) (default: cassandra-stress).
+- --stress-operation: Operation type (READ or WRITE).
+- --throttled-load: Whether the load was throttled (True or False).
+- --start-time: Start time for analysis (format: YYYY-MM-DD\ HH:MM:SS).
+- --duration-from-start-min: Duration in minutes to analyze from the start time.
+- --error-threshold-ms: P99 latency threshold in milliseconds (default: 10).
+- --hdr-summary-interval-sec: Interval in seconds for summary scan (default: 600).
+- --hdr-folder: Path to local folder with HDR files (optional).
+
+This utility is useful for performance engineers and developers investigating latency issues in distributed database clusters.
````
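
For orientation, the utility reports detected spikes as a PrettyTable. A rough sketch of the output is shown below, using the illustrative values from the example comment in sct.py; the exact borders and column widths drawn by PrettyTable may differ:

```
Found P99 spikes higher than 10 ms for tags ['READ-st'] with interval 600 seconds

+---------------------------------------------------------------+
|                       HDR Latency Spikes                      |
+---------+--------------------------------------------+--------+
| Tag     | Timeframe                                  | P99    |
+---------+--------------------------------------------+--------+
| READ-st | 2025-08-25 10:22:40 - 2025-08-25 10:32:40 | 487.85 |
+---------+--------------------------------------------+--------+
```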

sct.py

Lines changed: 105 additions & 2 deletions

````diff
@@ -13,7 +13,7 @@
 # Copyright (c) 2021 ScyllaDB
 
 from collections import defaultdict
-from datetime import datetime, timezone, timedelta
+from datetime import datetime, timezone, timedelta, UTC
 import os
 import re
 import sys
@@ -51,6 +51,7 @@
 from sdcm.sct_provision.instances_provider import provision_sct_resources
 from sdcm.sct_runner import AwsSctRunner, GceSctRunner, AzureSctRunner, get_sct_runner, clean_sct_runners, \
     update_sct_runner_tags, list_sct_runners
+
 from sdcm.utils.ci_tools import get_job_name, get_job_url
 from sdcm.utils.git import get_git_commit_id, get_git_status_info
 from sdcm.utils.argus import argus_offline_collect_events, create_proxy_argus_s3_url, get_argus_client
@@ -83,7 +84,8 @@
     list_resources_docker,
     list_parallel_timelines_report_urls,
     search_test_id_in_latest,
-    get_latest_scylla_release, images_dict_in_json_format,
+    get_latest_scylla_release, images_dict_in_json_format, get_hdr_tags,
+    download_and_unpack_logs,
 )
 from sdcm.utils.nemesis_generation import generate_nemesis_yaml, NemesisJobGenerator
 from sdcm.utils.open_with_diff import OpenWithDiff, ErrorCarrier
@@ -1933,6 +1935,107 @@ def upload_artifact_file(test_id: str, file_path: str, use_argus: bool, public:
     return
 
 
+@cli.command("hdr-investigate", help="Analyze HDR files for latency spikes.\n"
+                                     "Usage example:\n"
+                                     "hydra hdr-investigate --stress-operation READ --throttled-load true "
+                                     "--test-id 8732ecb1-7e1f-44e7-b109-6d789b15f4b5 --start-time \"2025-09-14\\ 20:45:18\" "
+                                     "--duration-from-start-min 30")
+@click.option("--test-id", type=str, required=False,
+              help="If hdr-folder is not provided, logs will be downloaded from Argus using this test_id")
+@click.option("--stress-tool", default="cassandra-stress", required=False,
+              type=click.Choice(['cassandra-stress', 'scylla-bench', 'latte'], case_sensitive=False),
+              help="Stress tool name. Supported tools: cassandra-stress|scylla-bench|latte")
+@click.option("--stress-operation", required=True,
+              type=click.Choice(["READ", "WRITE"], case_sensitive=False),
+              help="Supported stress operations: READ|WRITE")
+@click.option("--throttled-load", type=bool, required=True, help="Whether the load is throttled")
+@click.option("--start-time", type=str, required=True, help="Start time in format 'YYYY-MM-DD\\ HH:MM:SS'")
+@click.option("--duration-from-start-min", type=int, required=True,
+              help="Time period in minutes in the HDR file to investigate, starting from start-time")
+@click.option("--error-threshold-ms", type=int, default=10, required=False,
+              help="Error threshold in ms for P99 to consider it a spike")
+@click.option("--hdr-summary-interval-sec", type=int, default=600, required=False, help="Interval in seconds for the scan")
+@click.option("--hdr-folder", type=str, default=None, required=False, help="Path to folder with HDR files")
+def hdr_investigate(test_id: str, stress_tool: str, stress_operation: str, throttled_load: bool, start_time: str,
+                    duration_from_start_min: int, error_threshold_ms: int, hdr_summary_interval_sec: int,
+                    hdr_folder: str) -> None:
+    """
+    Analyze HDR files for latency spikes.
+
+    Scans HDR files to identify the intervals with the highest P99 latency.
+
+    Args:
+        test_id (str): Test ID. If `hdr_folder` is not provided, logs will be downloaded from Argus using this ID
+            to the /tmp/<test-id> folder.
+        stress_tool (str): Name of the stress tool. Supported: cassandra-stress, scylla-bench, latte.
+        stress_operation (str): Stress operation type. Supported: READ, WRITE.
+        throttled_load (bool): Whether the load is throttled.
+        start_time (str): Start time in format 'YYYY-MM-DD HH:MM:SS'.
+        duration_from_start_min (int): Time period in minutes in the HDR file to investigate, starting from `start_time`.
+        error_threshold_ms (int): Error threshold in ms for P99 to consider it a spike.
+        hdr_summary_interval_sec (int): Interval in seconds for the scan.
+        hdr_folder (str): Path to folder with HDR files. If not provided, logs will be downloaded from Argus.
+
+    Returns:
+        None
+
+    Usage example:
+        hydra hdr-investigate --stress-operation READ --throttled-load true --test-id 8732ecb1-7e1f-44e7-b109-6d789b15f4b5
+        --start-time \"2025-09-14\\ 20:45:18\" --duration-from-start-min 30
+    """
+    from sdcm.utils.hdrhistogram import make_hdrhistogram_summary_by_interval
+    stress_operation = stress_operation.upper()
+
+    try:
+        start_time_ms = datetime.strptime(start_time, '%Y-%m-%d %H:%M:%S').replace(tzinfo=UTC).timestamp()
+    except ValueError:
+        raise ValueError("start_time must be in 'YYYY-MM-DD HH:MM:SS' format")
+
+    if not hdr_folder:
+        if not test_id:
+            raise ValueError("Either test_id or hdr_folder must be provided")
+
+        hdr_folder = download_and_unpack_logs(test_id, log_type='loader-set')
+
+    hdr_tags = get_hdr_tags(stress_tool, stress_operation, throttled_load)
+
+    # Step 1: Coarse scan
+    end_time_ms = start_time_ms + duration_from_start_min * 60
+    hdr_summaries = make_hdrhistogram_summary_by_interval(
+        hdr_tags, stress_operation, hdr_folder, start_time_ms, end_time_ms, interval=hdr_summary_interval_sec
+    )
+
+    summaries = []
+    for tag in hdr_tags:
+        # Find intervals for this tag
+        tag_hdr_summaries = [
+            summary.get(f"{stress_operation}--{tag}", {})
+            for summary in hdr_summaries
+        ]
+        # Keep intervals where percentile_99 exceeds the error threshold, i.e. spikes
+        tag_summaries = [
+            s for s in tag_hdr_summaries if s and "percentile_99" in s and s["percentile_99"] > error_threshold_ms
+        ]
+        # Sort by percentile_99 descending
+        tag_summaries.sort(key=lambda x: x["percentile_99"], reverse=True)
+        for operation in tag_summaries:
+            # Convert to human-readable time
+            dt_start = datetime.fromtimestamp(operation["start_time"] / 1000, UTC).strftime('%Y-%m-%d %H:%M:%S')
+            dt_end = datetime.fromtimestamp(operation["end_time"] / 1000, UTC).strftime('%Y-%m-%d %H:%M:%S')
+            summaries.append({"tag": tag, "spikes": {f"{dt_start} - {dt_end}": operation['percentile_99']}})
+
+    hdr_table = PrettyTable(["Tag", "Timeframe", "P99"])
+    hdr_table.align = "l"
+    for group in summaries:
+        # Example group: {'tag': 'READ-st', 'spikes': {'2025-08-25 10:22:40 - 2025-08-25 10:32:40': 487.85}}
+        for time_frame, p99 in group["spikes"].items():
+            hdr_table.add_row([group["tag"], time_frame, p99])
+    click.echo(
+        f"\nFound P99 spikes higher than {error_threshold_ms} ms for tags {hdr_tags} with interval {hdr_summary_interval_sec} seconds\n")
+    click.echo(hdr_table.get_string(title="HDR Latency Spikes"))
+
+
 cli.add_command(sct_ssh.ssh)
 cli.add_command(sct_ssh.tunnel)
 cli.add_command(sct_ssh.copy_cmd)
````
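
To make the spike-filtering step concrete, here is a minimal sketch of the per-interval summary shape the loop above assumes. The key format "OPERATION--tag" and the fields percentile_99, start_time, end_time (epoch milliseconds) are taken from the code; the actual objects returned by make_hdrhistogram_summary_by_interval may carry additional fields.

```python
# Sketch only: one element of `hdr_summaries`, as the spike-filtering loop assumes it.
# Keys follow the "<OPERATION>--<tag>" format; timestamps are epoch milliseconds.
interval_summary = {
    "READ--READ-st": {
        "start_time": 1756117360000,  # 2025-08-25 10:22:40 UTC
        "end_time": 1756117960000,    # 2025-08-25 10:32:40 UTC (one 600 s interval later)
        "percentile_99": 487.85,      # P99 latency for this interval, in ms
    },
}

# The loop keeps only intervals whose P99 exceeds the threshold:
error_threshold_ms = 10
spikes = {
    key: stats
    for key, stats in interval_summary.items()
    if stats.get("percentile_99", 0) > error_threshold_ms
}
print(spikes)  # {'READ--READ-st': {...}} -> reported as a spike
```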

sdcm/utils/common.py

Lines changed: 48 additions & 0 deletions

````diff
@@ -2568,3 +2568,51 @@ def format_size(size_in_bytes):
         if size_in_bytes < 1024:
             return f"{size_in_bytes:.2f} {unit}"
         size_in_bytes /= 1024
+
+
+def get_hdr_tags(stress_tool: str, stress_operation: str, throttled_load: bool) -> list:
+    match stress_tool:
+        case "cassandra-stress":
+            suffix = "-rt" if throttled_load else "-st"
+            if stress_operation == "MIXED":
+                hdr_tags = [f"WRITE{suffix}", f"READ{suffix}"]
+            elif stress_operation == "READ":
+                hdr_tags = [f"READ{suffix}"]
+            elif stress_operation == "WRITE":
+                hdr_tags = [f"WRITE{suffix}"]
+            else:
+                raise ValueError(f"Unsupported stress_operation: {stress_operation}")
+        case "scylla-bench":
+            # TODO: will be defined later
+            raise NotImplementedError("get_hdr_tags: 'scylla-bench' is not yet supported.")
+        case "latte":
+            # TODO: will be defined later
+            raise NotImplementedError("get_hdr_tags: 'latte' is not yet supported.")
+        case _:
+            raise NotImplementedError(f"get_hdr_tags: unsupported stress tool '{stress_tool}'.")
+
+    return hdr_tags
+
+
+def download_and_unpack_logs(test_id: str, log_type: str, download_to: str = None) -> str:
+    logs_links = list_logs_by_test_id(test_id)
+    tmp_dir = download_to or os.path.join('/tmp/', test_id)
+    if not os.path.exists(tmp_dir):
+        os.makedirs(tmp_dir)
+    logs_file = ""
+    for log in logs_links:
+        if log["type"] == log_type:
+            logs_file = S3Storage().download_file(link=log["link"], dst_dir=tmp_dir)
+            LOGGER.debug("Downloaded %s log to %s", log_type, logs_file)
+
+    if not logs_file:
+        raise ValueError(f"{log_type} not found in argus logs")
+
+    LOGGER.debug("Unpacking loader logs...")
+    from sdcm.monitorstack import extract_file_from_tar_archive
+    hdr_folder = extract_file_from_tar_archive(pattern=log_type, archive=logs_file, extract_dir=tmp_dir)
+    if not hdr_folder:
+        raise ValueError(f"Failed to unpack logs {logs_file} for test_id {test_id}")
+    LOGGER.debug("%s logs unpacked to %s", log_type, hdr_folder[test_id])
+
+    return hdr_folder[test_id]
````
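
A quick illustration of the tag mapping, derived directly from the cassandra-stress branch of get_hdr_tags above (the other tools currently raise NotImplementedError):

```python
# Illustrative only: expected results of get_hdr_tags for cassandra-stress.
# Unthrottled loads map to "-st" tags, throttled loads to "-rt" tags.
assert get_hdr_tags("cassandra-stress", "READ", throttled_load=False) == ["READ-st"]
assert get_hdr_tags("cassandra-stress", "WRITE", throttled_load=True) == ["WRITE-rt"]
assert get_hdr_tags("cassandra-stress", "MIXED", throttled_load=True) == ["WRITE-rt", "READ-rt"]
```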
