TraceLens is a Python library for automating the analysis of trace files and turning them into rich performance insights. Designed with simplicity and extensibility in mind, it provides tools that streamline profiling and debugging of complex distributed training and inference systems.
- **Hierarchical Performance Breakdowns** — Pinpoint bottlenecks with a top-down view, moving from the overall GPU timeline (idle/busy) to operator categories, individual operators, and right down to unique argument shapes.
- **Compute & Roofline Modeling** — Automatically translate raw timings into efficiency metrics like TFLOP/s and TB/s for popular operations. Determine whether an op is compute- or memory-bound and see how effectively your code uses the hardware (see the worked sketch after this list).
- **Multi-GPU Communication Analysis** — Accurately diagnose scaling issues by dissecting collective operations. TraceLens separates pure communication time from synchronization skew and calculates effective bandwidth on your real workload.
- **Trace Comparison** — Quantify the impact of your changes with powerful trace diffing. By analyzing performance at the CPU dispatch level, TraceLens enables meaningful side-by-side comparisons across different hardware and software versions.
- **Event Replay** — Isolate any operation for focused debugging. TraceLens generates minimal, self-contained replay scripts directly from trace metadata, making it simple to share IP-safe test cases with kernel developers.
- **Extensible SDK** — Get started instantly with ready-to-use scripts, then build your own custom workflows using a flexible and hackable Python API.
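To make the roofline idea concrete, here is a minimal worked example of the arithmetic behind compute- vs. memory-bound classification for a GEMM. The peak numbers and the helper function are illustrative placeholders for this sketch, not part of the TraceLens API or its defaults.

```python
# Illustrative roofline arithmetic for a GEMM C[M,N] = A[M,K] @ B[K,N] (fp16 inputs).
# PEAK_* values are hypothetical placeholders for your GPU, not TraceLens defaults.
PEAK_TFLOPS = 200.0   # peak fp16 TFLOP/s (hypothetical)
PEAK_TBPS = 1.6       # peak HBM bandwidth in TB/s (hypothetical)

def gemm_roofline(M, N, K, time_us, bytes_per_elem=2):
    flops = 2 * M * N * K                                   # multiply-accumulate count
    bytes_moved = bytes_per_elem * (M * K + K * N + M * N)  # read A and B, write C
    tflops = flops / (time_us * 1e-6) / 1e12                # achieved TFLOP/s
    tbps = bytes_moved / (time_us * 1e-6) / 1e12            # achieved TB/s
    intensity = flops / bytes_moved                          # FLOP per byte
    machine_balance = (PEAK_TFLOPS * 1e12) / (PEAK_TBPS * 1e12)
    bound = "compute-bound" if intensity > machine_balance else "memory-bound"
    return tflops, tbps, bound

print(gemm_roofline(M=4096, N=4096, K=4096, time_us=900))
```

TraceLens performs this kind of translation automatically per operator; the sketch only shows where numbers like TFLOP/s, TB/s, and the compute-/memory-bound label come from.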
```bash
pip install git+https://github.com/AMD-AGI/TraceLens.git
```

```bash
TraceLens_generate_perf_report_pytorch --profile_json_path path/to/your/trace.json
```

This produces an Excel workbook with GPU timeline breakdown, ops summary, roofline metrics, and more. See Performance Report Column Definitions for what each column means.
Don't have a trace yet? Follow the PyTorch profiling guide, or use the demo traces bundled in the repo.
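For reference, a trace suitable for the command above can be captured with the standard torch.profiler API and exported as Chrome-trace JSON. This is a minimal sketch (assuming a CUDA/ROCm-capable GPU is available), not a TraceLens helper — see the PyTorch profiling guide for the recommended settings.

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(64, 1024, device="cuda")

# Capture CPU + GPU activity with shapes recorded so per-shape breakdowns
# and roofline metrics can be computed from the trace.
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    with_stack=True,
) as prof:
    for _ in range(10):
        model(x)
    torch.cuda.synchronize()

# Export the Chrome-trace JSON consumed by TraceLens_generate_perf_report_pytorch.
prof.export_chrome_trace("trace.json")
```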
For the full CLI reference (JAX, rocprofv3, trace comparison, multi-rank collectives), see Supported Profile Formats below.
Hands-on notebooks that walk through the core TraceLens features:
| Example | What it covers |
|---|---|
| Trace2Tree | Navigate the hierarchical event tree — link Python ops, CPU dispatches, and GPU kernels |
| TreePerf | GPU timeline breakdown, per-op performance, and roofline metrics via the SDK |
| NN Module View | See GPU time broken down by nn.Module — useful for model developers |
| NCCL Analyser | Multi-rank collective analysis: latency, bandwidth, skew |
| Trace Diff | Morphological comparison of two trace trees to pinpoint structural divergences |
| Event Replay | Extract and replay operations for isolated debugging |
| Trace Fusion | Merge multi-rank PyTorch traces into a single file for Perfetto visualization |
| Roofline Plots | Build roofline-style visualizations for specific operators |
| JAX NCCL Analyser | Bandwidth analysis for JAX collective operations from XPlane traces |
For community-contributed utilities — including interactive trace dashboards (traceMap), roofline analysis tooling, and a Streamlit UI — see examples/custom_workflows/.
| Format | Tool | Documentation |
|---|---|---|
| PyTorch | torch.profiler | docs/generate_perf_report.md |
| JAX | XPlane protobuf | docs/jax_analyses.md |
| rocprofv3 JSON | AMD ROCm rocprofiler-sdk | docs/generate_perf_report_rocprof.md |
| rocprofv3 pftrace | Perfetto-style | docs/generate_perf_report_rocprof_pftrace.md |
```bash
TraceLens_generate_perf_report_pytorch --profile_json_path path/to/your/trace.json
```

Detailed docs: generate_perf_report.md. Supports compressed traces (.zip, .gz).
```bash
TraceLens_compare_perf_reports_pytorch \
    baseline.xlsx candidate.xlsx \
    --names baseline candidate \
    --sheets all \
    -o comparison.xlsx
```

Detailed docs: compare_perf_reports_pytorch.md.
```bash
TraceLens_generate_multi_rank_collective_report_pytorch \
    --trace_dir /path/to/traces \
    --world_size 8
```

Detailed docs: generate_multi_rank_collective_report_pytorch.md.
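For intuition on the bandwidth figures in such a collective report, the widely used ring all-reduce bus-bandwidth arithmetic (the NCCL-tests convention) is sketched below. This is background math under that convention, not necessarily TraceLens's exact formula — consult the NCCL Analyser docs for the definitions TraceLens reports.

```python
def allreduce_bandwidth(bytes_per_rank, time_s, world_size):
    """Algorithm vs. bus bandwidth for a ring all-reduce (NCCL-tests convention)."""
    alg_bw = bytes_per_rank / time_s                      # bytes/s as seen by the caller
    bus_bw = alg_bw * 2 * (world_size - 1) / world_size   # traffic actually on the links
    return alg_bw / 1e9, bus_bw / 1e9                     # GB/s

# Example: a 256 MiB all-reduce across 8 GPUs finishing in 5 ms.
print(allreduce_bandwidth(256 * 2**20, 5e-3, 8))
```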
For `*_results.json` files from rocprofv3:
```bash
TraceLens_generate_perf_report_rocprof \
    --profile_json_path trace_results.json \
    --short_kernel_study --kernel_details
```

Detailed docs: generate_perf_report_rocprof.md.
For Perfetto-style traces (e.g. `rocprofv3 --output-format pftrace`):
```bash
# Record a pftrace
rocprofv3 --hip-trace --kernel-trace --memory-copy-trace --rccl-trace \
    --output-format pftrace -d ./v3_traces -- python3 your_app.py

# Activity report (NSYS-style category summary, optional Markdown)
TraceLens_generate_perf_report_pftrace_hip_activity --trace_path sample.pftrace --write_md

# API↔Kernel report (latency breakdown T = A + Q + K)
TraceLens_generate_perf_report_pftrace_hip_api --trace_path sample.pftrace

# Memory copy report
TraceLens_generate_perf_report_pftrace_memory_copy --trace_path sample.pftrace
```

`.pftrace` is Perfetto's binary format and needs to be converted to JSON for parsing. `traceconv` (a Perfetto tool) is optional — if it is not on PATH, it is downloaded automatically. You can also pass `--traceconv /path/to/traceconv` explicitly.
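If you want to pre-convert a pftrace yourself (for example, to inspect the JSON), Perfetto's traceconv takes a mode, an input file, and an output file. The snippet below is a minimal sketch that shells out to a `traceconv` binary assumed to already be on PATH; TraceLens does not require this manual step.

```python
import subprocess

# Convert a Perfetto binary trace to legacy Chrome JSON with traceconv.
# Assumes a traceconv binary is on PATH; TraceLens can instead download it
# automatically or accept --traceconv /path/to/traceconv.
subprocess.run(
    ["traceconv", "json", "sample.pftrace", "sample.json"],
    check=True,
)
```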
Deeper dives on the core modules:
| Module | Doc |
|---|---|
| Trace2Tree | docs/Trace2Tree.md |
| TreePerf | docs/TreePerf.md |
| NCCL Analyser | docs/NcclAnalyser.md |
| TraceDiff | docs/TraceDiff.md |
| Event Replay | docs/EventReplay.md |
| TraceFusion | docs/TraceFusion.md |
| GPU Event Analyser | docs/gpu_event_analyser.md |
| JAX Analyses | docs/jax_analyses.md |
| pftrace Reports | docs/generate_perf_report_rocprof_pftrace.md |
| Performance Report Columns | docs/perf_report_columns.md |
```bash
git clone https://github.com/AMD-AGI/TraceLens.git && cd TraceLens
pip install -e .[dev]
python -m pytest tests/ -v
```

We welcome contributions across the entire project — new analysis modules, performance models, docs, examples, or bug fixes. Whether you're adding a new metric or building a custom workflow, the SDK is designed to make that easy.
Please see CONTRIBUTING.md for guidelines on branching, commit style, and project structure.