Background
TraceLens's core philosophy (as outlined in trace2tree_motivation.md) is to analyze kernels through the lens of top-level CPU operation names. However, in some cases kernel name analysis is needed: CPU op information may be unavailable, may not yet be implemented for certain frameworks, or additional kernel-level categorization may provide valuable insight on its own.
For example:
- Graph Launch Operations: `cudaGraphLaunch`/`hipGraphLaunch` may contain heterogeneous kernels (GEMM, attention, elementwise ops, etc.)
- JAX/XLA Analysis: kernel-level categorization helps distinguish operation types
Existing Implementation
TraceLens already has kernel name categorization for JAX analysis in util.py#L333:
```python
class JaxOpKeys:
    # keywords for splitting jax events
    GemmKeys = ["Cijk", "gemm", "nvjet", "cublasLt"]
    FABwdKeys = ["FmhaBwd", "flash_bprop", "ck_fused_attn::dk_dv_reduce_thd", "fmha_bwd"]
    FAFwdKeys = ["FmhaFwd", "flash_fprop", "fmha_fwd"]
    ConvKeys = ["FillBuffer", "conv_", "conv.", "conv-"]
    CommunicationKeys = ["rccl", "nccl"]
    # ... more categories
```
This categorization is currently specific to JAX workflows and not exposed as a general feature.
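For reference, the `JaxOpKeys` keyword lists above amount to a simple first-match substring categorizer. A general-purpose version might look like the following sketch (the function name, dict layout, and category labels are illustrative, not existing TraceLens API):

```python
# Hypothetical general keyword-based kernel categorizer, modeled on the
# JaxOpKeys keyword lists above. Category order matters: the first category
# with a matching keyword wins.
KERNEL_CATEGORY_KEYWORDS = {
    "GEMM": ["Cijk", "gemm", "nvjet", "cublasLt"],
    "Flash Attention Backward": ["FmhaBwd", "flash_bprop", "ck_fused_attn::dk_dv_reduce_thd", "fmha_bwd"],
    "Flash Attention Forward": ["FmhaFwd", "flash_fprop", "fmha_fwd"],
    "Convolution": ["FillBuffer", "conv_", "conv.", "conv-"],
    "Communication": ["rccl", "nccl"],
}

def categorize_kernel(kernel_name: str, default: str = "Other") -> str:
    """Return the first category whose keyword appears in the kernel name."""
    for category, keywords in KERNEL_CATEGORY_KEYWORDS.items():
        if any(key in kernel_name for key in keywords):
            return category
    return default
```

Substring matching keeps the categorizer framework-agnostic: it only needs the demangled kernel name, not any CPU-side metadata.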
Proposal
To maintain TraceLens's charter as a one-stop solution for trace analysis, we should provide optional kernel name categorization in the GPU event analysis pipeline, specifically in TreePerfAnalyzer's kernel summary tables. This would add a kernel_category column based on built-in categorization patterns, enabling better grouping and analysis.
Proposed API
```python
# Simple opt-in via boolean flag
perf_analyzer = TreePerfAnalyzer(
    trace_file,
    use_kernel_name_categories=True,  # optional, defaults to False
)

# Also available in reporting
generate_perf_report_pytorch(
    trace_file,
    use_kernel_name_categories=True,
)
```
Example: Before and After
Before (current behavior)
Kernel summary table for cudaGraphLaunch operations:
| cpu_op | kernel_name | count | total_time_ms | avg_time_us |
|--------|-------------|-------|---------------|-------------|
| graph | kernel_mha | 1 | 0.15 | 150 |
| graph | void tensorrt_llm::common::scaleMatrix<...> | 1 | 0.08 | 80 |
| graph | nvjet_tst_320x128_64x3_1x2_h_bz_coopB_TNT | 1 | 0.22 | 220 |
Issue: All kernels are grouped under the generic "graph" CPU op name, making it difficult to see which types of operations dominate.
After (with kernel categorization enabled)
Kernel summary table with kernel_category column:
| cpu_op | kernel_category | kernel_name | count | total_time_ms | avg_time_us |
|--------|-----------------|-------------|-------|---------------|-------------|
| graph | Flash Attention Forward | kernel_mha | 1 | 0.15 | 150 |
| graph | Memory Ops | void tensorrt_llm::common::scaleMatrix<...> | 1 | 0.08 | 80 |
| graph | GEMM | nvjet_tst_320x128_64x3_1x2_h_bz_coopB_TNT | 1 | 0.22 | 220 |
Grouped by kernel_category:
| cpu_op | kernel_category | count | total_time_ms | % of graph time |
|--------|-----------------|-------|---------------|-----------------|
| graph | GEMM | 1 | 0.22 | 48.9% |
| graph | Flash Attention Forward | 1 | 0.15 | 33.3% |
| graph | Memory Ops | 1 | 0.08 | 17.8% |
Benefit: Immediately see that GEMM operations dominate the graph launch, even though all have the same CPU op parent.
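The grouped view can be derived from the per-kernel summary with a straightforward pandas aggregation. A sketch under the assumption that the summary is a DataFrame with the column names from the example tables (the grouping code itself is illustrative, not an existing TraceLens call):

```python
import pandas as pd

# Per-kernel rows matching the "After" table above.
df = pd.DataFrame({
    "cpu_op": ["graph"] * 3,
    "kernel_category": ["Flash Attention Forward", "Memory Ops", "GEMM"],
    "total_time_ms": [0.15, 0.08, 0.22],
})

# Aggregate by CPU op and kernel category, then express each category's
# share of the parent op's total time.
grouped = (
    df.groupby(["cpu_op", "kernel_category"], as_index=False)
      .agg(count=("total_time_ms", "size"), total_time_ms=("total_time_ms", "sum"))
)
grouped["pct_of_op_time"] = (
    grouped["total_time_ms"]
    / grouped.groupby("cpu_op")["total_time_ms"].transform("sum")
    * 100
).round(1)
```

With the example numbers this reproduces the 48.9% / 33.3% / 17.8% split shown in the grouped table.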
Implementation Approach
Incremental implementation:
- This Issue: Add kernel name categorization directly in the GPU event analyzer, then use it in `TreePerfAnalyzer` and the perf report generation.
- Future PRs:
  - Refactor the JAX-specific `JaxOpKeys` categorization logic (in `util.py`) to use the new general categorizer
  - Let other analysis modules leverage the same categorizer as needed
  - Add an extension point so users can customize categorization further
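A sketch of what the future extension point might look like: user-supplied keyword patterns merged with the built-in lists. All names here (`KernelCategorizer`, `BUILTIN_KEYWORDS`, the constructor signature) are hypothetical, not current TraceLens API:

```python
from typing import Dict, List, Optional

# Built-in keyword lists (illustrative subset; the real lists would come
# from the existing JaxOpKeys-style definitions).
BUILTIN_KEYWORDS: Dict[str, List[str]] = {
    "GEMM": ["gemm", "nvjet", "cublasLt"],
    "Communication": ["rccl", "nccl"],
}

class KernelCategorizer:
    """First-match substring categorizer with user-supplied extra keywords."""

    def __init__(self, extra_keywords: Optional[Dict[str, List[str]]] = None):
        # Copy built-ins so user additions never mutate the shared defaults.
        self.keywords = {cat: list(keys) for cat, keys in BUILTIN_KEYWORDS.items()}
        for category, keys in (extra_keywords or {}).items():
            self.keywords.setdefault(category, []).extend(keys)

    def __call__(self, kernel_name: str) -> str:
        for category, keys in self.keywords.items():
            if any(k in kernel_name for k in keys):
                return category
        return "Other"
```

Users could then either add entirely new categories or extend an existing category's keyword list, without touching the built-in defaults.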