feat: Add TraceLens integration for trace analysis with MLflow upload by gphuang · Pull Request #439 · AMD-AGI/Primus

gphuang · 2025-12-18T09:07:49Z

feat: Add TraceLens integration for trace analysis with MLflow upload

Adds TraceLens trace analysis capability to automatically generate performance
reports from PyTorch profiler traces and upload them to MLflow.

Addresses review feedback and adds tests for TraceLens report generation and MLflow artifact upload. Keeps MLflow opt-in, makes local-only TraceLens work without MLflow, and tightens safety/docs.

Features

- Generate TraceLens analysis reports (XLSX, CSV formats)
- Auto-install TraceLens from GitHub if not present
- Upload reports to MLflow artifacts/trace_analysis/

Config Options

mlflow_upload_tracelens_report: false # Enable TraceLens analysis
mlflow_tracelens_output_format: all # all, xlsx, or csv

Example

```bash
export CONFIG_NAME="deepseek_v2_lite-FP8-pretrain"
export EXP="examples/megatron/configs/MI300X/${CONFIG_NAME}.yaml"
export MLFLOW_RUN_NAME="${CONFIG_NAME}.single-node.baseline"

# Profile steps 2-4 on all ranks, then upload metrics and the TraceLens report to MLflow.
bash ./examples/run_pretrain.sh \
    --train_iters=5 \
    --profile_step_start=2 \
    --profile_step_end=4 \
    --profile_ranks=ALL \
    --mlflow_run_name="${MLFLOW_RUN_NAME}" \
    --mlflow_experiment_name=/Performance-data/Megatron-LM/primus-test \
    --mlflow_upload_performance_metrics=True \
    --mlflow_upload_tracelens_report=True
```

Output
Single node run
Multi (2) node

- Add TraceLens trace analysis report generation (XLSX, CSV formats)
- Add mlflow_upload_tracelens_report config option (default: false)
- Add mlflow_tracelens_ranks, mlflow_tracelens_max_reports options
- Add mlflow_tracelens_output_format option (all, xlsx, csv)
- Auto-install TraceLens from GitHub if not present
- Upload analysis reports to MLflow artifacts/trace_analysis/


Copilot

Pull request overview

This PR adds TraceLens integration to automatically generate performance analysis reports from PyTorch profiler traces and upload them to MLflow. TraceLens is auto-installed from GitHub if not present, and users can configure rank filtering, report limits, and output formats (XLSX/CSV/HTML).

Key changes:

- New module for MLflow artifact management with TraceLens integration
- Automatic TraceLens installation from GitHub with fallback CSV generation
- Configuration options to control trace analysis (ranks, max reports, output formats)
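The install-if-missing behaviour with a fallback can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `ensure_module`, `_pip_install`, and the injected `installer` callable are hypothetical names, and the pip requirement string is left to the caller because the exact TraceLens install source is not shown here.

```python
import importlib
import importlib.util
import subprocess
import sys
from typing import Callable


def _pip_install(requirement: str) -> None:
    """Default installer: run pip inside the current interpreter."""
    subprocess.check_call([sys.executable, "-m", "pip", "install", requirement])


def ensure_module(
    module_name: str,
    requirement: str,
    installer: Callable[[str], None] = _pip_install,
) -> bool:
    """Return True if module_name is importable, trying one install if it is not.

    Hypothetical helper: a failed install returns False so the caller can
    fall back (for example to a simpler CSV summary) instead of crashing.
    """
    if importlib.util.find_spec(module_name) is not None:
        return True
    try:
        installer(requirement)
    except (subprocess.CalledProcessError, OSError) as exc:
        print(f"[warning] could not install {requirement!r}: {exc}")
        return False
    # Make a freshly installed package visible to the import system.
    importlib.invalidate_caches()
    return importlib.util.find_spec(module_name) is not None
```

Injecting the installer keeps the check testable without network access; a caller would branch on the boolean to decide between the full report and the fallback path.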

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 14 comments.

| File | Description |
| --- | --- |
| primus/backends/megatron/training/mlflow_artifacts.py | New 725-line module implementing trace/log file uploads and TraceLens report generation with fallback CSV summary |
| primus/backends/megatron/training/global_vars.py | Adds import and wrapper function `upload_mlflow_artifacts` to expose artifact upload functionality |
| primus/modules/trainer/megatron/trainer.py | Calls `upload_mlflow_artifacts` before ending MLflow run with configuration parameters from args |
| primus/configs/modules/megatron/primus_megatron_module.yaml | Adds 6 new configuration options for controlling trace/log uploads and TraceLens analysis |

Comments suppressed due to low confidence (2)

primus/backends/megatron/training/mlflow_artifacts.py:382

Variable dfs is not used.

            dfs = generate_perf_report_pytorch(trace_file, output_xlsx_path=xlsx_path)

primus/backends/megatron/training/mlflow_artifacts.py:370

This assignment to 'dfs' is unnecessary as it is redefined before this value is used.

            dfs = generate_perf_report_pytorch(trace_file, output_csvs_dir=csv_subdir)


Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 20 comments.


Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.


Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 8 comments.


…config parser

Addresses Copilot review comment: if mlflow_tracelens_ranks is configured as a string in YAML (e.g., '[0,8]' instead of [0, 8]), the code would receive a string instead of a list, causing _filter_traces_by_rank to silently filter out all trace files.

Added ast.literal_eval() conversion in:
- generate_tracelens_reports()
- upload_tracelens_reports_to_mlflow()

Falls back to None (process all ranks) with a warning if parsing fails.
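A minimal sketch of the string-to-list conversion described in this commit message. The helper name `normalize_ranks` is hypothetical, and the handling of a bare integer is an added assumption rather than something the commit states.

```python
import ast
import warnings
from typing import List, Optional, Union


def normalize_ranks(ranks: Union[None, str, int, list, tuple]) -> Optional[List[int]]:
    """Coerce a rank filter read from YAML into a list of ints.

    Returns None (meaning "process all ranks") when the value is missing
    or cannot be interpreted.
    """
    if ranks is None:
        return None
    if isinstance(ranks, str):
        try:
            # literal_eval only accepts Python literals, so "[0,8]" -> [0, 8]
            # while arbitrary expressions are rejected.
            ranks = ast.literal_eval(ranks)
        except (ValueError, SyntaxError):
            warnings.warn(f"Cannot parse ranks {ranks!r}; processing all ranks")
            return None
    # Assumption: a single integer is treated as a one-element filter.
    if isinstance(ranks, int) and not isinstance(ranks, bool):
        return [ranks]
    if isinstance(ranks, (list, tuple)) and all(
        isinstance(r, int) and not isinstance(r, bool) for r in ranks
    ):
        return list(ranks)
    warnings.warn(f"Unexpected ranks value {ranks!r}; processing all ranks")
    return None
```

Normalizing once at the entry points means the rank filter downstream always sees either a list of integers or None, so a quoted YAML value can no longer filter out every trace silently.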

When output_format='all', previously the trace file was parsed twice:
- Once for XLSX generation
- Once for CSV generation

Now when format is 'all', we call generate_perf_report_pytorch once with both output_xlsx_path and output_csvs_dir parameters, parsing the trace file only once and generating both formats from the same data.

This improves performance significantly for the common use case of generating both report formats.
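The single-call approach from this commit message might look like the sketch below. The keyword names `output_xlsx_path` and `output_csvs_dir` are the ones quoted in this PR; everything else (the helper names, the output file naming, and passing the report function in as `report_fn` instead of importing TraceLens) is an assumption made so the example stays self-contained.

```python
from pathlib import Path
from typing import Any, Callable, Dict

SUPPORTED_FORMATS = {"all", "xlsx", "csv"}


def build_report_kwargs(trace_file: str, out_dir: str, output_format: str) -> Dict[str, str]:
    """Translate an output format into keyword arguments for one report call."""
    if output_format not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported output format: {output_format!r}")
    # Hypothetical naming scheme: use the part of the file name before the first dot.
    stem = Path(trace_file).name.split(".")[0]
    kwargs: Dict[str, str] = {}
    if output_format in {"all", "xlsx"}:
        kwargs["output_xlsx_path"] = str(Path(out_dir) / f"{stem}_analysis.xlsx")
    if output_format in {"all", "csv"}:
        kwargs["output_csvs_dir"] = str(Path(out_dir) / stem)
    return kwargs


def generate_report(
    trace_file: str,
    out_dir: str,
    output_format: str,
    report_fn: Callable[..., Any],
) -> Any:
    """Invoke report_fn exactly once, so the trace is parsed a single time."""
    return report_fn(trace_file, **build_report_kwargs(trace_file, out_dir, output_format))
```

In a real run, `report_fn` would be TraceLens's `generate_perf_report_pytorch`; with format `all`, both keyword arguments are passed in the same call.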


Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

After TraceLens reports are successfully uploaded to MLflow, the local tracelens_reports directory is automatically cleaned up to save disk space. This addresses the issue of temporary directories not being cleaned up after artifact upload. The reports remain accessible in MLflow while freeing up local storage.

Other directories checked:
- tensorboard_dir: Contains original trace files, NOT temporary
- exp_root_path/logs: Contains original log files, NOT temporary
- tracelens_reports: Processed reports uploaded to MLflow, safe to cleanup

Added mlflow_tracelens_cleanup_after_upload parameter to control whether local TraceLens reports are removed after upload to MLflow.

- Default: True (cleanup to save disk space)
- Set to False to keep reports locally for inspection/debugging

Changes:
- Added cleanup_after_upload parameter to upload_tracelens_reports_to_mlflow()
- Added tracelens_cleanup_after_upload to upload_artifacts_to_mlflow()
- Added mlflow_tracelens_cleanup_after_upload config in YAML (default: true)
- Updated trainer to pass through the parameter

Use cases:
- True (default): Production runs, save disk space
- False: Development/debugging, keep local copies for inspection
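The upload-then-cleanup behaviour controlled by this flag can be illustrated with the sketch below. It is not the PR's implementation: `upload_then_cleanup` and the injected `upload_fn` are hypothetical, chosen so the example runs without an MLflow server.

```python
import shutil
from pathlib import Path
from typing import Callable


def upload_then_cleanup(
    reports_dir: str,
    upload_fn: Callable[[str], None],
    cleanup_after_upload: bool = True,
) -> bool:
    """Upload a reports directory and optionally delete the local copy.

    Returns True only when the local directory was actually removed.
    """
    path = Path(reports_dir)
    if not path.is_dir():
        return False
    # If the upload raises, the exception propagates and the cleanup below
    # is skipped, so the local reports are kept for a retry.
    upload_fn(str(path))
    if cleanup_after_upload:
        shutil.rmtree(path, ignore_errors=True)
        return not path.exists()
    return False
```

With MLflow, `upload_fn` could be something like `lambda d: mlflow.log_artifacts(d, artifact_path="trace_analysis")`. Only the generated reports directory should be passed in; the original traces and logs are not temporary and are left alone.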

Update docs to state the last rank (writer) performs TraceLens artifact uploads in distributed runs.

Co-authored-by: Cursor <cursoragent@cursor.com>
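As a small illustration of the "last rank is the writer" convention mentioned here, a guard like the following could gate the upload. The function name is hypothetical, and the PR's own check may be implemented differently.

```python
def is_mlflow_writer(rank: int, world_size: int) -> bool:
    """Return True only on the last rank, assumed here to own the MLflow run."""
    if world_size < 1 or not 0 <= rank < world_size:
        raise ValueError(f"invalid rank {rank} for world size {world_size}")
    return rank == world_size - 1
```

In a 2-node run with 8 ranks per node, only rank 15 would perform the upload under this convention; all other ranks skip it.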

Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 11 comments.


Align mlflow_setup.py docstring with the actual default of 'xlsx'.

Co-authored-by: Cursor <cursoragent@cursor.com>


Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.


Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.


gphuang · 2026-02-18T07:49:53Z

@wenxie-amd @Xiaoming-AMD @limou102 Could you please review?

Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (2)

primus/configs/modules/megatron/primus_megatron_module.yaml:36

The comment text on line 35-36 spans multiple physical lines in the YAML file. Consider breaking this into separate comment lines (each starting with #) for better readability and to avoid potential YAML parsing issues in some parsers, e.g.:

# TraceLens report format: xlsx (default; single parse, fastest), csv, or all
# (xlsx+csv; parses each trace twice so ~2x processing time; use only when both formats are needed)

# TraceLens report format: xlsx (default; single parse, fastest), csv, or all (xlsx+csv;
# parses each trace twice so ~2x processing time; use only when both formats are needed)

primus/modules/trainer/megatron/trainer.py:2228

The all_gather operation at line 2228 introduces a synchronization point that occurs on every log_interval when mlflow_upload_performance_metrics is enabled. This could impact training performance, especially with frequent logging. The config documentation mentions this (line 66), but consider also adding a code comment here explaining the performance trade-off and why all ranks must participate regardless of whether they have mlflow_writer.

                    dist.all_gather(gathered_utils, util_tensor)


Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.


Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (1)

primus/backends/megatron/training/mlflow_artifacts.py:24

The module-level MLflow artifact structure example shows trace files like rank_0_step_2.json.gz, but _get_all_trace_files() only searches for *.pt.trace.json / *.pt.trace.json.gz. Either the example is outdated/misleading, or trace discovery is currently too strict and would miss valid traces with the rank_0_step_2.json.gz naming.

Please align the docstring with the actual supported trace filename patterns, or expand the discovery patterns (with tests) if non-.pt.trace.json* traces are expected in practice.

MLflow Artifact Structure:
    artifacts/
    ├── traces/              # PyTorch profiler trace files
    │   ├── rank_0_step_2.json.gz
    │   └── ...
    ├── logs/                # Training log files
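The review comment above says trace discovery only matches `*.pt.trace.json` and `*.pt.trace.json.gz`. A minimal sketch of pattern-restricted discovery follows; `find_trace_files` and `TRACE_PATTERNS` are hypothetical names, not the PR's `_get_all_trace_files()`.

```python
from pathlib import Path
from typing import List

# The two patterns named in the review comment; anything else is ignored.
TRACE_PATTERNS = ("*.pt.trace.json", "*.pt.trace.json.gz")


def find_trace_files(root: str) -> List[str]:
    """Recursively collect profiler traces whose names match TRACE_PATTERNS."""
    found = set()
    for pattern in TRACE_PATTERNS:
        found.update(str(p) for p in Path(root).rglob(pattern) if p.is_file())
    return sorted(found)
```

Under these patterns a file named `rank_0_step_2.json.gz`, as in the docstring example, would not be discovered, which is the mismatch the comment points out.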

Copilot AI review requested due to automatic review settings December 18, 2025 09:07

Copilot started reviewing on behalf of gphuang December 18, 2025 09:08

github-code-quality bot found potential problems Dec 18, 2025

Copilot AI reviewed Dec 18, 2025

gphuang added 2 commits December 18, 2025 09:23

docs: Fix TraceLens CSV format description (multiple files per rank)

bbfa9d3

fix: Remove unsupported HTML format option from TraceLens

0759122

Copilot AI review requested due to automatic review settings December 18, 2025 09:25

Copilot started reviewing on behalf of gphuang December 18, 2025 09:26

github-code-quality bot found potential problems Dec 18, 2025

fix: Use specific trace file patterns to avoid matching unrelated JSON files

4c908e5

github-code-quality bot found potential problems Dec 18, 2025

Copilot AI reviewed Dec 18, 2025

gphuang requested a review from Copilot December 18, 2025 10:10

Copilot started reviewing on behalf of gphuang December 18, 2025 10:11

Copilot AI reviewed Dec 18, 2025

docs: Clarify MLflow upload defaults are opt-out when MLflow enabled

2861bdf

Copilot AI review requested due to automatic review settings December 18, 2025 10:31

Copilot started reviewing on behalf of gphuang December 18, 2025 10:31

gphuang force-pushed the feat/12-enable-tracelens-analysis branch from df2e40a to 2861bdf December 18, 2025 10:33

Copilot AI reviewed Dec 18, 2025

gphuang added 2 commits December 18, 2025 11:26

Copilot AI review requested due to automatic review settings December 18, 2025 12:39

Copilot started reviewing on behalf of gphuang December 18, 2025 12:40

github-code-quality bot found potential problems Dec 18, 2025

Copilot AI reviewed Dec 18, 2025

gphuang added 2 commits December 18, 2025 12:56

Copilot AI review requested due to automatic review settings December 18, 2025 13:08

Clarify MLflow writer rank in TraceLens upload

6198960

Update docs to state the last rank (writer) performs TraceLens artifact uploads in distributed runs.

Co-authored-by: Cursor <cursoragent@cursor.com>

Copilot AI reviewed Feb 11, 2026


gphuang and others added 2 commits February 11, 2026 09:45

Fix TraceLens output_format docstring default

c05c743

Align mlflow_setup.py docstring with the actual default of 'xlsx'.

Co-authored-by: Cursor <cursoragent@cursor.com>

Improve TraceLens install diagnostics and metrics safety.

f86556a

Co-authored-by: Cursor <cursoragent@cursor.com>

Copilot AI review requested due to automatic review settings February 11, 2026 09:52

Copilot started reviewing on behalf of gphuang February 11, 2026 09:53

github-code-quality bot found potential problems Feb 11, 2026

Copilot AI reviewed Feb 11, 2026

gphuang and others added 2 commits February 11, 2026 13:18

Remove unused openpyxl install result.

8891434

Co-authored-by: Cursor <cursoragent@cursor.com>

Harden TraceLens install checks and ROCm parsing.

3a1c3c5

Co-authored-by: Cursor <cursoragent@cursor.com>

Copilot AI review requested due to automatic review settings February 11, 2026 13:23

Copilot started reviewing on behalf of gphuang February 11, 2026 13:24

Copilot AI reviewed Feb 11, 2026

Handle TraceLens SHA verification.

195f28b

Co-authored-by: Cursor <cursoragent@cursor.com>

Merge branch 'main' into feat/12-enable-tracelens-analysis

74fccb9

Copilot AI review requested due to automatic review settings February 20, 2026 07:54

Copilot started reviewing on behalf of gphuang February 20, 2026 07:55

Copilot AI reviewed Feb 20, 2026

Add TraceLens normalization coverage and import.

0b4b491

Co-authored-by: Cursor <cursoragent@cursor.com>

github-code-quality bot found potential problems Feb 20, 2026

Remove redundant re import.

d2c88d7

Co-authored-by: Cursor <cursoragent@cursor.com>

Copilot AI review requested due to automatic review settings February 23, 2026 07:47

Copilot started reviewing on behalf of gphuang February 23, 2026 07:48

Copilot AI reviewed Feb 23, 2026

gphuang and others added 2 commits February 26, 2026 14:15

Merge branch 'main' into feat/12-enable-tracelens-analysis

eca1afe

Gate openpyxl installs and test mlflow setup.

88eda50

Made-with: Cursor

Copilot AI review requested due to automatic review settings February 26, 2026 12:21

Copilot started reviewing on behalf of gphuang February 26, 2026 12:22

Copilot AI reviewed Feb 26, 2026