feat(conversion): support distributed adapter export by cuichenx · Pull Request #4221 · NVIDIA-NeMo/Megatron-Bridge

cuichenx · 2026-06-09T02:04:13Z

Summary

add dtype control for Hugging Face adapter export
support excluding Megatron adapter base prefixes before HF mapping lookup, used for MTP adapter exclusions
add a distributed GPU path for exporting large PEFT checkpoints with matching TP/PP/EP/ETP settings
document the CPU and distributed GPU adapter export workflows

Blast Radius / Test Assessment

Touches shared conversion and PEFT adapter export APIs, so this is broader than a model-only example change.
L0/unit coverage is needed for adapter export and LoRA task construction.
L1/L2 functional coverage is not added in this PR; a manual large-checkpoint adapter export is the appropriate higher-level validation.

Validation

uv run --no-sync pre-commit run --files examples/conversion/adapter/export_adapter.py src/megatron/bridge/models/conversion/auto_bridge.py src/megatron/bridge/models/conversion/model_bridge.py src/megatron/bridge/models/conversion/peft_bridge.py tests/unit_tests/models/test_adapter_export.py tests/unit_tests/models/test_model_bridge_lora.py examples/conversion/adapter/README.md
uv run --no-sync python -m py_compile examples/conversion/adapter/export_adapter.py src/megatron/bridge/models/conversion/auto_bridge.py src/megatron/bridge/models/conversion/model_bridge.py src/megatron/bridge/models/conversion/peft_bridge.py tests/unit_tests/models/test_adapter_export.py tests/unit_tests/models/test_model_bridge_lora.py

Focused pytest was attempted with uv run python -m pytest tests/unit_tests/models/test_adapter_export.py tests/unit_tests/models/test_model_bridge_lora.py -q, but local dependency resolution failed before test collection because nvidia-resiliency-ext==0.6.0 has no compatible wheel for this host platform.

claude · 2026-06-09T02:07:41Z

Light Code Review

Findings

Bug: unhandled StopIteration in distributed export -- export_adapter.py:193 uses a bare next(generator) to find the model key in the loaded state dict. If no key starts with model, this raises StopIteration instead of a clear error. See inline comment for a suggested fix.
Missing error handling in _load_lora_config -- export_adapter.py:129-131 calls read_run_config without try/except, unlike the equivalent code in auto_bridge.py:export_adapter_ckpt which gracefully falls back to defaults on corrupt YAML. A malformed run_config.yaml will crash the distributed export path with an unhelpful traceback.
Test coverage gap: distributed export path -- _export_adapter_distributed, _load_lora_config, _configure_cuda_device, and _uses_distributed_export in export_adapter.py have no unit tests. The library-side changes in auto_bridge.py and peft_bridge.py are well-covered by the new tests in test_adapter_export.py and test_model_bridge_lora.py.
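
The first finding can be addressed by giving next() a default and raising an explicit error. The following is a minimal sketch, assuming the loaded state dict is a plain mapping whose model entries have keys starting with "model". The helper name find_model_key and its signature are illustrative, not the code in the PR.

```python
from pathlib import Path
from typing import Any, Mapping


def find_model_key(loaded_sd: Mapping[str, Any], ckpt_path: Path) -> str:
    """Return the first key that starts with "model", or raise a descriptive error."""
    # next() with a default avoids a bare StopIteration escaping to the caller.
    model_key = next((k for k in loaded_sd if k.startswith("model")), None)
    if model_key is None:
        raise RuntimeError(
            f"No key starting with 'model' in checkpoint loaded from {ckpt_path}; "
            f"found keys: {sorted(loaded_sd)}"
        )
    return model_key
```

Listing the keys that were found makes a mismatched or truncated checkpoint much easier to diagnose than a StopIteration traceback.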

Suggested test cases

No perf tests impacted.

Signed-off-by: Chen Cui <chcui@nvidia.com>

cuichenx · 2026-06-09T22:27:23Z

/claude review

claude · 2026-06-09T22:31:10Z

+
+def _export_adapter_distributed(args: argparse.Namespace) -> None:
+    device = _configure_cuda_device()
+    ckpt_path = Path(args.lora_checkpoint).expanduser().resolve()


Bug (minor): The CPU path in export_adapter_ckpt (auto_bridge.py:1433-1434) validates that the checkpoint path exists before doing any work:

if not ckpt_path.exists():
    raise FileNotFoundError(f"PEFT checkpoint not found: {ckpt_path}")

This distributed path skips that check, so a typo'd path would surface as a cryptic dist_checkpointing.load error instead of a clear FileNotFoundError.

Suggested change

-def _export_adapter_distributed(args: argparse.Namespace) -> None:
-    device = _configure_cuda_device()
-    ckpt_path = Path(args.lora_checkpoint).expanduser().resolve()
+def _export_adapter_distributed(args: argparse.Namespace) -> None:
+    device = _configure_cuda_device()
+    ckpt_path = Path(args.lora_checkpoint).expanduser().resolve()
+    if not ckpt_path.exists():
+        raise FileNotFoundError(f"PEFT checkpoint not found: {ckpt_path}")
+    config = AutoConfig.from_pretrained(args.hf_model_path, trust_remote_code=args.trust_remote_code)

claude · 2026-06-09T22:31:18Z

+
+        sharded_state_dict = _generate_model_state_dict(model, {})
+        sharded_state_dict = apply_peft_adapter_filter_to_state_dict(sharded_state_dict, lora)
+        loaded_sd = dist_checkpointing.load(sharded_state_dict, str(ckpt_path))
+        model_key = _get_loaded_model_key(loaded_sd, ckpt_path)
+        model[0].load_state_dict(loaded_sd[model_key], strict=False)


Bug: When PP > 1 with virtual pipeline parallelism, provide_distributed_model returns multiple model chunks. _get_loaded_model_key finds only the first key (e.g. "model0"), and model[0].load_state_dict(...) only loads into chunk 0 — subsequent chunks stay uninitialized.

If PP > 1 export isn't expected to be used yet, consider adding a guard:

Suggested change

-        sharded_state_dict = _generate_model_state_dict(model, {})
-        sharded_state_dict = apply_peft_adapter_filter_to_state_dict(sharded_state_dict, lora)
-        loaded_sd = dist_checkpointing.load(sharded_state_dict, str(ckpt_path))
-        model_key = _get_loaded_model_key(loaded_sd, ckpt_path)
-        model[0].load_state_dict(loaded_sd[model_key], strict=False)
+        sharded_state_dict = _generate_model_state_dict(model, {})
+        sharded_state_dict = apply_peft_adapter_filter_to_state_dict(sharded_state_dict, lora)
+        loaded_sd = dist_checkpointing.load(sharded_state_dict, str(ckpt_path))
+        model_key = _get_loaded_model_key(loaded_sd, ckpt_path)
+        for chunk in model:
+            chunk.load_state_dict(loaded_sd[model_key], strict=False)

Or if only one chunk is valid, assert len(model) == 1 to catch misuse early.
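
If each chunk is saved under its own key, then each chunk also has to be loaded from its own entry rather than from a single shared model_key. The following is a minimal sketch of per-chunk loading, assuming the Megatron-style convention that a single chunk is stored under "model" and multiple chunks under "model0", "model1", and so on. FakeChunk, chunk_key, and load_all_chunks are hypothetical stand-ins, not names from the PR.

```python
from typing import Any, Dict, List


class FakeChunk:
    """Stand-in for one model chunk; records the state dict it was given."""

    def __init__(self) -> None:
        self.loaded: Dict[str, Any] = {}

    def load_state_dict(self, state_dict: Dict[str, Any], strict: bool = True) -> None:
        self.loaded = dict(state_dict)


def chunk_key(index: int, num_chunks: int) -> str:
    # Assumed naming: "model" for a single chunk, "model<i>" otherwise.
    return "model" if num_chunks == 1 else f"model{index}"


def load_all_chunks(model: List[FakeChunk], loaded_sd: Dict[str, Dict[str, Any]]) -> None:
    """Load each chunk from its own entry, failing loudly if one is missing."""
    for i, chunk in enumerate(model):
        key = chunk_key(i, len(model))
        if key not in loaded_sd:
            raise KeyError(f"Checkpoint has no entry {key!r} for model chunk {i}")
        chunk.load_state_dict(loaded_sd[key], strict=False)


chunks = [FakeChunk(), FakeChunk()]
load_all_chunks(chunks, {"model0": {"a": 1}, "model1": {"b": 2}})
print([c.loaded for c in chunks])  # prints [{'a': 1}, {'b': 2}]
```

Raising on a missing per-chunk entry matters here because strict=False would otherwise hide a chunk that received no weights.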

claude · 2026-06-09T22:33:00Z

test

claude · 2026-06-09T22:34:09Z

Code Review

Findings

1. Missing checkpoint path validation in distributed export (export_adapter.py:180-183)

The CPU path in export_adapter_ckpt (auto_bridge.py:1433) validates ckpt_path.exists() and raises a clear FileNotFoundError. The new _export_adapter_distributed skips this check, so a bad path would surface as a cryptic dist_checkpointing.load error. See inline comment with suggestion.

2. model[0].load_state_dict may be incomplete for PP > 1 (export_adapter.py:212)

When provide_distributed_model returns multiple chunks (virtual pipeline parallelism), only chunk 0 gets its weights loaded. The remaining chunks stay uninitialized. If PP > 1 is not expected to work yet, add assert len(model) == 1 to catch misuse; otherwise iterate over all chunks. See inline comment.

3. Duplicated dtype alias map

_DTYPE_ALIASES in export_adapter.py and _ADAPTER_EXPORT_DTYPE_ALIASES in auto_bridge.py are identical dicts with identical normalization logic. Consider importing _normalize_adapter_export_dtype from auto_bridge in the script to avoid the two copies drifting apart.
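
One way to keep the two copies from drifting is a single normalizer that both the library and the script's argparse type hook call. The following is a minimal sketch under assumptions: the alias table is illustrative (the PR's actual aliases are not shown in this thread), the names normalize_dtype and parse_dtype_arg are hypothetical, and canonical names are returned as strings rather than torch.dtype objects so the example runs without PyTorch.

```python
import argparse

# Illustrative alias table; the real aliases in the PR may differ.
DTYPE_ALIASES = {
    "bf16": "bfloat16",
    "bfloat16": "bfloat16",
    "fp16": "float16",
    "half": "float16",
    "float16": "float16",
    "fp32": "float32",
    "float": "float32",
    "float32": "float32",
}


def normalize_dtype(name: str) -> str:
    """Map a user-supplied dtype alias to its canonical name."""
    key = name.strip().lower()
    try:
        return DTYPE_ALIASES[key]
    except KeyError:
        raise ValueError(
            f"Unsupported dtype {name!r}; expected one of {sorted(DTYPE_ALIASES)}"
        ) from None


def parse_dtype_arg(value: str) -> str:
    """argparse type hook: reuse the shared normalizer and translate the error type."""
    try:
        return normalize_dtype(value)
    except ValueError as exc:
        raise argparse.ArgumentTypeError(str(exc)) from exc


parser = argparse.ArgumentParser()
parser.add_argument("--dtype", type=parse_dtype_arg, default="bfloat16")
print(parser.parse_args(["--dtype", "BF16"]).dtype)  # prints bfloat16
```

With this split, the script only owns the translation to ArgumentTypeError, and the alias table lives in one place.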

Test coverage gaps

No test for _load_lora_config returning VLMLoRA when _target_ contains "VLMLoRA"
No test for _parse_dtype with an invalid dtype string (verifying the ArgumentTypeError)
No test for _export_adapter_distributed with PP > 1 (related to finding 2)
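
The first gap could be covered by factoring the class choice into a small pure function and testing it directly. The following is a minimal sketch, assuming the run config is a nested mapping with a peft section whose _target_ string names the adapter class. The function select_adapter_kind, the peft key, and the example _target_ strings are hypothetical.

```python
from typing import Any, Mapping


def select_adapter_kind(run_config: Mapping[str, Any]) -> str:
    """Return "VLMLoRA" if the config's _target_ mentions it, else "LoRA"."""
    peft_cfg = run_config.get("peft") or {}
    target = str(peft_cfg.get("_target_", ""))
    return "VLMLoRA" if "VLMLoRA" in target else "LoRA"


def test_selects_vlm_lora() -> None:
    assert select_adapter_kind({"peft": {"_target_": "some.module.VLMLoRA"}}) == "VLMLoRA"


def test_defaults_to_lora() -> None:
    # A plain LoRA target and a config with no peft section both fall back to LoRA.
    assert select_adapter_kind({"peft": {"_target_": "some.module.LoRA"}}) == "LoRA"
    assert select_adapter_kind({}) == "LoRA"


test_selects_vlm_lora()
test_defaults_to_lora()
```

Keeping the selection free of file I/O means the test needs no YAML fixture or checkpoint directory.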

Suggested test cases

No perf tests impacted.

Signed-off-by: Chen Cui <chcui@nvidia.com>

yaoyu-33 added the area:peft, feature, and waiting-on-customer labels Jun 9, 2026

cuichenx added 3 commits June 9, 2026 14:44

feat(conversion): support distributed adapter export

1ea007a

Signed-off-by: Chen Cui <chcui@nvidia.com>

[conversion] fix: harden adapter export errors

398706e

Signed-off-by: Chen Cui <chcui@nvidia.com>

[conversion] test: update adapter export mocks

121ef0c

Signed-off-by: Chen Cui <chcui@nvidia.com>

cuichenx force-pushed the chcui/adapter-export-gpu branch from 50e8cc2 to 121ef0c June 9, 2026 21:46

test(conversion): harden distributed adapter export

730d3f3

Signed-off-by: Chen Cui <chcui@nvidia.com>

cuichenx wants to merge 4 commits into main from chcui/adapter-export-gpu
