feat(conversion): support distributed adapter export #4221

Open
cuichenx wants to merge 4 commits into main from chcui/adapter-export-gpu
cuichenx (Contributor) commented Jun 9, 2026

Summary

  • add dtype control for Hugging Face adapter export
  • support excluding Megatron adapter base prefixes before HF mapping lookup, used for MTP adapter exclusions
  • add a distributed GPU path for exporting large PEFT checkpoints with matching TP/PP/EP/ETP settings
  • document the CPU and distributed GPU adapter export workflows
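
The prefix exclusion in the second bullet can be pictured as a filter applied to adapter base names before any Hugging Face mapping lookup. The following is a minimal sketch, assuming base names are plain strings; `filter_excluded_prefixes` and the example parameter names are hypothetical and are not taken from this PR's code.

```python
from typing import Iterable, List


def filter_excluded_prefixes(base_names: Iterable[str], excluded_prefixes: Iterable[str]) -> List[str]:
    """Drop adapter base names that start with any excluded prefix (hypothetical helper)."""
    prefixes = tuple(excluded_prefixes)
    # str.startswith accepts a tuple; an empty tuple never matches, so nothing is dropped.
    return [name for name in base_names if not name.startswith(prefixes)]


# Illustrative names only; real Megatron parameter names may differ.
names = [
    "decoder.layers.0.self_attention.linear_qkv",
    "mtp.layers.0.self_attention.linear_qkv",
]
kept = filter_excluded_prefixes(names, ["mtp."])
```

Filtering before the lookup means an excluded module never needs an HF mapping entry, which appears to be the point of the MTP exclusion described above.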

Blast Radius / Test Assessment

  • Touches shared conversion and PEFT adapter export APIs, so this is broader than a model-only example change.
  • L0/unit coverage is needed for adapter export and LoRA task construction.
  • L1/L2 functional coverage is not added in this PR; a manual large-checkpoint adapter export is the appropriate higher-level validation.

Validation

  • uv run --no-sync pre-commit run --files examples/conversion/adapter/export_adapter.py src/megatron/bridge/models/conversion/auto_bridge.py src/megatron/bridge/models/conversion/model_bridge.py src/megatron/bridge/models/conversion/peft_bridge.py tests/unit_tests/models/test_adapter_export.py tests/unit_tests/models/test_model_bridge_lora.py examples/conversion/adapter/README.md
  • uv run --no-sync python -m py_compile examples/conversion/adapter/export_adapter.py src/megatron/bridge/models/conversion/auto_bridge.py src/megatron/bridge/models/conversion/model_bridge.py src/megatron/bridge/models/conversion/peft_bridge.py tests/unit_tests/models/test_adapter_export.py tests/unit_tests/models/test_model_bridge_lora.py

Focused pytest was attempted with uv run python -m pytest tests/unit_tests/models/test_adapter_export.py tests/unit_tests/models/test_model_bridge_lora.py -q, but local dependency resolution failed before test collection because nvidia-resiliency-ext==0.6.0 has no compatible wheel for this host platform.

Comment thread examples/conversion/adapter/export_adapter.py Outdated
Comment thread examples/conversion/adapter/export_adapter.py Outdated

claude (Bot, Contributor) commented Jun 9, 2026

Light Code Review

Findings

  1. Bug: unhandled StopIteration in distributed export -- export_adapter.py:193 uses a bare next(generator) to find the model key in the loaded state dict. If no key starts with model, this raises StopIteration instead of a clear error. See inline comment for a suggested fix.

  2. Missing error handling in _load_lora_config -- export_adapter.py:129-131 calls read_run_config without try/except, unlike the equivalent code in auto_bridge.py:export_adapter_ckpt which gracefully falls back to defaults on corrupt YAML. A malformed run_config.yaml will crash the distributed export path with an unhelpful traceback.

  3. Test coverage gap: distributed export path -- _export_adapter_distributed, _load_lora_config, _configure_cuda_device, and _uses_distributed_export in export_adapter.py have no unit tests. The library-side changes in auto_bridge.py and peft_bridge.py are well-covered by the new tests in test_adapter_export.py and test_model_bridge_lora.py.
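
Findings 1 and 2 both come down to replacing an unhelpful failure with a descriptive one or a safe fallback. The sketch below shows one possible shape for each, under stated assumptions: `find_model_key` and `load_config_or_default` are hypothetical names, the state dict is treated as a plain mapping with string keys, and `reader` stands in for whatever function actually parses `run_config.yaml`.

```python
from pathlib import Path
from typing import Any, Callable, Mapping


def find_model_key(loaded_sd: Mapping[str, Any], ckpt_path: Path) -> str:
    """Return the first key starting with "model", or raise a descriptive error."""
    # next(..., None) avoids the bare StopIteration raised by next(generator).
    key = next((k for k in loaded_sd if k.startswith("model")), None)
    if key is None:
        raise KeyError(
            f"No key starting with 'model' in checkpoint {ckpt_path}; found keys: {sorted(loaded_sd)}"
        )
    return key


def load_config_or_default(reader: Callable[[Path], Any], path: Path, default: Any) -> Any:
    """Call reader(path); fall back to `default` if the file is missing or malformed."""
    try:
        return reader(path)
    except Exception as exc:  # broad on purpose: the reader's error types are unknown here
        print(f"Could not read {path} ({exc!r}); using defaults")
        return default
```

Whether a silent fallback is appropriate for the distributed path is a design decision for the author; raising a wrapped error with the file path would be an equally reasonable choice.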


Suggested test cases

No perf tests impacted.

yaoyu-33 added labels on Jun 9, 2026:

  • area:peft: Parameter-efficient fine-tuning (LoRA, adapters)
  • feature: New capabilities, enhancements, or enablement work
  • waiting-on-customer: Waiting on the original author to respond
cuichenx added 3 commits June 9, 2026 14:44
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>

cuichenx (Contributor, Author) commented Jun 9, 2026

/claude review

Comment on lines +180 to +183

def _export_adapter_distributed(args: argparse.Namespace) -> None:
    device = _configure_cuda_device()
    ckpt_path = Path(args.lora_checkpoint).expanduser().resolve()


Bug (minor): The CPU path in export_adapter_ckpt (auto_bridge.py:1433-1434) validates that the checkpoint path exists before doing any work:

if not ckpt_path.exists():
    raise FileNotFoundError(f"PEFT checkpoint not found: {ckpt_path}")

This distributed path skips that check, so a typo'd path would surface as a cryptic dist_checkpointing.load error instead of a clear FileNotFoundError.

Suggested change
- def _export_adapter_distributed(args: argparse.Namespace) -> None:
-     device = _configure_cuda_device()
-     ckpt_path = Path(args.lora_checkpoint).expanduser().resolve()
+ def _export_adapter_distributed(args: argparse.Namespace) -> None:
+     device = _configure_cuda_device()
+     ckpt_path = Path(args.lora_checkpoint).expanduser().resolve()
+     if not ckpt_path.exists():
+         raise FileNotFoundError(f"PEFT checkpoint not found: {ckpt_path}")
+     config = AutoConfig.from_pretrained(args.hf_model_path, trust_remote_code=args.trust_remote_code)

Comment on lines +207 to +212

sharded_state_dict = _generate_model_state_dict(model, {})
sharded_state_dict = apply_peft_adapter_filter_to_state_dict(sharded_state_dict, lora)
loaded_sd = dist_checkpointing.load(sharded_state_dict, str(ckpt_path))
model_key = _get_loaded_model_key(loaded_sd, ckpt_path)
model[0].load_state_dict(loaded_sd[model_key], strict=False)


Bug: When PP > 1 with virtual pipeline parallelism, provide_distributed_model returns multiple model chunks. _get_loaded_model_key finds only the first key (e.g. "model0"), and model[0].load_state_dict(...) only loads into chunk 0 — subsequent chunks stay uninitialized.

If PP > 1 export isn't expected to be used yet, consider adding a guard:

Suggested change
- sharded_state_dict = _generate_model_state_dict(model, {})
- sharded_state_dict = apply_peft_adapter_filter_to_state_dict(sharded_state_dict, lora)
- loaded_sd = dist_checkpointing.load(sharded_state_dict, str(ckpt_path))
- model_key = _get_loaded_model_key(loaded_sd, ckpt_path)
- model[0].load_state_dict(loaded_sd[model_key], strict=False)
+ sharded_state_dict = _generate_model_state_dict(model, {})
+ sharded_state_dict = apply_peft_adapter_filter_to_state_dict(sharded_state_dict, lora)
+ loaded_sd = dist_checkpointing.load(sharded_state_dict, str(ckpt_path))
+ model_key = _get_loaded_model_key(loaded_sd, ckpt_path)
+ for chunk in model:
+     chunk.load_state_dict(loaded_sd[model_key], strict=False)

Or if only one chunk is valid, assert len(model) == 1 to catch misuse early.
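
One caveat when iterating over chunks: reusing a single `loaded_sd[model_key]` for every chunk would give each chunk the same weights, so a per-chunk key is probably needed. The sketch below illustrates that idea under an assumption: that the loaded state dict holds one entry per chunk named `"model0"`, `"model1"`, and so on, and a single `"model"` entry when there is only one chunk. That naming is inferred from the `"model0"` example above and is not confirmed by this thread. `FakeChunk`, `chunk_key`, and `load_all_chunks` are hypothetical stand-ins.

```python
from typing import Any, Dict, List, Mapping


class FakeChunk:
    """Stand-in for a model chunk; records what load_state_dict receives."""

    def __init__(self) -> None:
        self.loaded: Dict[str, Any] = {}

    def load_state_dict(self, state_dict: Mapping[str, Any], strict: bool = True) -> None:
        self.loaded = dict(state_dict)


def chunk_key(index: int, num_chunks: int) -> str:
    # Assumed naming: "model" for one chunk, "model<i>" for several.
    return "model" if num_chunks == 1 else f"model{index}"


def load_all_chunks(model: List[Any], loaded_sd: Mapping[str, Any]) -> None:
    """Load each chunk from its own entry and fail loudly if an entry is missing."""
    for i, chunk in enumerate(model):
        key = chunk_key(i, len(model))
        if key not in loaded_sd:
            raise KeyError(f"Expected key {key!r} in loaded state dict; found {sorted(loaded_sd)}")
        chunk.load_state_dict(loaded_sd[key], strict=False)


chunks = [FakeChunk(), FakeChunk()]
load_all_chunks(chunks, {"model0": {"a": 1}, "model1": {"b": 2}})
```

If multi-chunk export is out of scope for now, the `assert len(model) == 1` guard remains the simpler option.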

claude (Bot, Contributor) commented Jun 9, 2026

test

claude (Bot, Contributor) commented Jun 9, 2026

Code Review

Findings

1. Missing checkpoint path validation in distributed export (export_adapter.py:180-183)

The CPU path in export_adapter_ckpt (auto_bridge.py:1433) validates ckpt_path.exists() and raises a clear FileNotFoundError. The new _export_adapter_distributed skips this check, so a bad path would surface as a cryptic dist_checkpointing.load error. See inline comment with suggestion.

2. model[0].load_state_dict may be incomplete for PP > 1 (export_adapter.py:212)

When provide_distributed_model returns multiple chunks (virtual pipeline parallelism), only chunk 0 gets its weights loaded. The remaining chunks stay uninitialized. If PP > 1 is not expected to work yet, add assert len(model) == 1 to catch misuse; otherwise iterate over all chunks. See inline comment.

3. Duplicated dtype alias map

_DTYPE_ALIASES in export_adapter.py and _ADAPTER_EXPORT_DTYPE_ALIASES in auto_bridge.py are identical dicts with identical normalization logic. Consider importing _normalize_adapter_export_dtype from auto_bridge in the script to avoid the two copies drifting apart.
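
A single source of truth for dtype names could look roughly like the sketch below, where the CLI parser is a thin wrapper over one shared normalizer. The alias table, `normalize_dtype_name`, and `parse_dtype_arg` are hypothetical; the actual entries and signatures of `_ADAPTER_EXPORT_DTYPE_ALIASES` and `_normalize_adapter_export_dtype` are not shown in this thread, and the sketch returns canonical strings rather than `torch.dtype` objects to stay dependency-free.

```python
import argparse

# Hypothetical alias table; the PR's real entries are not shown in this thread.
DTYPE_ALIASES = {
    "bf16": "bfloat16",
    "bfloat16": "bfloat16",
    "fp16": "float16",
    "half": "float16",
    "float16": "float16",
    "fp32": "float32",
    "float": "float32",
    "float32": "float32",
}


def normalize_dtype_name(value: str) -> str:
    """Map a user-supplied dtype string to a canonical name, or raise ValueError."""
    key = value.strip().lower()
    if key.startswith("torch."):
        key = key[len("torch."):]
    try:
        return DTYPE_ALIASES[key]
    except KeyError:
        raise ValueError(f"Unsupported dtype {value!r}; choose from {sorted(DTYPE_ALIASES)}") from None


def parse_dtype_arg(value: str) -> str:
    """argparse `type=` callable that reuses the shared normalizer."""
    try:
        return normalize_dtype_name(value)
    except ValueError as exc:
        raise argparse.ArgumentTypeError(str(exc)) from exc
```

With this layout the script would pass `type=parse_dtype_arg` to `add_argument`, and only the library module would own the alias table.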

Test coverage gaps

  • No test for _load_lora_config returning VLMLoRA when _target_ contains "VLMLoRA"
  • No test for _parse_dtype with an invalid dtype string (verifying the ArgumentTypeError)
  • No test for _export_adapter_distributed with PP > 1 (related to finding 2)
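
The first gap could be closed with a small pytest-style test along the lines of the sketch below. It assumes the selection rule is a substring check on `_target_`, as the bullet describes; `LoRA` and `VLMLoRA` here are empty stand-in classes, `select_lora_class` is a hypothetical extraction of that rule, and the `some.module` paths are placeholders rather than real import paths.

```python
from typing import Any, Mapping


class LoRA:  # stand-in, not the library class
    pass


class VLMLoRA(LoRA):  # stand-in, not the library class
    pass


def select_lora_class(peft_cfg: Mapping[str, Any]) -> type:
    """Pick VLMLoRA when `_target_` mentions it, otherwise plain LoRA."""
    target = str(peft_cfg.get("_target_", ""))
    return VLMLoRA if "VLMLoRA" in target else LoRA


def test_selects_vlm_lora() -> None:
    assert select_lora_class({"_target_": "some.module.VLMLoRA"}) is VLMLoRA


def test_defaults_to_lora() -> None:
    assert select_lora_class({"_target_": "some.module.LoRA"}) is LoRA
    # A config without `_target_` falls back to plain LoRA in this sketch.
    assert select_lora_class({}) is LoRA
```

Keeping the selection rule in a pure function like this makes it testable without constructing a model or reading a checkpoint.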

Suggested test cases

No perf tests impacted.

Signed-off-by: Chen Cui <chcui@nvidia.com>