
Fix trust_remote_code and torchrun path for custom models #78

Merged
Maxusmusti merged 9 commits into main from fix/v0.7.0-model-validation
Mar 31, 2026
Conversation

Collaborator

@RobotSail RobotSail commented Mar 27, 2026

Summary

Fixes model loading and torchrun execution issues for models with custom code (Nemotron, Ministral, Qwen3.5, etc.).

This is part of the v0.7.0 model validation effort.

Changes

1. Add trust_remote_code=True to all AutoConfig and AutoTokenizer calls

Files Modified:

  • src/mini_trainer/setup_model_for_training.py
  • src/mini_trainer/train.py

Why: Models like Nemotron, Ministral, and Qwen3.5 have custom modeling code that requires trust_remote_code=True. Without this flag, config loading fails with KeyError or architecture detection errors.

Locations:

# setup_model_for_training.py
base_model_args = {
    "pretrained_model_name_or_path": model_name_or_path,
    "torch_dtype": train_dtype,
    "trust_remote_code": True,  # Added
}

original_config = AutoConfig.from_pretrained(
    model_name_or_path, trust_remote_code=True  # Added
)

tokenizer = AutoTokenizer.from_pretrained(
    model_name_or_path, trust_remote_code=True  # Added
)

# train.py
tokenizer = AutoTokenizer.from_pretrained(
    model_name_or_path, trust_remote_code=True  # Added
)

2. Fix torchrun path resolution in api_train.py

What: Use shutil.which("torchrun") with fallback to sys.executable directory

Why: Subprocess calls couldn't find torchrun in PATH. Using shutil.which() provides robust path resolution.

import shutil
import sys
from pathlib import Path

torchrun_path = shutil.which("torchrun")
if not torchrun_path:
    torchrun_path = str(Path(sys.executable).parent / "torchrun")

Testing

Validated with models:

  • ✅ Nemotron OSFT - passes with trust_remote_code
  • ✅ Ministral SFT/OSFT - passes with trust_remote_code
  • ✅ Granite MoE OSFT - passes (loss: 11.5440)

Related PRs

Summary by CodeRabbit

  • New Features

    • Added a --trust-remote-code option and training/config flag to allow loading models/tokenizers that require custom repository code.
    • Extended supported model types for kernel prefetch/patch triggers.
  • Bug Fixes

    • Preserved and restored FP8 quantization metadata across distributed init, export, and checkpoint save/load.
    • Ensured tokenizers/configs respect the trust-remote-code setting during export.
  • Chores

    • Ignore local compiled cache directory in version control.

- Add trust_remote_code=True to all AutoConfig/AutoTokenizer.from_pretrained() calls
- Add trust_remote_code=True to base_model_args
- Add torchrun path resolution (shutil.which with sys.executable fallback)

This fixes OSFT training failures for models like Nemotron that use remote code.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

coderabbitai bot commented Mar 27, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Threads a new boolean trust_remote_code flag through training CLI, model/tokenizer/config loading, subprocess invocation, and checkpoint export; adds FP8 dequantize/requantize utilities and preserves FP8 metadata during distributed model setup and extraction.

Changes

Cohort / File(s): Summary

  • Training Configuration (src/mini_trainer/training_types.py): Added trust_remote_code: bool = False to TrainingArgs.
  • Model Setup & Config (src/mini_trainer/setup_model_for_training.py): Added a trust_remote_code parameter to get_model_save_dtype() and setup_model(); passed the flag into AutoConfig.from_pretrained() and AutoTokenizer.from_pretrained(); derived it from base_model_args; preserved FP8 metadata across broadcast/recreation; broadened the kernel prefetch condition to include nemotron_h.
  • Training Execution & CLI (src/mini_trainer/train.py): Added trust_remote_code to save_model() and train(), forwarded to checkpoint saves; added a --trust-remote-code CLI option; tokenizer loading during export honours the flag; export now requantizes FP8 state dicts and strips internal _fp8_* keys from the saved config.
  • API Training (src/mini_trainer/api_train.py): run_training() conditionally appends --trust-remote-code to the torchrun subprocess command when train_args.trust_remote_code is true.
  • VLM / FP8 Utilities (src/mini_trainer/vlm_utils.py): Added _dequantize_fp8_model() to dequantize FP8 layers in place and persist _fp8_scales/_fp8_quantization_config; added requantize_fp8_state_dict() to rebuild FP8-compatible state dicts; updated extract_causal_lm_from_vlm() to propagate quantization_config and dequantize the extracted causal LM.
  • Utils / Config class lookup (src/mini_trainer/utils.py): Replaced a TODO with a clarifying comment; continues to call AutoConfig.from_pretrained(..., trust_remote_code=True) unchanged.
  • Version Control (.gitignore): Added unsloth_compiled_cache/ to the ignore list.
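The opt-in flag threading summarized above can be sketched with stdlib argparse. This is a minimal sketch only; the project's actual CLI layer, option names, and wiring may differ:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Sketch of an opt-in --trust-remote-code flag (default False)."""
    parser = argparse.ArgumentParser()
    # store_true keeps the default False, so arbitrary Hub code only runs
    # when the user explicitly opts in on the command line.
    parser.add_argument("--trust-remote-code", dest="trust_remote_code",
                        action="store_true", default=False)
    return parser

args = build_parser().parse_args(["--trust-remote-code"])
print(args.trust_remote_code)  # True only when explicitly passed
```

The same boolean would then be forwarded unchanged into the model/tokenizer loads and the torchrun subprocess command.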


Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hopped through flags and config light,

threaded trust into the night,
FP8 scales I set aright,
saved the state and kept it tight,
a carrot patch, the model’s bright.

🚥 Pre-merge checks: ✅ 3 passed

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title accurately describes the main changes: adding trust_remote_code support for custom models and improving torchrun path resolution.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 85.71%, which meets the required threshold of 80.00%.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/mini_trainer/api_train.py`:
- Around lines 106-113: Tests asserting that command[0] equals the literal
"torchrun" need to accept a resolved path, because api_train.py now sets
torchrun_path to either shutil.which("torchrun") or
Path(sys.executable).parent / "torchrun". Update the test in
tests/test_api_train.py to relax the expectation, e.g. by asserting
os.path.basename(command[0]) == "torchrun", so that both PATH-based and
sys.executable-based resolution pass.
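The relaxed assertion the comment asks for can be sketched as follows (helper name and structure assumed; the real test lives in tests/test_api_train.py):

```python
import os

def assert_is_torchrun(cmd0: str) -> None:
    # Accept both a bare "torchrun" and a fully resolved path such as
    # /opt/venv/bin/torchrun, since shutil.which() returns the latter.
    assert os.path.basename(cmd0) == "torchrun"

assert_is_torchrun("torchrun")
assert_is_torchrun("/opt/venv/bin/torchrun")
```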

In `@src/mini_trainer/setup_model_for_training.py`:
- Around lines 691-693: Add a user-controlled trust_remote_code boolean
(default False) and thread it through the CLI/API into the training/setup
flow instead of hard-coding True. Update the call sites that currently call
AutoConfig.from_pretrained, AutoModel...from_pretrained (the model loading),
and AutoTokenizer.from_pretrained to pass the new flag, and use the same flag
when saving/loading model artifacts in train.py. Wire the CLI/API option into
the existing TODO in src/mini_trainer/utils.py, propagate it into
setup_model_for_training (the AutoConfig, model, and tokenizer loads), and
keep the default False so that only explicit opt-in enables trust_remote_code.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 56d2ce76-acd9-4c71-b481-bbe0bb6a69e1

📥 Commits

Reviewing files that changed from the base of the PR and between 5730d5a and 7379470.

📒 Files selected for processing (3)
  • src/mini_trainer/api_train.py
  • src/mini_trainer/setup_model_for_training.py
  • src/mini_trainer/train.py

Address reviewer feedback and CI failures:

- Make trust_remote_code an explicit opt-in flag (default False) instead of
  hardcoding True, threading it through TrainingArgs, CLI, setup_model,
  get_model_save_dtype, train, and save_model. This addresses the security
  concern raised in review about executing arbitrary Hub code by default.

- Fix test_run_training_command_construction to compare Path basename instead
  of literal "torchrun", since shutil.which() returns a full path.

- Fix ruff format violations in setup_model_for_training.py and train.py
  (collapse single-line-fitting calls from multi-line).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

codecov bot commented Mar 27, 2026

Codecov Report

❌ Patch coverage is 15.84158% with 85 lines in your changes missing coverage. Please review.

Files with missing lines (patch coverage / lines missing):

  • src/mini_trainer/vlm_utils.py: 3.17% / 61 missing ⚠️
  • src/mini_trainer/train.py: 31.57% / 13 missing ⚠️
  • src/mini_trainer/setup_model_for_training.py: 37.50% / 10 missing ⚠️
  • src/mini_trainer/api_train.py: 50.00% / 1 missing ⚠️


claude added 4 commits March 27, 2026 23:26
If shutil.which('torchrun') returns None, that indicates a broken
PyTorch installation — raise RuntimeError instead of silently
constructing a potentially nonexistent path.
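A minimal sketch of the fail-fast resolution this commit describes (function name and error wording assumed):

```python
import shutil

def resolve_torchrun() -> str:
    """Return the torchrun executable path, failing fast when it is absent."""
    path = shutil.which("torchrun")
    if path is None:
        # A missing torchrun indicates a broken PyTorch installation; raising
        # is clearer than constructing a path that may not exist.
        raise RuntimeError("torchrun not found on PATH")
    return path
```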
NemotronH has Mamba layers like GraniteMoeHybrid and needs the same
hub kernel cache pre-population to avoid causal_conv1d_cuda import
failures in torchrun subprocesses.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/mini_trainer/api_train.py`:
- Around lines 6-8: The file imports shutil, subprocess, and sys, but sys is
unused. Delete the standalone "import sys" so that only shutil and subprocess
remain; no other code changes are necessary.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f56d93b5-5bf5-4428-ac41-1008ad293ceb

📥 Commits

Reviewing files that changed from the base of the PR and between 7379470 and 772610d.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (7)
  • .gitignore
  • src/mini_trainer/api_train.py
  • src/mini_trainer/setup_model_for_training.py
  • src/mini_trainer/train.py
  • src/mini_trainer/training_types.py
  • src/mini_trainer/utils.py
  • tests/test_api_train.py
✅ Files skipped from review due to trivial changes (2)
  • .gitignore
  • src/mini_trainer/utils.py

claude and others added 2 commits March 28, 2026 00:36
The torchrun-not-found issue was caused by the venv not being
activated in the test shell, not an actual installation problem.
Remove shutil.which() and the error fallback, revert to the original
simple 'torchrun' command string.
Ministral-3-3B ships with FP8 quantized weights that include scalar
parameters (weight_scale_inv, activation_scale) which FSDP rejects.
This change dequantizes FP8 weights to bf16 after VLM extraction for
training compatibility, preserves the original scales, and requantizes
back to FP8 at checkpoint save time so saved checkpoints match the
original FP8 format.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
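The round-trip invariant this commit relies on can be shown with a torch-free sketch (scalar stand-ins for tensors; the real code multiplies and divides whole weight matrices by weight_scale_inv):

```python
# Scalar stand-ins, assuming per-tensor scaling as described above.
fp8_weight = 0.5      # stored FP8 value (stand-in)
scale_inv = 3.0       # weight_scale_inv preserved at dequantize time

real_weight = fp8_weight * scale_inv   # dequantize to a trainable dtype
requantized = real_weight / scale_inv  # requantize at checkpoint save

# An untouched weight round-trips exactly back to its FP8 value.
assert requantized == fp8_weight
```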

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 324ec763-4e8f-4cdb-bcf6-55cfb0dbe3b0

📥 Commits

Reviewing files that changed from the base of the PR and between bbf2764 and 63d40ff.

📒 Files selected for processing (3)
  • src/mini_trainer/setup_model_for_training.py
  • src/mini_trainer/train.py
  • src/mini_trainer/vlm_utils.py

Comment on lines +235 to +279
def requantize_fp8_state_dict(
    state_dict: dict[str, torch.Tensor],
    fp8_scales: dict[str, dict[str, torch.Tensor]],
) -> dict[str, torch.Tensor]:
    """Re-quantize a dequantized state dict back to FP8 for checkpoint saving.

    This is the inverse of :func:`_dequantize_fp8_model`. It converts
    bfloat16 weights back to ``float8_e4m3fn`` and restores the original
    ``weight_scale_inv`` and ``activation_scale`` entries so the saved
    checkpoint matches the original FP8 format.

    Args:
        state_dict: The model state dict with bfloat16 weights.
        fp8_scales: The ``_fp8_scales`` dict stored by
            :func:`_dequantize_fp8_model`, mapping module paths to their
            original scale tensors.

    Returns:
        A new state dict with FP8 weights and restored scale entries.
    """
    out = {}
    for key, tensor in state_dict.items():
        out[key] = tensor

    for mod_path, scales in fp8_scales.items():
        weight_key = f"{mod_path}.weight"
        if weight_key not in out:
            continue

        weight = out[weight_key]

        # Re-quantize: fp8_weight = real_weight / weight_scale_inv
        if "weight_scale_inv" in scales:
            scale_inv = scales["weight_scale_inv"]
            requantized = (weight.to(torch.float32) / scale_inv.to(torch.float32)).to(
                torch.float8_e4m3fn
            )
            out[weight_key] = requantized
            out[f"{mod_path}.weight_scale_inv"] = scale_inv

        # Restore activation_scale as-is
        if "activation_scale" in scales:
            out[f"{mod_path}.activation_scale"] = scales["activation_scale"]

    return out

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🌐 Web query:

What is the representable range of torch.float8_e4m3fn dtype and how does PyTorch handle overflow when casting to this dtype?

💡 Result:

The representable range of torch.float8_e4m3fn (E4M3FN format: 1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits, finite-only, no infinities) is approximately from 1.5e-2 (smallest normal) to 448 (largest finite), symmetric around zero (i.e., -448 to 448). Subnormals extend down to around 6e-3. Exact values can be queried in PyTorch via torch.finfo(torch.float8_e4m3fn), which provides max, min, tiny, etc. When casting to this dtype (e.g., tensor.to(torch.float8_e4m3fn)), values outside the representable range result in NaN due to exponent overflow, as this format lacks infinity representation (all-inf exponent patterns encode NaN). Clamping to max finite value (448) is not the default behavior; a feature request exists for a clamped cast mode.



Verify requantization precision with FP8 range constraints.

The requantization formula divides the weight by scale_inv to recover the original FP8 representation. However, casting back to torch.float8_e4m3fn (lines 269-271) may overflow for values outside the FP8 representable range (approximately ±448 for e4m3fn). When overflow occurs, PyTorch produces NaN values rather than infinity or saturation, since the e4m3fn format lacks an infinity representation.

If the dequantized weights were modified during training (e.g., fine-tuning), the new values might exceed the original FP8 range, resulting in NaN. This is likely acceptable since this is designed for checkpointing models that were originally FP8, but worth noting.
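The ±448 limit cited above follows directly from the E4M3FN bit layout; a quick derivation in pure Python (no torch required):

```python
# E4M3FN: 1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits, no infinity.
# The all-ones exponent still encodes finite values (only mantissa 0b111 is
# NaN), so the largest finite value has exponent 0b1111 and mantissa 0b110.
EXP_BITS, MANT_BITS, BIAS = 4, 3, 7

max_unbiased_exp = (1 << EXP_BITS) - 1 - BIAS                 # 15 - 7 = 8
max_mantissa = 1 + ((1 << MANT_BITS) - 2) / (1 << MANT_BITS)  # 1 + 6/8 = 1.75
max_finite = 2.0 ** max_unbiased_exp * max_mantissa           # 256 * 1.75
print(max_finite)  # 448.0
```

This matches torch.finfo(torch.float8_e4m3fn).max, and any magnitude above it casts to NaN rather than saturating.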

@Maxusmusti Maxusmusti force-pushed the fix/v0.7.0-model-validation branch from afc97ae to d75c5d4 Compare March 31, 2026 14:47
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Maxusmusti Maxusmusti force-pushed the fix/v0.7.0-model-validation branch from d75c5d4 to c54e4a8 Compare March 31, 2026 14:56
Contributor

@Maxusmusti Maxusmusti left a comment


LGTM

@Maxusmusti Maxusmusti merged commit f5d63c2 into main Mar 31, 2026
10 of 11 checks passed

3 participants