Fix trust_remote_code and torchrun path for custom models #78
Maxusmusti merged 9 commits into main
Conversation
- Add trust_remote_code=True to all AutoConfig/AutoTokenizer.from_pretrained() calls
- Add trust_remote_code=True to base_model_args
- Add torchrun path resolution (shutil.which with sys.executable fallback)

This fixes OSFT training failures for models like Nemotron that use remote code.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
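The first two bullets follow one pattern: forward a single flag into every Hugging Face loader call. A minimal sketch of that pattern, assuming a small wrapper (the helper names here are illustrative, not the PR's actual code; only the `transformers` API names are real):

```python
# Thread trust_remote_code into the Hugging Face loader calls.
# build_pretrained_kwargs/load_config are illustrative helpers.
def build_pretrained_kwargs(trust_remote_code: bool = False) -> dict:
    """Shared kwargs for AutoConfig/AutoTokenizer.from_pretrained()."""
    # Models with custom modeling code (e.g. Nemotron) fail to load
    # without trust_remote_code=True; keeping the default False makes
    # executing Hub code an explicit opt-in.
    return {"trust_remote_code": trust_remote_code}


def load_config(model_path: str, trust_remote_code: bool = False):
    from transformers import AutoConfig  # real API; deferred import for the sketch

    return AutoConfig.from_pretrained(
        model_path, **build_pretrained_kwargs(trust_remote_code)
    )
```

The same kwargs dict would be spread into the AutoTokenizer and model `from_pretrained` calls so all three stay in sync.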
Note: Reviews paused — this branch is under active development, so CodeRabbit automatically paused its review to avoid an influx of comments on new commits.

Pre-merge checks: 3 passed.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/mini_trainer/api_train.py`:
- Around lines 106-113: tests asserting that command[0] equals the literal "torchrun" need to accept a resolved path, because api_train.py now sets torchrun_path to either shutil.which("torchrun") or Path(sys.executable).parent / "torchrun". Relax the expectation in tests/test_api_train.py: compare os.path.basename(command[0]) == "torchrun", or assert that command[0] is either "torchrun" or a path whose basename is "torchrun", so both PATH-based and sys.executable-based resolution pass.
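The relaxed assertion suggested above can be factored into a tiny predicate; `command[0]` here stands in for whatever api_train.py constructs:

```python
import os


def is_torchrun(arg0: str) -> bool:
    """Accept both a bare 'torchrun' and a resolved absolute path to it."""
    # basename() handles the shutil.which() case (full path) and the
    # plain-string case identically.
    return os.path.basename(arg0) == "torchrun"
```

In the test this becomes `assert is_torchrun(command[0])`, which passes under both resolution strategies.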
In `@src/mini_trainer/setup_model_for_training.py`:
- Around lines 691-693: add a user-controlled trust_remote_code boolean (default False) and thread it through the CLI/API into the training/setup flow instead of hard-coding True. Update the call sites that currently call AutoConfig.from_pretrained, the AutoModel*.from_pretrained model load, and AutoTokenizer.from_pretrained to pass the new flag, and use the same flag when saving/loading model artifacts in train.py (the save-path logic around the save call). Wire the CLI/API option into the existing TODO in src/mini_trainer/utils.py (expose and parse the flag there), propagate it into setup_model_for_training (to the AutoConfig, model, and tokenizer loads), and keep the default False so only explicit opt-in enables trust_remote_code.
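A minimal sketch of the CLI-to-setup threading the comment asks for, using argparse. The argument and function names are illustrative, not mini_trainer's actual API:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-name-or-path", required=True)
    # store_true defaults to False, so remote code stays explicit opt-in
    parser.add_argument("--trust-remote-code", action="store_true")
    return parser


def setup_model(model_name_or_path: str, trust_remote_code: bool = False) -> dict:
    # In the real flow this flag would be forwarded to the
    # AutoConfig/AutoModel/AutoTokenizer .from_pretrained(...) calls.
    return {"model": model_name_or_path, "trust_remote_code": trust_remote_code}
```

The key property is that omitting the flag yields False at every layer, so no call site ever trusts Hub code by accident.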
📒 Files selected for processing (3)

- src/mini_trainer/api_train.py
- src/mini_trainer/setup_model_for_training.py
- src/mini_trainer/train.py
Address reviewer feedback and CI failures:

- Make trust_remote_code an explicit opt-in flag (default False) instead of hardcoding True, threading it through TrainingArgs, CLI, setup_model, get_model_save_dtype, train, and save_model. This addresses the security concern raised in review about executing arbitrary Hub code by default.
- Fix test_run_training_command_construction to compare the Path basename instead of the literal "torchrun", since shutil.which() returns a full path.
- Fix ruff format violations in setup_model_for_training.py and train.py (collapse multi-line calls that fit on a single line).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
If shutil.which('torchrun') returns None, that indicates a broken
PyTorch installation — raise RuntimeError instead of silently
constructing a potentially nonexistent path.
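The fail-loudly strategy described in this commit can be sketched as follows; the `which` parameter is injectable purely so the sketch is testable, where the real code would call shutil.which("torchrun") directly:

```python
import shutil


def resolve_torchrun(which=shutil.which) -> str:
    """Resolve the torchrun binary from PATH, failing loudly if absent."""
    path = which("torchrun")
    if path is None:
        # A working PyTorch install always ships torchrun; its absence
        # signals a broken environment, so raise rather than guess a path.
        raise RuntimeError("torchrun not found on PATH; check your PyTorch install")
    return path
```

This replaces the earlier silent fallback to Path(sys.executable).parent / "torchrun", which could produce a path that does not exist.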
NemotronH has Mamba layers like GraniteMoeHybrid and needs the same hub kernel cache pre-population to avoid causal_conv1d_cuda import failures in torchrun subprocesses.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/mini_trainer/api_train.py`:
- Around lines 6-8: the file imports shutil, subprocess, and sys, but sys is unused. Delete the standalone `import sys` (or drop sys from the import list) so only shutil and subprocess remain; no other changes are needed.
⛔ Files ignored due to path filters (1)

- uv.lock is excluded by !**/*.lock
📒 Files selected for processing (7)

- .gitignore
- src/mini_trainer/api_train.py
- src/mini_trainer/setup_model_for_training.py
- src/mini_trainer/train.py
- src/mini_trainer/training_types.py
- src/mini_trainer/utils.py
- tests/test_api_train.py
✅ Files skipped from review due to trivial changes (2)
- .gitignore
- src/mini_trainer/utils.py
The torchrun-not-found issue was caused by the venv not being activated in the test shell, not an actual installation problem. Remove shutil.which() and the error fallback, revert to the original simple 'torchrun' command string.
Ministral-3-3B ships with FP8 quantized weights that include scalar parameters (weight_scale_inv, activation_scale) which FSDP rejects. This change dequantizes FP8 weights to bf16 after VLM extraction for training compatibility, preserves the original scales, and requantizes back to FP8 at checkpoint save time so saved checkpoints match the original FP8 format. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
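The dequantize/requantize round trip described in this commit reduces to simple scale arithmetic. A sketch on plain floats, assuming the `weight_scale_inv` convention stated in the code (the real implementation operates on torch tensors and the float8_e4m3fn dtype; these helper names are illustrative):

```python
def dequantize(fp8_value: float, weight_scale_inv: float) -> float:
    """Recover the real-valued weight: real_weight = fp8_weight * weight_scale_inv."""
    return fp8_value * weight_scale_inv


def requantize(real_value: float, weight_scale_inv: float) -> float:
    """Invert dequantize: fp8_weight = real_weight / weight_scale_inv
    (the real code then casts the result to float8_e4m3fn)."""
    return real_value / weight_scale_inv
```

Because the original scales are preserved rather than recomputed, an unmodified weight round-trips exactly back to its original FP8 representation at save time.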
Actionable comments posted: 1
📒 Files selected for processing (3)

- src/mini_trainer/setup_model_for_training.py
- src/mini_trainer/train.py
- src/mini_trainer/vlm_utils.py
```python
def requantize_fp8_state_dict(
    state_dict: dict[str, torch.Tensor],
    fp8_scales: dict[str, dict[str, torch.Tensor]],
) -> dict[str, torch.Tensor]:
    """Re-quantize a dequantized state dict back to FP8 for checkpoint saving.

    This is the inverse of :func:`_dequantize_fp8_model`. It converts
    bfloat16 weights back to ``float8_e4m3fn`` and restores the original
    ``weight_scale_inv`` and ``activation_scale`` entries so the saved
    checkpoint matches the original FP8 format.

    Args:
        state_dict: The model state dict with bfloat16 weights.
        fp8_scales: The ``_fp8_scales`` dict stored by
            :func:`_dequantize_fp8_model`, mapping module paths to their
            original scale tensors.

    Returns:
        A new state dict with FP8 weights and restored scale entries.
    """
    out = {}
    for key, tensor in state_dict.items():
        out[key] = tensor

    for mod_path, scales in fp8_scales.items():
        weight_key = f"{mod_path}.weight"
        if weight_key not in out:
            continue

        weight = out[weight_key]

        # Re-quantize: fp8_weight = real_weight / weight_scale_inv
        if "weight_scale_inv" in scales:
            scale_inv = scales["weight_scale_inv"]
            requantized = (weight.to(torch.float32) / scale_inv.to(torch.float32)).to(
                torch.float8_e4m3fn
            )
            out[weight_key] = requantized
            out[f"{mod_path}.weight_scale_inv"] = scale_inv

        # Restore activation_scale as-is
        if "activation_scale" in scales:
            out[f"{mod_path}.activation_scale"] = scales["activation_scale"]

    return out
```
🧩 Analysis chain
🌐 Web query:
What is the representable range of torch.float8_e4m3fn dtype and how does PyTorch handle overflow when casting to this dtype?
💡 Result:
The representable range of torch.float8_e4m3fn (E4M3FN format: 1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits, finite-only, no infinities) is approximately from 1.5e-2 (smallest normal) to 448 (largest finite), symmetric around zero (i.e., -448 to 448). Subnormals extend down to around 6e-3. Exact values can be queried in PyTorch via torch.finfo(torch.float8_e4m3fn), which provides max, min, tiny, etc. When casting to this dtype (e.g., tensor.to(torch.float8_e4m3fn)), values outside the representable range result in NaN due to exponent overflow, as this format lacks infinity representation (all-inf exponent patterns encode NaN). Clamping to max finite value (448) is not the default behavior; a feature request exists for a clamped cast mode.
Citations:
- 1: pytorch/pytorch#154028 — Add clamped FP8 (E4M3) cast for overflow-safe inference
- 2: https://discuss.pytorch.org/t/why-tensor-to-convert-fp32-to-fp8-e4m3-nan-if-overflow/212648
- 3: https://github.com/IntelLabs/FP8-Emulation-Toolkit
- 4: https://onnx.ai/onnx/_sources/technical/float8.md.txt
- 5: pytorch/pytorch#170526 — Fix isinf for Float8 types without infinity representation
Verify requantization precision with FP8 range constraints.
The requantization formula divides the weight by scale_inv to recover the original FP8 representation. However, casting back to torch.float8_e4m3fn (line 269-271) may cause overflow for values outside the FP8 representable range (approximately ±448 for e4m3fn). When overflow occurs, PyTorch produces NaN values rather than infinity or saturation, since the e4m3fn format lacks an infinity representation.
If the dequantized weights were modified during training (e.g., fine-tuning), the new values might exceed the original FP8 range, resulting in NaN. This is likely acceptable since this is designed for checkpointing models that were originally FP8, but worth noting.
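If this ever needs hardening, one standard mitigation is to saturate to the e4m3fn finite range before the cast, so out-of-range weights clamp instead of becoming NaN. A sketch on plain floats (with torch, the bound is torch.finfo(torch.float8_e4m3fn).max, i.e. 448.0, and the clamp would be tensor.clamp_(-448.0, 448.0) before `.to(torch.float8_e4m3fn)`); this is a suggested safeguard, not part of the PR:

```python
E4M3FN_MAX = 448.0  # largest finite e4m3fn value; the format has no infinities


def clamp_for_e4m3fn(value: float) -> float:
    """Saturate to the e4m3fn range instead of overflowing to NaN on cast."""
    return max(-E4M3FN_MAX, min(E4M3FN_MAX, value))
```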
Force-pushed afc97ae to d75c5d4.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed d75c5d4 to c54e4a8.
Summary
Fixes model loading and torchrun execution issues for models with custom code (Nemotron, Ministral, Qwen3.5, etc.).
This is part of the v0.7.0 model validation effort.
Changes
1. Add trust_remote_code=True to all AutoConfig and AutoTokenizer calls

Files modified: src/mini_trainer/setup_model_for_training.py, src/mini_trainer/train.py

Why: Models like Nemotron, Ministral, and Qwen3.5 have custom modeling code that requires trust_remote_code=True. Without this flag, config loading fails with a KeyError or architecture detection errors.

2. Fix torchrun path resolution in api_train.py

What: Use shutil.which("torchrun") with a fallback to the sys.executable directory.

Why: Subprocess calls couldn't find torchrun in PATH. Using shutil.which() provides robust path resolution.
Testing
Validated with models:
Related PRs