Fix trust_remote_code and torchrun path for custom models #78
Maxusmusti merged 9 commits into main
Conversation
- Add trust_remote_code=True to all AutoConfig/AutoTokenizer.from_pretrained() calls
- Add trust_remote_code=True to base_model_args
- Add torchrun path resolution (shutil.which with sys.executable fallback)

This fixes OSFT training failures for models like Nemotron that use remote code.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
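The first two bullets follow one pattern: forward a single flag into every Hugging Face loader call. A minimal sketch of that pattern, assuming a small wrapper (the helper names here are illustrative, not the PR's actual code; only the `transformers` API names are real):

```python
# Thread trust_remote_code into the Hugging Face loader calls.
# build_pretrained_kwargs/load_config are illustrative helpers.
def build_pretrained_kwargs(trust_remote_code: bool = False) -> dict:
    """Shared kwargs for AutoConfig/AutoTokenizer.from_pretrained()."""
    # Models with custom modeling code (e.g. Nemotron) fail to load
    # without trust_remote_code=True; keeping the default False makes
    # executing Hub code an explicit opt-in.
    return {"trust_remote_code": trust_remote_code}


def load_config(model_path: str, trust_remote_code: bool = False):
    from transformers import AutoConfig  # real API; deferred import for the sketch

    return AutoConfig.from_pretrained(
        model_path, **build_pretrained_kwargs(trust_remote_code)
    )
```

The same kwargs dict would be spread into the AutoTokenizer and model `from_pretrained` calls so all three stay in sync.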
Note: Reviews paused — this branch is under active development, so CodeRabbit automatically paused its review to avoid an influx of comments on new commits.

Pre-merge checks: 3 passed.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/mini_trainer/api_train.py`:
- Around lines 106-113: tests asserting that command[0] equals the literal "torchrun" need to accept a resolved path, because api_train.py now sets torchrun_path to either shutil.which("torchrun") or Path(sys.executable).parent / "torchrun". Relax the expectation in tests/test_api_train.py: compare os.path.basename(command[0]) == "torchrun", or assert that command[0] is either "torchrun" or a path whose basename is "torchrun", so both PATH-based and sys.executable-based resolution pass.
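The relaxed assertion suggested above can be factored into a tiny predicate; `command[0]` here stands in for whatever api_train.py constructs:

```python
import os


def is_torchrun(arg0: str) -> bool:
    """Accept both a bare 'torchrun' and a resolved absolute path to it."""
    # basename() handles the shutil.which() case (full path) and the
    # plain-string case identically.
    return os.path.basename(arg0) == "torchrun"
```

In the test this becomes `assert is_torchrun(command[0])`, which passes under both resolution strategies.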
In `@src/mini_trainer/setup_model_for_training.py`:
- Around lines 691-693: add a user-controlled trust_remote_code boolean (default False) and thread it through the CLI/API into the training/setup flow instead of hard-coding True. Update the call sites that currently call AutoConfig.from_pretrained, the AutoModel*.from_pretrained model load, and AutoTokenizer.from_pretrained to pass the new flag, and use the same flag when saving/loading model artifacts in train.py (the save-path logic around the save call). Wire the CLI/API option into the existing TODO in src/mini_trainer/utils.py (expose and parse the flag there), propagate it into setup_model_for_training (to the AutoConfig, model, and tokenizer loads), and keep the default False so only explicit opt-in enables trust_remote_code.
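A minimal sketch of the CLI-to-setup threading the comment asks for, using argparse. The argument and function names are illustrative, not mini_trainer's actual API:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-name-or-path", required=True)
    # store_true defaults to False, so remote code stays explicit opt-in
    parser.add_argument("--trust-remote-code", action="store_true")
    return parser


def setup_model(model_name_or_path: str, trust_remote_code: bool = False) -> dict:
    # In the real flow this flag would be forwarded to the
    # AutoConfig/AutoModel/AutoTokenizer .from_pretrained(...) calls.
    return {"model": model_name_or_path, "trust_remote_code": trust_remote_code}
```

The key property is that omitting the flag yields False at every layer, so no call site ever trusts Hub code by accident.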
📒 Files selected for processing (3)

- src/mini_trainer/api_train.py
- src/mini_trainer/setup_model_for_training.py
- src/mini_trainer/train.py
Address reviewer feedback and CI failures:

- Make trust_remote_code an explicit opt-in flag (default False) instead of hardcoding True, threading it through TrainingArgs, CLI, setup_model, get_model_save_dtype, train, and save_model. This addresses the security concern raised in review about executing arbitrary Hub code by default.
- Fix test_run_training_command_construction to compare the Path basename instead of the literal "torchrun", since shutil.which() returns a full path.
- Fix ruff format violations in setup_model_for_training.py and train.py (collapse multi-line calls that fit on a single line).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
If shutil.which('torchrun') returns None, that indicates a broken
PyTorch installation — raise RuntimeError instead of silently
constructing a potentially nonexistent path.
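The fail-loudly strategy described in this commit can be sketched as follows; the `which` parameter is injectable purely so the sketch is testable, where the real code would call shutil.which("torchrun") directly:

```python
import shutil


def resolve_torchrun(which=shutil.which) -> str:
    """Resolve the torchrun binary from PATH, failing loudly if absent."""
    path = which("torchrun")
    if path is None:
        # A working PyTorch install always ships torchrun; its absence
        # signals a broken environment, so raise rather than guess a path.
        raise RuntimeError("torchrun not found on PATH; check your PyTorch install")
    return path
```

This replaces the earlier silent fallback to Path(sys.executable).parent / "torchrun", which could produce a path that does not exist.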
NemotronH has Mamba layers like GraniteMoeHybrid and needs the same hub kernel cache pre-population to avoid causal_conv1d_cuda import failures in torchrun subprocesses.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/mini_trainer/api_train.py`:
- Around lines 6-8: the file imports shutil, subprocess, and sys, but sys is unused. Delete the standalone `import sys` (or drop sys from the import list) so only shutil and subprocess remain; no other changes are needed.
⛔ Files ignored due to path filters (1)

- uv.lock is excluded by !**/*.lock
📒 Files selected for processing (7)

- .gitignore
- src/mini_trainer/api_train.py
- src/mini_trainer/setup_model_for_training.py
- src/mini_trainer/train.py
- src/mini_trainer/training_types.py
- src/mini_trainer/utils.py
- tests/test_api_train.py
✅ Files skipped from review due to trivial changes (2)
- .gitignore
- src/mini_trainer/utils.py
The torchrun-not-found issue was caused by the venv not being activated in the test shell, not an actual installation problem. Remove shutil.which() and the error fallback, revert to the original simple 'torchrun' command string.
Ministral-3-3B ships with FP8 quantized weights that include scalar parameters (weight_scale_inv, activation_scale) which FSDP rejects. This change dequantizes FP8 weights to bf16 after VLM extraction for training compatibility, preserves the original scales, and requantizes back to FP8 at checkpoint save time so saved checkpoints match the original FP8 format. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
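The dequantize/requantize round trip described in this commit reduces to simple scale arithmetic. A sketch on plain floats, assuming the `weight_scale_inv` convention stated in the code (the real implementation operates on torch tensors and the float8_e4m3fn dtype; these helper names are illustrative):

```python
def dequantize(fp8_value: float, weight_scale_inv: float) -> float:
    """Recover the real-valued weight: real_weight = fp8_weight * weight_scale_inv."""
    return fp8_value * weight_scale_inv


def requantize(real_value: float, weight_scale_inv: float) -> float:
    """Invert dequantize: fp8_weight = real_weight / weight_scale_inv
    (the real code then casts the result to float8_e4m3fn)."""
    return real_value / weight_scale_inv
```

Because the original scales are preserved rather than recomputed, an unmodified weight round-trips exactly back to its original FP8 representation at save time.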
Actionable comments posted: 1
📒 Files selected for processing (3)

- src/mini_trainer/setup_model_for_training.py
- src/mini_trainer/train.py
- src/mini_trainer/vlm_utils.py
```python
def requantize_fp8_state_dict(
    state_dict: dict[str, torch.Tensor],
    fp8_scales: dict[str, dict[str, torch.Tensor]],
) -> dict[str, torch.Tensor]:
    """Re-quantize a dequantized state dict back to FP8 for checkpoint saving.

    This is the inverse of :func:`_dequantize_fp8_model`. It converts
    bfloat16 weights back to ``float8_e4m3fn`` and restores the original
    ``weight_scale_inv`` and ``activation_scale`` entries so the saved
    checkpoint matches the original FP8 format.

    Args:
        state_dict: The model state dict with bfloat16 weights.
        fp8_scales: The ``_fp8_scales`` dict stored by
            :func:`_dequantize_fp8_model`, mapping module paths to their
            original scale tensors.

    Returns:
        A new state dict with FP8 weights and restored scale entries.
    """
    out = {}
    for key, tensor in state_dict.items():
        out[key] = tensor

    for mod_path, scales in fp8_scales.items():
        weight_key = f"{mod_path}.weight"
        if weight_key not in out:
            continue

        weight = out[weight_key]

        # Re-quantize: fp8_weight = real_weight / weight_scale_inv
        if "weight_scale_inv" in scales:
            scale_inv = scales["weight_scale_inv"]
            requantized = (weight.to(torch.float32) / scale_inv.to(torch.float32)).to(
                torch.float8_e4m3fn
            )
            out[weight_key] = requantized
            out[f"{mod_path}.weight_scale_inv"] = scale_inv

        # Restore activation_scale as-is
        if "activation_scale" in scales:
            out[f"{mod_path}.activation_scale"] = scales["activation_scale"]

    return out
```
🧩 Analysis chain
🌐 Web query:
What is the representable range of torch.float8_e4m3fn dtype and how does PyTorch handle overflow when casting to this dtype?
💡 Result:
The representable range of torch.float8_e4m3fn (E4M3FN format: 1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits, finite-only, no infinities) is approximately from 1.5e-2 (smallest normal) to 448 (largest finite), symmetric around zero (i.e., -448 to 448). Subnormals extend down to around 6e-3. Exact values can be queried in PyTorch via torch.finfo(torch.float8_e4m3fn), which provides max, min, tiny, etc. When casting to this dtype (e.g., tensor.to(torch.float8_e4m3fn)), values outside the representable range result in NaN due to exponent overflow, as this format lacks infinity representation (all-inf exponent patterns encode NaN). Clamping to max finite value (448) is not the default behavior; a feature request exists for a clamped cast mode.
Citations:
- 1: pytorch/pytorch#154028 — Add clamped FP8 (E4M3) cast for overflow-safe inference
- 2: https://discuss.pytorch.org/t/why-tensor-to-convert-fp32-to-fp8-e4m3-nan-if-overflow/212648
- 3: https://github.com/IntelLabs/FP8-Emulation-Toolkit
- 4: https://onnx.ai/onnx/_sources/technical/float8.md.txt
- 5: pytorch/pytorch#170526 — Fix isinf for Float8 types without infinity representation
Verify requantization precision with FP8 range constraints.
The requantization formula divides the weight by scale_inv to recover the original FP8 representation. However, casting back to torch.float8_e4m3fn (line 269-271) may cause overflow for values outside the FP8 representable range (approximately ±448 for e4m3fn). When overflow occurs, PyTorch produces NaN values rather than infinity or saturation, since the e4m3fn format lacks an infinity representation.
If the dequantized weights were modified during training (e.g., fine-tuning), the new values might exceed the original FP8 range, resulting in NaN. This is likely acceptable since this is designed for checkpointing models that were originally FP8, but worth noting.
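If this ever needs hardening, one standard mitigation is to saturate to the e4m3fn finite range before the cast, so out-of-range weights clamp instead of becoming NaN. A sketch on plain floats (with torch, the bound is torch.finfo(torch.float8_e4m3fn).max, i.e. 448.0, and the clamp would be tensor.clamp_(-448.0, 448.0) before `.to(torch.float8_e4m3fn)`); this is a suggested safeguard, not part of the PR:

```python
E4M3FN_MAX = 448.0  # largest finite e4m3fn value; the format has no infinities


def clamp_for_e4m3fn(value: float) -> float:
    """Saturate to the e4m3fn range instead of overflowing to NaN on cast."""
    return max(-E4M3FN_MAX, min(E4M3FN_MAX, value))
```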
Force-pushed afc97ae to d75c5d4.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed d75c5d4 to c54e4a8.
Summary
Fixes model loading and torchrun execution issues for models with custom code (Nemotron, Ministral, Qwen3.5, etc.).
This is part of the v0.7.0 model validation effort.
Changes
1. Add trust_remote_code=True to all AutoConfig and AutoTokenizer calls

Files modified: src/mini_trainer/setup_model_for_training.py, src/mini_trainer/train.py

Why: Models like Nemotron, Ministral, and Qwen3.5 have custom modeling code that requires trust_remote_code=True. Without this flag, config loading fails with a KeyError or architecture detection errors.

2. Fix torchrun path resolution in api_train.py

What: Use shutil.which("torchrun") with a fallback to the sys.executable directory.

Why: Subprocess calls couldn't find torchrun in PATH. Using shutil.which() provides robust path resolution.
Testing
Validated with models:
Related PRs